Unscalable image backup process

A couple of days ago I stumbled across the UrBackup project, and it looked like a pretty neat tool for local image backups. But my enthusiasm quickly cooled down: after an extended series of tests, UrBackup seems unable to handle anything more than a couple of workstations. It simply maxes out the server CPU.

After a quick dig into some of its source code (as far as I understand it, not being familiar with the codebase), it looks like the server computes tons and tons of hashes when making full image backups, even in RAW transfer mode. Wouldn’t it be far more scalable if these hashes were computed by the clients?

With modern processors the hashing shouldn’t be the problem; most often it is I/O-limited. What kind of processor do you use? There may also be options to optimize the hashing.

It needs to compute the hashes on the server to check the client hashes.

Thanks for the reply.

Well, in my opinion that assumption is not correct. First of all, most people won’t use a top-class, power-hungry CPU for a simple backup server. Looking at current average or low-end CPUs, maximum SHA-256 speed is at most a couple of hundred MB per second, which is very significant compared to typical combined I/O speeds (network, disk, etc.) that are in the same range. If hashing speed were 20x the combined I/O speed it would indeed be insignificant, but it is not.
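
To put a number on it for your own box, here is a quick single-threaded throughput check (a minimal sketch using plain Python hashlib, nothing UrBackup-specific); it hashes in-memory data only, so it measures the CPU cost and not I/O:

```python
import hashlib
import time

# Hash 1 GiB of in-memory data in 4 MiB chunks, single-threaded, so only
# the CPU cost of SHA-256 is measured (no disk or network involved).
CHUNK = b"\x00" * (4 * 1024 * 1024)
TOTAL = 1024 ** 3

h = hashlib.sha256()
start = time.perf_counter()
done = 0
while done < TOTAL:
    h.update(CHUNK)
    done += len(CHUNK)
elapsed = time.perf_counter() - start

print(f"SHA-256 throughput: {done / elapsed / 1e6:.0f} MB/s")
```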

In my case, I tried to set it up on an ARM-based NAS and on an Intel Core Duo motherboard I had lying around. On the NAS, it doesn’t get above a ridiculous 15 MB/s. The Intel CPU was a lot speedier, but it was still pushing the CPU near 100%. What makes a setup with a typical NAS so interesting is that power consumption is extremely low, yet most of these boxes easily achieve write speeds of 80-100 MB/s, all in a small physical footprint. The problem, however, is that UrBackup is computing hashes like mad.

And even if you used a modern, expensive CPU, I’m pretty sure you could easily max it out with a fast disk array and multiple network cards in parallel.

For astronomical SHA-256 speed, I guess we’ll have to wait for the next Intel CPU generation, which apparently will come with SHA-256 built into the hardware.
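
In the meantime, if you want to see whether a given box already advertises those SHA extensions, a quick check on Linux (my own snippet, assuming /proc/cpuinfo is readable) would be something like:

```python
# Linux/x86 only: look for the "sha_ni" flag in /proc/cpuinfo.
# Assumes that file exists; on other platforms this simply won't work.
def has_sha_extensions(cpuinfo="/proc/cpuinfo"):
    with open(cpuinfo) as f:
        for line in f:
            if line.startswith("flags"):
                return "sha_ni" in line.split()
    return False

print("SHA extensions:", "present" if has_sha_extensions() else "absent")
```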

My second thought is: why does the server unconditionally need to verify the client hashes at all? There are indeed situations where this could be required, such as lossy internet transfers, but on a typical, well-configured local network packet errors are virtually nonexistent. You could further argue the server can’t “trust” any hashes sent by the clients and therefore has to compute them itself, but then you could say exactly the same about the data blocks themselves. There is no guarantee whatsoever that the client is offering correct data in its backup blocks either; you simply have to trust that it pulled them error-free off its drive.

In short, I think the mandatory server-side hashing scheme is based on two assumptions:

  1. we can’t trust the client to compute the hashes
  2. we can’t trust the network, including TCP/IP’s checksum mechanism, which is indeed unreliable for large amounts of data on lossy networks.

But the price you pay for this is equally high: the design effectively bottlenecks the server, on the assumption that a fast CPU will get the job done anyway.

I think it would be a great addition to UrBackup if the user could at least configure it in such a way that it:

  • uses no hashing at all, or
  • lets the server receive and store the hashes computed by the client, without verifying or recomputing them.

The latter option is much preferred: it not only distributes the hashing load over the clients, making the system far more scalable, but it also keeps incremental backups possible, whereas computing no hashes at all would disable any incremental backup mechanism.
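
Just to make that second option concrete, here is a rough sketch of the idea (my own illustration in Python, not UrBackup’s actual code or wire protocol; the block size and names are made up): the client hashes every block as it reads it and ships the hash along with the data, and the server stores the hash as given instead of recomputing it.

```python
import hashlib

BLOCK_SIZE = 512 * 1024  # hypothetical block size, purely for illustration

def client_blocks(device_path):
    """Client side: read the raw image and hash every block locally."""
    with open(device_path, "rb") as dev:
        index = 0
        while True:
            block = dev.read(BLOCK_SIZE)
            if not block:
                break
            # The digest travels with the block instead of being
            # recomputed on the server.
            yield index, hashlib.sha256(block).digest(), block
            index += 1

def server_store(blocks, image_file, hash_index):
    """Server side: write each block and keep the client's hash as given."""
    for index, digest, block in blocks:
        image_file.seek(index * BLOCK_SIZE)
        image_file.write(block)
        hash_index[index] = digest  # stored as-is, no server-side hashing
```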

Is anyone having the same experience? Slow backups? I have meanwhile put UrBackup on a QNAP TS-251, which is a very powerful NAS compared to the ARM-based ones. It makes things much quicker, but it again shows that it is wrong to assume I/O is the bottleneck. A single client pushes the CPU load to about 80%. When I put both network ports in trunk mode (2 Gbit/s), it simply saturates the CPU. In summary, as pointed out before:

  • The user should have the option of disabling the hashing mechanism entirely for local image backups, or better, an option to have hashes precomputed by the clients only and stored on the server as given, so the server is not saturated by performing the hash computations for all its clients.

  • Any improvement in general hashing performance would also be very welcome.

But the way it is now, in my opinion, the trade-off of server-side hashing for the sake of correctness is simply not justified. It can’t scale.

No. My test bed runs dual Xeon X5650s with 32 GB RAM, and I can receive 50-80 MB/s with no issue. With six disks in a RAID 5 I am mainly limited by disk I/O; CPU usage is never above 3%. We saw the same thing on older machines running dual Xeon E5450s with 16 GB RAM and a similar disk layout.

No. Those boxes achieve those speeds because they have hardware NICs. They also cannot keep it up after they saturate their cache; the speeds fall off quickly once that occurs. You have to look at sustained writes: write a TB of data and see what your average throughput is. Once they fill their cache and have to start calculating parity, the speeds plummet.
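
If you want to check that yourself, a crude sustained-write test along these lines (my own sketch; the path and total size are placeholders, and switch to non-compressible data if the target filesystem compresses or deduplicates zeros) will show the drop once the cache is exhausted:

```python
import os
import time

# Stream far more data than the box can cache and report the average.
TARGET = "/mnt/backup/write_test.bin"   # placeholder path on the array
CHUNK = b"\x00" * (8 * 1024 * 1024)     # 8 MiB writes
TOTAL = 100 * 1024 ** 3                 # 100 GiB, adjust to exceed the cache

start = time.perf_counter()
written = 0
with open(TARGET, "wb") as f:
    while written < TOTAL:
        f.write(CHUNK)
        written += len(CHUNK)
    f.flush()
    os.fsync(f.fileno())  # make sure the data actually reached the disks
elapsed = time.perf_counter() - start

print(f"Average sustained write: {written / elapsed / 1e6:.0f} MB/s")
os.remove(TARGET)
```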

First rule of programming: Trust no data. You have to verify it. Especially if you are going to do incrementals. Otherwise, a little bit of corrupted data will cause severe data loss.

If you don’t hash on both ends and compare, how are you going to know if you have good data or bad? How do you know if you can trust it? A hash on one side is less than useless, because if you just implicitly trust it for the purpose of knowing what is on disk, then you could be corrupting multiple backups.
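
To be explicit about what “hash on both ends and compare” means here, a minimal sketch (my own illustration, not UrBackup’s actual code path):

```python
import hashlib
import hmac

def verify_block(block, client_digest):
    """Server side: recompute the hash of what actually arrived and
    compare it to what the client claims it sent."""
    return hmac.compare_digest(hashlib.sha256(block).digest(), client_digest)

def store_block(block, client_digest, write):
    if not verify_block(block, client_digest):
        # Corrupted in transit or bad client data: reject the block rather
        # than silently poison this backup and every incremental after it.
        raise ValueError("block hash mismatch")
    write(block)
```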

Not in my experience. If you have the capability to read raw TCP/IP, check the logging and see how many NAK errors you get; each one of those is a potential loss if you just implicitly trust the data. Not to mention that you are trying to hit a moving target: updates break functionality all the time. Recently I had some backups fail due to client-side deduplication. If not for the hashes, I would have thought I had good data on disk. Instead, I knew to exclude some directories, and the next backup was good.

Trust, but verify. An unverified backup is no backup at all.