Thanks for the reply.
Well, in my opinion that assumption is not correct. First of all, most people won’t use a top-class, power-hungry CPU for a simple backup server. Looking at current average or low-end CPUs, maximum SHA-256 speed is at most a couple of hundred MB per second, which is still very significant compared to typical combined I/O speeds (network, disk, etc.), which are in the same range. If hashing speed were 20x the combined I/O speed it would indeed be insignificant, but it is not.
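For anyone who wants to check these numbers on their own hardware, here is a minimal sketch that measures single-threaded SHA-256 throughput with Python’s hashlib (the 64 MiB buffer size is an arbitrary choice, just large enough to make setup cost negligible):

```python
import hashlib
import os
import time

def sha256_throughput_mb_s(size: int = 64 * 1024 * 1024) -> float:
    """Hash one in-memory buffer and return SHA-256 throughput in MB/s."""
    data = os.urandom(size)          # random data, so nothing is optimized away
    start = time.perf_counter()
    hashlib.sha256(data).hexdigest()
    elapsed = time.perf_counter() - start
    return size / elapsed / 1e6

print(f"SHA-256: {sha256_throughput_mb_s():.0f} MB/s")
```

On a low-end ARM box you can expect this to land in the tens of MB/s, right in the range discussed above.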
In my case, I tried to set it up on an ARM-based NAS and on an Intel Core Duo motherboard I had lying around. On the NAS, it doesn’t get above a ridiculous 15 MB/s. The Intel CPU was a lot speedier, but it was still pushed near 100% load. What makes a typical NAS so interesting as a backup target is that power consumption is extremely low and the physical dimensions are small, yet most of these boxes still easily achieve write speeds of 80-100 MB/s. The problem, however, is that URBackup is computing hashes like mad.
And even with a modern, expensive CPU, I’m pretty sure you could max it out by using a fast disk array and multiple network cards in parallel.
For astronomical SHA-256 speed, I guess we’ll have to wait for the next Intel CPU generation, which apparently will ship with built-in hardware SHA-256 support.
My second thought is: why does the server unconditionally need to verify client hashes at all? There are indeed situations where this could be required, such as lossy internet transfers, but in a typical well-configured local network packet errors are virtually nonexistent. You could further argue that the server can’t “trust” any hashes sent by the clients and therefore has to compute them itself, but then again, you could say exactly the same thing about the data blocks themselves. There is no guarantee whatsoever that the client is offering correct data in its backup blocks - you simply have to trust that it pulled them error-free off its drive.
In short, I think the use of the mandatory server hashing scheme is based on two assumptions here:
- we can’t trust the client to compute hashes correctly
- we can’t trust the network, including TCP/IP’s checksum mechanism, which is indeed unreliable for large amounts of data on networks with high loss rates.
But the price you pay for it is equally high: this design effectively bottlenecks the server on CPU, on the assumption that a fast CPU will get the job done anyway.
I think it would be a great addition to URBackup if the user could at least configure it in such a way that the server:
- either uses no hashing at all,
- or receives and stores the hashes computed by the client, without verifying or recomputing them.
The latter solution is much preferred: it not only distributes the hashing load over the clients, making the system far more scalable, but it also keeps hashes available, whereas computing no hashes at all would disable any incremental backup mechanism.
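To make the proposal concrete, here is a hypothetical sketch of the second option: the client sends each block together with its own SHA-256, and the server stores that hash as-is instead of recomputing it, with verification as an opt-in for untrusted setups. All names here are illustrative, not URBackup’s actual API:

```python
import hashlib

def client_prepare_block(block: bytes):
    """Client side: hash once, so the load is spread across clients."""
    return block, hashlib.sha256(block).hexdigest()

class Server:
    def __init__(self, verify: bool = False):
        self.verify = verify   # True would correspond to today's behavior
        self.index = {}        # hash -> block store, still usable for dedup

    def receive(self, block: bytes, client_hash: str):
        if self.verify:
            # Opt-in recomputation for untrusted clients or lossy links.
            if hashlib.sha256(block).hexdigest() != client_hash:
                raise ValueError("hash mismatch")
        # Trusted mode: store the client-supplied hash directly, so
        # incremental lookups keep working without server-side hashing.
        self.index[client_hash] = block

server = Server(verify=False)
server.receive(*client_prepare_block(b"some file data"))
```

The point of the sketch is that the hash index the incremental mechanism depends on stays intact; only who computes the hashes changes.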