How Does The Client Detect File Changes?

Hello:

Sorry - I’ve scanned the documentation and the forum - but I can’t find an answer to this:

I’m curious about how the client detects newly modified files to backup. From looking at system logs, it appears that even (some?) unchanged files get re-hashed during an Internet incremental backup - e.g.

2020-07-09 14:21:40: Parallel hash "/usr/share/zsh/5.2/functions/_arrays" id=301559 hash=


I also see high CPU usage from the backend process during a backup.

Does this imply that the client is entirely rehashing the backup set during each incremental backup? If so, is there a reliable way to reduce the resource impact (CPU and disk I/O usage) within UrBackup?

Many thanks!

The file is too small, and doesn’t get hashed at all.

A good first step to analyse the CPU/IO impact would be to look at the thread names (e.g. in htop – enable it in the settings; iotop shows them by default). That way you'd know whether the hashing or the transfer is causing it.

The ultimate way is to install debug symbols (install the -dbg package), then run perf top -g.

Ok, sorry, I didn't read your username. I don't know how one would do that on macOS :confused: You might also want to make sure crypto++ is compiled with all the CPU acceleration active (AES, SHA etc.)

Thanks for the quick reply.

So how does the software determine that an existing file needs to be backed up afresh? Given a file of sufficient size, is it always rehashed entirely on each incremental backup pass, and compared with the stored hashes?

Haha! :wink: I’ll check on the compile flags for that - thanks - but my concern is as much about disk I/O and time required for backup, as it is about CPU usage. My deployment target would be multi-TB production workstations, and the system would not cope with entirely re-reading and hashing 12TB of data on each backup cycle.

My existing backup software relies on fsevents (Time Machine), or on changes in mtime or size; I realise that's not a completely foolproof way to catch modifications, but it seems to catch the vast majority of changes with a vastly reduced resource load.

EDIT: I realise that fsevents are - of course - not yet supported by the UrBackup macOS client. It's on my wishlist. :laughing:

It compares mtime+ctime.
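To illustrate (purely a sketch, not UrBackup's actual code): an incremental scan working this way would stat() each file, compare the current mtime/ctime against the values recorded during the previous backup, and only re-read and re-hash the file if they differ. Roughly:

#include <sys/stat.h>
#include <string>

// Hypothetical structure: what the indexer remembered about a file from the
// previous backup run.
struct IndexedFile
{
    std::string path;
    time_t last_mtime; // modification time recorded at the last backup
    time_t last_ctime; // inode change time recorded at the last backup
};

// Returns true if the file should be re-read, re-hashed and transferred.
bool needs_backup(const IndexedFile& entry)
{
    struct stat st;
    if (stat(entry.path.c_str(), &st) != 0)
    {
        // Cannot stat the file (deleted, permissions changed, ...):
        // let the backup logic treat it as changed.
        return true;
    }

    // Unchanged mtime and ctime: assume the content is unchanged and reuse
    // the stored hash instead of re-reading the whole file.
    return st.st_mtime != entry.last_mtime
        || st.st_ctime != entry.last_ctime;
}

If both timestamps match, the file's data never has to be read again, which is what keeps incremental passes over multi-TB sets feasible.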

This would require an implementation of ChangeJournalWatcher.cpp for macOS. Maybe there is something better in APFS to list the changes relative to a certain point in time/transaction (it is CoW metadata, after all…).
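If FSEvents turns out to be the right analogue on macOS, a watcher there might take roughly this shape (purely a sketch, not an existing UrBackup component; FSEvents reports directory-level paths, so the indexer would still have to stat the files inside each reported directory):

#include <CoreServices/CoreServices.h>
#include <cstdio>

static void fsevents_callback(ConstFSEventStreamRef stream,
                              void* client_info,
                              size_t num_events,
                              void* event_paths,
                              const FSEventStreamEventFlags event_flags[],
                              const FSEventStreamEventId event_ids[])
{
    char** paths = static_cast<char**>(event_paths);
    for (size_t i = 0; i < num_events; ++i)
    {
        // A real watcher would mark these directories as dirty so the next
        // incremental backup only re-scans them.
        std::printf("Changed: %s\n", paths[i]);
    }
}

int main()
{
    CFStringRef path = CFSTR("/Users");
    CFArrayRef paths_to_watch = CFArrayCreate(
        kCFAllocatorDefault, reinterpret_cast<const void**>(&path), 1,
        &kCFTypeArrayCallBacks);

    FSEventStreamContext context = {0, nullptr, nullptr, nullptr, nullptr};
    FSEventStreamRef stream = FSEventStreamCreate(
        kCFAllocatorDefault,
        &fsevents_callback,
        &context,
        paths_to_watch,
        kFSEventStreamEventIdSinceNow, // or a persisted event id to resume from
        1.0,                           // latency in seconds; events are coalesced
        kFSEventStreamCreateFlagNone);

    FSEventStreamScheduleWithRunLoop(stream, CFRunLoopGetCurrent(), kCFRunLoopDefaultMode);
    FSEventStreamStart(stream);
    CFRunLoopRun(); // run until interrupted
    return 0;
}

The persisted event id is the interesting part: FSEvents can report everything that changed since a stored id, which maps fairly well onto "changes since the last backup".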

So - please forgive me if I'm looking too closely at log files and leaping to the wrong conclusions from them - on a system doing Internet incremental backups, where a full backup was completed yesterday and the machine has hardly been touched since (other than running backups), should I expect to see a lot of system log entries like this? (Because that's what I am seeing…)

2020-07-09 18:00:47: Parallel hash "/Applications/Safari.app/Contents/PkgInfo" id=49530 hash=
2020-07-09 18:00:47: Parallel hash "/Applications/Safari.app/Contents/version.plist" id=49531 hash=
2020-07-09 18:00:47: Parallel hash "/Applications/Safari.app/Contents/MacOS/Safari" id=49533 hash=+kVYAbJ1OTRi3VBEmylZFTCr76K9ANLEr2I4MDu0zBvgvsdHAQDQAwEAAAABAAAAAQAAAAEAAAABAAAAAQAAAA--
2020-07-09 18:00:47: Parallel hash "/Applications/Safari.app/Contents/PlugIns/CacheDeleteExtension.appex/Contents/Info.plist" id=49539 hash=
2020-07-09 18:00:47: Parallel hash "/Applications/Safari.app/Contents/PlugIns/CacheDeleteExtension.appex/Contents/version.plist" id=49540 hash=
2020-07-09 18:00:47: Parallel hash "/Applications/Safari.app/Contents/PlugIns/CacheDeleteExtension.appex/Contents/MacOS/CacheDeleteExtension" id=49542 hash=SvHp1ijYJJ+zrUxHuLoGaRelVTbbMEuPmNNq3geHrtXbr0PZLLpR3wVKOnoBAFAFAQAAAAEAAAABAAAAAQAAAA--
2020-07-09 18:00:47: Parallel hash "/Applications/Safari.app/Contents/PlugIns/CacheDeleteExtension.appex/Contents/_CodeSignature/CodeResources" id=49545 hash=wFujiBPUxnJ3KLuVQrFZrWPJ2SoBAAAAAQAAAAEAAAABAAAAAQAAAAEAAAABAAAAAQAAAAEAAAABAAAAAQAAAA--

Safari hasn’t been touched since the previous backups (running at hourly intervals) - but it looks like the files are being rehashed each time.

Of course the other possibility is that this is a quirk of the macOS build I’m running, and that this is not expected behaviour on a supported platform - so presumably something about how macOS is reporting the timestamps is off?

Indeed - I’m building up to that… :wink:

Unfortunately APFS is not yet well documented for developers - so I’m not sure such things are accessible from a public API.

Many thanks!

Indeed, maybe you can track this down further by checking whether the "change indicator" of Safari has changed between backups in the respective *.ub files (i.e. grep 'Safari"' *.ub on the server and on the client). For unchanged files it puts the hash directly in the .ub file if parallel hashing is enabled.

Also perhaps try disabling parallel hashing, to check if it’s only a problem with that…

Hi:

I’ve grepped for Safari on both client and server - results attached, but I don’t know how to interpret this.

Will try disabling parallel hashing now, and will report back.

Thanks!
servergrep.txt (59.5 KB) clientgrep.txt (18.8 KB)

Ok - with parallel hashing disabled, I’m seeing a much lower CPU usage on my (macOS) client, and only instances of hashing where it’s understandable that a file will have changed.

Re-enabling parallel hashing, I see a big batch of Parallel hash messages in the log again.

I’ve yet to dive in deep, but it looks like these messages are only for small files - perhaps under the hashing limit?

Yeah, an optimization where it doesn't pipe them through the parallel hashing might be good, I guess. Your Safari example is above the hashing limit, though, so it might not save the hashes properly.
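If that turns out to be the cause, the optimization could be little more than a size gate in the indexer - roughly (purely illustrative; the threshold below is made up, the real client has its own hashing limit):

#include <cstdint>

// Illustrative threshold only; the actual client defines its own limit.
constexpr int64_t parallel_hash_min_size = 512 * 1024; // 512 KiB

// Returns true if the file is worth handing to the parallel hashing thread.
// Large files benefit from overlapping hashing with the directory scan; for
// small files the queueing overhead outweighs the gain.
bool use_parallel_hash(int64_t file_size)
{
    return file_size >= parallel_hash_min_size;
}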