Progress 100% but not finishing

dlsizer · May 8, 2013, 1:35pm

I have several backup jobs that keep reaching 100% on the progress bar, but show hundreds of thousands of files remaining in queue. It is slowly counting down on the queue, but at the current pace, it will likely not finish before the same job would be scheduled to begin again.

Am I dealing with an I/O bottleneck at this point? What would cause the above scenario, where the job reaches 100%, but clearly hasn’t finished?

uroni · May 8, 2013, 10:54pm

At this point it is looking up the files in the database and linking them if found, moving them if not. So it might be limited by the database lookup part, or the linking part. The database should be on a fast enough volume (random access). With 1.2 you can increase the cache size in Settings->advanced. Choosing a higher number (as appropriate) here might help. It won’t start another backup for the same client till this is finished.

It’s not putting any warnings into the logfile (/var/log/urbackup.log), is it?

dlsizer · May 10, 2013, 1:54pm

The only error message that seems potentially interesting is this one:

ERROR: SQLITE: Long running query Stmt: [INSERT INTO files (backupid, fullpath, hashpath, shahash, filesize, created, rsize, did_count, clientid, incremental) SELECT backupid, fullpath, hashpath, shahash, filesize, created, rsize, 0 AS did_count, clientid, incremental FROM files_tmp]

The database exists on a fast set of drives (SAS 6G, 7200RPM, RAID10), though it only has an older dual core proc. Although the CPU utilization is usually quite low, currently it is sitting at 80-95% with the same three jobs sitting at 100% but with hundreds of thousands of files in queue for each job. Other jobs ran and completed normally while these have been in this state for over two days. The queued files list is slowly counting down.

With your assistance, I recently resolved my ever-expanding WAL file. Two days later, the WAL file has grown to over 13G again. I can only assume this is related to these “stuck” jobs. The DB has also grown by almost 2G in the last 2 days. Also of interest is that these same servers ran this same configuration of jobs for months without error, then got stuck like this. My only solution in the past was to delete the urbackup installation, install a clean copy, move all of the data, and start over. Again, everything ran beautifully for a couple of months, until now. It seems that as the database and WAL file begin to grow out of control, the jobs take longer and longer until they reach this state.
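For anyone wanting to track the WAL growth described above, here is a minimal Python sketch. It is not something from this thread: the helper names `wal_report` and `try_checkpoint` are made up for illustration, and it assumes the standard SQLite convention that the WAL lives next to the database as `<db>-wal`. `PRAGMA wal_checkpoint(TRUNCATE)` is plain SQLite (3.8.8 or later). A checkpoint cannot complete while another connection is in the middle of a long-running transaction, in which case the first returned column (`busy`) is 1.

```python
import os
import sqlite3
import tempfile


def wal_report(db_path):
    """Return (db_bytes, wal_bytes) for a SQLite database file.

    Assumes the WAL file sits next to the database as '<db_path>-wal'.
    """
    wal_path = db_path + "-wal"
    db_bytes = os.path.getsize(db_path)
    wal_bytes = os.path.getsize(wal_path) if os.path.exists(wal_path) else 0
    return db_bytes, wal_bytes


def try_checkpoint(db_path):
    """Ask SQLite to checkpoint and truncate the WAL.

    Returns the (busy, log_frames, checkpointed_frames) row from the pragma.
    busy == 1 means the checkpoint was blocked by another connection.
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    # Demo on a throwaway database, not on a real backup_server.db.
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    writer = sqlite3.connect(path)
    writer.execute("PRAGMA journal_mode=WAL").fetchone()
    writer.execute("CREATE TABLE t (x INTEGER)")
    writer.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
    writer.commit()
    print("before:", wal_report(path))
    print("checkpoint:", try_checkpoint(path))
    print("after:", wal_report(path))
    writer.close()
```

On a real server this would only be safe to try with UrBackup stopped; whether it helps while the stuck jobs are still writing is an open question.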

uroni · May 10, 2013, 9:14pm

Okay. So it is a database performance issue. Could you run these queries on backup_server.db and post the results?

SELECT COUNT(*) FROM files;
SELECT COUNT(*) FROM files WHERE did_count=0;
SELECT COUNT(*) FROM (SELECT 1 FROM files GROUP BY shahash);
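If the sqlite3 shell is not at hand, the three queries above can also be run from Python. This is only a sketch: `file_table_stats` and `open_read_only` are hypothetical helper names, and the demo table has just the two columns the queries touch, not the real `files` schema.

```python
import sqlite3

# The three diagnostic queries from the post, keyed by a descriptive label.
QUERIES = {
    "total_entries": "SELECT COUNT(*) FROM files",
    "did_count_zero": "SELECT COUNT(*) FROM files WHERE did_count=0",
    "distinct_hashes": "SELECT COUNT(*) FROM (SELECT 1 FROM files GROUP BY shahash)",
}


def open_read_only(db_path):
    """Open a database read-only via a SQLite URI so nothing can be modified."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def file_table_stats(conn):
    """Run each query and return {label: count}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in QUERIES.items()}


if __name__ == "__main__":
    # Tiny in-memory stand-in for the files table (assumed columns only).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (shahash BLOB, did_count INTEGER)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?)",
        [(b"a", 0), (b"a", 1), (b"b", 0)],
    )
    print(file_table_stats(conn))
```

Against a real installation you would pass the path of backup_server.db to `open_read_only` instead of using the in-memory demo. On a table with millions of rows the GROUP BY query may take a while.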

You obviously have file backups with a lot of files.
To fix the problem I might have to add further optimizations to the part of UrBackup which saves the hashes. As a short-term fix you could do one of the following:

Disable or increase the interval of full file backups (hashes are only saved for those)
Delete the contents of the “files” table. This will cause UrBackup to download and save all files of all clients at the next full file backup (no hard linking), so do that only if you have enough storage for that.
Steps: Shut down UrBackup. Change journal mode (PRAGMA journal_mode=DELETE) and delete the file entries (DELETE FROM files;). Afterwards you can shrink the db (VACUUM;).
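The steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested procedure for a production server: `reset_files_table` is a hypothetical name, the UrBackup server must already be shut down, and you should keep a copy of the database first. The connection is opened in autocommit mode because SQLite refuses to run VACUUM inside a transaction.

```python
import sqlite3


def reset_files_table(db_path):
    """Switch the journal mode to DELETE, empty the files table, then VACUUM.

    Returns the journal mode reported by SQLite after the change.
    """
    # isolation_level=None puts the connection in autocommit mode, so the
    # DELETE is committed immediately and VACUUM is allowed to run.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
        conn.execute("DELETE FROM files")  # drops every file entry
        conn.execute("VACUUM")             # rebuilds the file to reclaim space
        return mode
    finally:
        conn.close()
```

Calling `reset_files_table("/path/to/backup_server.db")` (path is a placeholder) should return `"delete"` if the mode change succeeded. Note that VACUUM needs free disk space for a temporary copy of the database while it runs.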

dlsizer · May 13, 2013, 8:30pm

Results of the queries:

sqlite> select count(*) from files;
14563470
sqlite> select count(*) from files where did_count=0;
6074425
sqlite> select count(*) from (select 1 from files group by shahash);
742433
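To put those three numbers in proportion, a quick bit of arithmetic in Python (the variable names are just labels for the counts above, and the reading of the ratios is a guess rather than anything confirmed here):

```python
# Counts copied from the query results above.
total_entries = 14_563_470
did_count_zero = 6_074_425
distinct_hashes = 742_433

# Average number of rows in "files" per distinct shahash value.
entries_per_hash = total_entries / distinct_hashes

# Fraction of all rows that have did_count = 0.
zero_share = did_count_zero / total_entries

print(f"{entries_per_hash:.1f} entries per distinct hash")
print(f"{zero_share:.1%} of entries have did_count=0")
```

So the table holds roughly 20 rows for every distinct hash, which would fit with the same unchanged files being recorded again on each full backup.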

I will try adjusting the intervals for full file backups and work with incrementals to see if that helps. Considering the size of my data sets (which are not currently all inclusive - I have more servers to add to this environment), should I be focusing more on incremental over full file backups?

By the way, two of the three jobs are still running and slowly counting down files, while the database and WAL continue to grow (the WAL has reached 24G now). Pressing the STOP button next to the jobs has no effect whatsoever.

uroni · May 14, 2013, 8:13pm

Yes, this would probably help. As I said, UrBackup only adds file entries for unchanged files to the database if it is running a full backup. Those file entries are causing your performance problems. I’m working on a file entry cache.