UrBackup crashing

Hi. We were running UrBackup Server (1.4.x series) on Centos 6.6 (ext4 for / and xfs file system for storage of backups) for well over a year (Dell PowerEdge 730xd). Occasionally we would see the server process die with the warning that the UrBackup Server died but the PID remained. We would then simply re-start the UrBackup Server process. It would happen only occasionally (maybe once every 2 months).

Since we’ve upgraded to UrBackup Server (2.0.33) (no issues upgrading) we’re now seeing daily crashes. In looking at the UrBackup server logs we see lots of the following errors:

2016-09-12 04:37:04: WARNING: SQLite: disk I/O error errorcode: 522
2016-09-12 04:37:04: ERROR: Error preparing Query [PRAGMA cache_size = -2048]: disk I/O error
2016-09-12 04:37:04: WARNING: SQLite: statement aborts at 4: [ATTACH DATABASE ‘urbackup/backup_server_settings.db’ AS settings_db] disk I/O error errorcode: 522
2016-09-12 04:37:04: ERROR: Error in CQuery::Execute - disk I/O error Stmt: [ATTACH DATABASE ‘urbackup/backup_server_settings.db’ AS settings_db]
2016-09-12 04:37:04: ERROR: SQL: disk I/O error Stmt: [ATTACH DATABASE ‘urbackup/backup_server_settings.db’ AS settings_db]
2016-09-12 04:37:04: Starting incremental file backup…
2016-09-12 04:37:04: WARNING: SQLite: disk I/O error errorcode: 522
2016-09-12 04:37:04: ERROR: Error preparing Query [SELECT incremental, path, resumed, complete, id FROM backups WHERE clientid=? AND tgroup=? AND done=1 ORDER BY backuptime DESC LIMIT 1]: disk I/O error. Retrying in 1s…
2016-09-12 04:37:05: WARNING: SQLite: disk I/O error errorcode: 522
2016-09-12 04:37:05: ERROR: Error preparing Query [SELECT incremental, path, resumed, complete, id FROM backups WHERE clientid=? AND tgroup=? AND done=1 ORDER BY backuptime DESC LIMIT 1]: disk I/O error. Retrying in 1s…
2016-09-12 04:37:06: WARNING: SQLite: disk I/O error errorcode: 522
2016-09-12 04:37:06: ERROR: Error preparing Query [SELECT incremental, path, resumed, complete, id FROM backups WHERE clientid=? AND tgroup=? AND done=1 ORDER BY backuptime DESC LIMIT 1]: disk I/O error. Retrying in 1s…
2016-09-12 04:37:07: WARNING: SQLite: disk I/O error errorcode: 522
2016-09-12 04:37:07: ERROR: Error preparing Query [SELECT incremental, path, resumed, complete, id FROM backups WHERE clientid=? AND tgroup=? AND done=1 ORDER BY backuptime DESC LIMIT 1]: disk I/O error. Retrying in 1s…
2016-09-12 04:37:08: WARNING: SQLite: disk I/O error errorcode: 522
2016-09-12 04:37:08: ERROR: Error preparing Query [SELECT incremental, path, resumed, complete, id FROM backups WHERE clientid=? AND tgroup=? AND done=1 ORDER BY backuptime DESC LIMIT 1]: disk I/O error. Retrying in 1s…
2016-09-12 04:37:09: WARNING: SQLite: disk I/O error errorcode: 522
2016-09-12 04:37:09: ERROR: Error preparing Query [SELECT incremental, path, resumed, complete, id FROM backups WHERE clientid=? AND tgroup=? AND done=1 ORDER BY backuptime DESC LIMIT 1]: disk I/O error

I ran the urbackupsrv repair-database and it found a few issues which it fixed. Ran for a day without issue (no crashes). The we saw the crashes daily again. Same errors. There’s plenty of disk space, no CPU issues or memory issues and the load on the server is not high when this occurs. Disks seem to be OK, but perhaps a server reboot and diskcheck would be warranted.

Upon starting the server I see:

2016-09-12 05:05:32: Starting HTTP-Server on port 55414
2016-09-12 05:05:32: HTTP: Server started up successfully!
2016-09-12 05:05:32: SQLite: recovered 12772 frames from WAL file /usr/local/var/urbackup/backup_server_files.db-wal code: 283
2016-09-12 05:05:33: SQLite: recovered 4602 frames from WAL file /usr/local/var/urbackup/backup_server_link_journal.db-wal code: 283
2016-09-12 05:05:34: SQLite: recovered 61009 frames from WAL file /usr/local/var/urbackup/backup_server_links.db-wal code: 283
2016-09-12 05:05:34: SQLite: recovered 1 frames from WAL file /usr/local/var/urbackup/backup_server_settings.db-wal code: 283
2016-09-12 05:05:35: Started UrBackup…
2016-09-12 05:05:35: Removing temporary files…
2016-09-12 05:05:35: Recreating temporary folder…
2016-09-12 05:05:35: Backup destination cannot handle subvolumes and snapshots. Snapshots disabled.
2016-09-12 05:05:35: Broadcasting on interface IP 172.20.161.26
2016-09-12 05:05:35: Broadcasting on interface IP 172.20.161.233
2016-09-12 05:05:35: Broadcasting on interface IP 169.254.0.2
2016-09-12 05:05:35: InternetService: Server started up successfully!
2016-09-12 05:05:35: UrBackup Server start up complete.

and off we go, backups start and we wait for the next crash.

Any advice appreciated on how to proceed or diagnose. Thanks.

Pete

Could you attach ptrace (with ptrace -f -p PID) and capture that? This could be a SQLite issue which needs to be reported. Also of interest is your Linux kernel version and any disk drivers involved in writing to the database (/var/urbackup).

Thanks for the quick response. Our Centos 6.6 kernel is 2.6.32-504.12.2.el6.x86_64. I have attached what I hope is the disk driver information and an strace. I fear that the strace will not show what you need to see as the issue is currently not occurring. Do I need to leave the strace runnng all day? I imagine that the strace log file will grow pretty big.

Petemodinfo.txt (1.7 KB)
strace_urbackupsrv.txt (265.5 KB)

Yeah, strace needs to capture the error.

One other has the problem with a HP 380 G9 RAID5 SQLite: disk I/O error errorcode: 522 Backup Server idk which RAID controller/driver that is. Do you use a RAID controller?

Yes, we do use a raid controller.

03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
Kernel driver in use: megaraid_sas
Kernel modules: megaraid_sas

I will try to capture the strace when the error occurs. Given that it happens randomly during the day/night this may take a while.

Urbackup worked but until now always with the RAID controller only after the update to 2.0.31 no longer!!!

Error 522 implies your SQLite DB is corrupt (or your controller/disk/OS is returning less data than expected/requested).

“Disk seems to be OK” - don’t trust a singular disk, check your SMART status for unrecoverable errors or if you insist on a hardware controller, make sure issues don’t get filtered in there somewhere. The SQLite database gets hammered quite a bit, try moving your databases to another disk (eg. a flash drive) and see if the issues persist.

With LSI controllers, there could also be an issue if your controller firmware and driver mismatches or simply a bug in firmware, make sure you’re running the latest version of their firmware and driver.

Hi,
Finally caught another crash using strace (I believe). I have attached a zipped file of the strace. Thoughts? Thanks.

Petestrace_urbackupsrv.zip (1.1 MB)

It does not seem to log the errors in the first post in this strace. Are you sure you captured this issue?

The strace was running when the service died. I will try again.

Pete

Are you using the command line sqlite3 tool to do something with the database?

We are using a set of scripts that run sqlite3 to query the database to pull more error information for us. We do not write to the database, only read. As of yesterday, I have disabled these scripts to see if that resolves the issue. Now just waiting to see if it happens again. Is there an issue with reading the database with sqlite3?

Someone else thinks the errors might be caused by this, so we’ll see.

Hi,
I believe that I have just the same error. I have some scripts running from a Nagios server that checks the age of the backups of different servers. It seemed to work well with UrBackup version 1.4.

I also attached urbackupsrv to GDB, and it seemed to crash on a prepared statement being NULL. I did not save the output…

Anyways, I hope this contains the error: urbackup_strace.zip (6.8 KB)

I believe I have found the issue. Thanks for all the help!

Fix will be in server version 2.0.35.

Hi,
This is great news! Looking forward to 2.0.35.