Incremental file backup hang on lvcreate problem

Today I found a hung incremental file backup of an old debian linux virtual machine.

Looking into this hang I found two errors. My guess is that both are related to system load.

  • initially lvcreate hung
  • later “Error starting file metadata download thread”

The metadata download thread problem happens occasionally and recovers eventually with retries. The lvcreate hang is more of a problem as it required attention to clear.

Do you have any advice about these errors? I have included some of my notes below:

On the server:

root@curlew:~# urbackupsrv --version
UrBackup Server v2.4.12.0
Copyright (C) 2011-2019 Martin Raiber
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
root@curlew:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:        10
Codename:       buster
root@curlew:~# uname -a
Linux curlew 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux

On the client:

root@avocet:~# urbackupclientbackend --version
UrBackup Client Backend v2.4.9.0
Copyright (C) 2011-2019 Martin Raiber
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
root@avocet:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.11 (jessie)
Release:        8.11
Codename:       jessie
root@avocet:~# uname -a
Linux avocet 3.16.0-10-amd64 #1 SMP Debian 3.16.81-1 (2020-01-17) x86_64 GNU/Linux

The web gui log for the machine showed:

14/02/20 02:50  	INFO  	Waiting for file hashing and copying threads...
14/02/20 02:50  	INFO  	Writing new file list...
14/02/20 02:50  	INFO  	All metadata was present
14/02/20 02:50  	DEBUG  	Syncing file system...
14/02/20 02:51  	DEBUG  	Creating symbolic links. -1
14/02/20 02:51  	DEBUG  	Creating symbolic links. -2
14/02/20 02:51  	DEBUG  	Symbolic links created.
14/02/20 02:51  	INFO  	Transferred 64.6736 MB - Average speed: 9.74898 MBit/s
14/02/20 02:51  	DEBUG  	Script does not exist urbackup/post_incr_filebackup
14/02/20 02:51  	INFO  	Time taken for backing up client avocet: 2m 10s
14/02/20 02:51  	INFO  	Backup succeeded
14/02/20 06:56  	DEBUG  	avocet: Connecting for filelist...
14/02/20 06:56  	DEBUG  	avocet: Waiting for filelist
14/02/20 06:56  	DEBUG  	avocet: Connecting for filelist (async)...

On the machine being backed up I found

root@avocet:~# lvs
^C  Interrupted...
  Giving up waiting for lock.
  /run/lock/lvm/V_vg0:aux: flock failed: Interrupted system call
  Can't get lock for vg0
  Skipping volume group vg0

root@avocet:~# lsof | grep V_vg0
dmeventd   2282             root    8uW     REG               0,18          0    2171436 /run/lock/lvm/V_vg0:aux
dmeventd   2282             root    9u      REG               0,18          0    2164498 /run/lock/lvm/V_vg0
dmeventd   2282 25459       root    8uW     REG               0,18          0    2171436 /run/lock/lvm/V_vg0:aux
dmeventd   2282 25459       root    9u      REG               0,18          0    2164498 /run/lock/lvm/V_vg0
dmeventd   2282 25465       root    8uW     REG               0,18          0    2171436 /run/lock/lvm/V_vg0:aux
dmeventd   2282 25465       root    9u      REG               0,18          0    2164498 /run/lock/lvm/V_vg0
lvcreate  25423             root    5uW     REG               0,18          0    2164498 /run/lock/lvm/V_vg0

root@avocet:~# ps -p 25423 -f
UID        PID  PPID  C STIME TTY          TIME CMD
root     25423 25402  0 06:56 ?        00:00:01 lvcreate -l50%FREE -s -n urbackup_snap_8a15686e1ff0af5c33d27d36b0aba7ad7bb3d2e178bb26d2 /dev/mapper/vg0-root
root@avocet:~# ps -p 25402 -f
UID        PID  PPID  C STIME TTY          TIME CMD
root     25402 25401  0 06:56 ?        00:00:01 /bin/sh /usr/local/share/urbackup/lvm_create_filesystem_snapshot 8a15686e1ff0af5c33d27d36b0aba7ad7bb3d2e178bb26d2 / root /
root@avocet:~# ps -p 25401 -f
UID        PID  PPID  C STIME TTY          TIME CMD
root     25401   558  0 06:56 ?        00:00:00 sh -c /usr/local/share/urbackup/lvm_create_filesystem_snapshot 8a15686e1ff0af5c33d27d36b0aba7ad7bb3d2e178bb26d2 "/" "root" "/" 2>&1
root@avocet:~# ps -p 558 -f
UID        PID  PPID  C STIME TTY          TIME CMD
root       558     1  2 Feb11 ?        01:57:41 /usr/local/sbin/urbackupclientbackend --config /etc/default/urbackupclient --no-consoletime

I terminated the stuck lvcreate:

root@avocet:~# lvs
  LV                                                             VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root                                                           vg0  owi-aos---  13.97g
  swap                                                           vg0  -wi-ao---- 952.00m
  urbackup_snap_8a15686e1ff0af5c33d27d36b0aba7ad7bb3d2e178bb26d2 vg0  swi-a-s---   2.55g      root   69.03

So the snapshot seems to have been created before the hang. The backup was still waiting so I stopped it via the Activities tab of the web gui. The log then showed:

14/02/20 11:07  	WARNING  	Sever admin stopped backup during indexing
14/02/20 11:07  	ERROR  	Backup had an early error. Deleting partial backup.
14/02/20 11:07  	WARNING  	Exponential backoff: Waiting at least 40m before next file backup

When the backup started again later it got past the snapshot but failed later:

14/02/20 11:47  	INFO  	Starting scheduled incremental file backup...
14/02/20 11:47  	DEBUG  	avocet: Doing backup with hashes...
14/02/20 11:47  	DEBUG  	avocet: Doing backup with intra file diffs...
14/02/20 11:47  	DEBUG  	avocet: Connecting for filelist...
14/02/20 11:47  	DEBUG  	avocet: Waiting for filelist
14/02/20 11:47  	DEBUG  	avocet: Connecting for filelist (async)...
14/02/20 11:50  	INFO  	Logical volume "urbackup_snap_e9dee62ea3cf7cc6d38d974bfe143b6fdb133e97179a6571" created
14/02/20 11:50  	INFO  	Indexing of "root" done. 11626 filesystem lookups 0 db lookups and 0 db updates
14/02/20 11:50  	INFO  	Backing up "shares_0" without snapshot.
14/02/20 11:50  	INFO  	Indexing of "shares_0" done. 33604 filesystem lookups 0 db lookups and 0 db updates
14/02/20 11:50  	DEBUG  	avocet: Doing backup with hashed transfer...
14/02/20 11:50  	INFO  	avocet: Loading file list...
14/02/20 11:50  	DEBUG  	avocet Starting incremental backup...
14/02/20 11:50  	INFO  	avocet: Calculating file tree differences...
14/02/20 11:50  	INFO  	Waiting for metadata download stream to finish
14/02/20 11:50  	ERROR  	Error starting file metadata download thread
14/02/20 11:50  	ERROR  	Backup had an early error. Deleting partial backup.
14/02/20 11:50  	WARNING  	Exponential backoff: Waiting at least 1h 20m before next file backup

The client now showed:

root@avocet:~# lvs
  LV                                                             VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root                                                           vg0  owi-aos---  13.97g
  swap                                                           vg0  -wi-ao---- 952.00m
  urbackup_snap_8a15686e1ff0af5c33d27d36b0aba7ad7bb3d2e178bb26d2 vg0  swi-a-s---   2.55g      root   72.44
  urbackup_snap_e9dee62ea3cf7cc6d38d974bfe143b6fdb133e97179a6571 vg0  swi-aos---   1.27g      root   2.43

Hmm, to investigate further next time:

  • Could be dmeventd has the lock… try listing file locks with lslocks to check. Then kill/restart dmeventd.
  • If lvcreate hangs in the kernel running cat /proc/PID/stack can give an idea what it does (sometimes)
  • If lvcreate hangs in userspace, one option is to install the debug symbols of lvcreate, and gdb then run gdb then attach PID then thread apply all bt
  • Trace the system calls lvcreate does… modify /usr/local/share/urbackup/lvm_create_filesystem_snapshot and add strace -f before lvcreate and redirect it into a file (append > /var/log/strace.log 2>&1 to the lvcreate line)

I checked the client syslog and found:

Feb 14 06:57:29 avocet systemd-udevd[25425]: timeout '/sbin/dmsetup splitname --nameprefixes --noheadings --rows vg0-urbackup_snap_8a15686e1ff0af5c33d27d36b0aba7ad7bb3d2e178bb26d2-cow'
Feb 14 06:57:29 avocet systemd-udevd[25425]: timeout '/sbin/dmsetup udevcomplete 8298815'
Feb 14 06:57:30 avocet systemd-udevd[25425]: timeout: killing '/sbin/dmsetup udevcomplete 8298815' [25448]
Feb 14 06:57:30 avocet systemd-udevd[25425]: '/sbin/dmsetup udevcomplete 8298815' [25448] terminated by signal 9 (Killed)
Feb 14 06:57:38 avocet lvm[2282]: Monitoring snapshot vg0-urbackup_snap_8a15686e1ff0af5c33d27d36b0aba7ad7bb3d2e178bb26d2

Looking for old lvm error reports I found deadlock during lvm snapshot which references Re: [linux-lvm] deadlock during lvm snapshot and also debian Wiki LVM contains the comment

There are some caveats when creating LVM snapshots on Debian with udev, see 343671

I don’t know if that debian bug relates directly but debian jessie too old.

I will finish decommissioning the client vm as soon as I can get to it and I could stop urbackup snapshots till I do.