I have a server with a hardware RAID6 array (56 TB) formatted as Btrfs and mounted only for backups.
Over time I noticed that during reboot the server was hanging longer and longer at the message:
A start job is running for /media/backup
As you might guess, /media/backup is my backup mount point.
At first this delay was about 1 minute, then 3–4 minutes, and today the server failed to boot normally at all — it dropped into single-user mode.
Once in single-user mode, /media/backup turned out to be mounted and accessible, but rebooting resulted in the same hang again.
I had to change the mount options in /etc/fstab to:
/media/backup btrfs defaults,nofail 0 0
The server booted, but the disk was not mounted. Manually mounting it succeeded, but took about 6 minutes.
Logs from the manual mount:
Dec 9 13:52:31 backup systemd[1]: media-backup.mount: Mount process still around after SIGKILL. Ignoring.
Dec 9 13:52:31 backup systemd[1]: media-backup.mount: Failed with result 'timeout'.
Dec 9 13:52:31 backup systemd[1]: media-backup.mount: Unit process 688 (mount) remains running after unit stopped.
Dec 9 13:52:31 backup systemd[1]: Failed to mount /media/backup.
Dec 9 13:52:31 backup systemd[1]: Startup finished in 5.258s (kernel) + 4min 34.747s (userspace) = 4min 40.005s.
Dec 9 13:52:31 backup systemd[1]: media-backup.mount: Consumed 1.941s CPU time.
Dec 9 13:53:17 backup systemd[1]: Starting Download data for packages that failed at package install time...
Dec 9 13:53:17 backup systemd[1]: update-notifier-download.service: Deactivated successfully.
Dec 9 13:53:17 backup systemd[1]: Finished Download data for packages that failed at package install time.
As I understand it, the problem is related to the filesystem’s internal structure.
What can be recommended as preventative maintenance for a Btrfs filesystem in order to speed up mounting/unmounting?
When was the last time you balanced your btrfs array? Or have you done a traditional software RAID (mdadm) and then a btrfs filesystem on top of that?
If you let btrfs do the RAID, then it will adjust to the underlying hardware. If you do software RAID and then a btrfs partition or filesystem, btrfs will behave very differently.
In both cases, you will need to do some regular maintenance. If btrfs manages the RAID, you will need to run a regular balance; on any btrfs filesystem, you should also scrub regularly. Balance reallocates data into fewer, fuller chunks, generally cleaning things up, while scrub reads everything back and verifies it against checksums, repairing from redundancy where it can. However, if you don’t do this regularly, the cleanup operation can take months to finish. I’d recommend doing the metadata first (-musage) and limiting the amount inspected at a time (start at -dusage=1 and increase the threshold exponentially).
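A minimal sketch of that incremental approach, assuming the filesystem is mounted at /media/backup (the usage thresholds are illustrative):

# compact metadata chunks first
btrfs balance start -musage=10 /media/backup
# then data chunks, raising the usage threshold step by step
btrfs balance start -dusage=1 /media/backup
btrfs balance start -dusage=2 /media/backup
btrfs balance start -dusage=4 /media/backup

Each pass only touches chunks below the given percentage of usage, so the early passes are cheap and free up space for the later, heavier ones.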
No matter which you use, btrfs filesystems get exponentially worse the closer they get to full, but fullness is defined by btrfs, not by df -h. See btrfs filesystem usage and btrfs filesystem df to get information on what btrfs really thinks of your data.
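For example, on the array in question:

btrfs filesystem df /media/backup
btrfs filesystem usage /media/backup

The output breaks allocation down into data, metadata and system chunks, which is the fullness that matters here, rather than the raw free space that df -h reports.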
Unfortunately, at the beginning I simply didn’t know about this particular behavior of Btrfs.
Right now I’m trying to run a metadata rebalance on the array, but based on my calculations it may take around a month to finish. The system shows that it needs to rebalance 574 chunks out of 3464 considered, and it is processing them at roughly 20–25 chunks per day.
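For reference, the progress can be watched while the balance runs:

btrfs balance status -v /media/backup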
Rebalancing won’t make this faster (in fact, you do not need to rebalance in most cases). I’d just accept the ~6 minutes the mount takes and adjust systemd appropriately, as sketched below.
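For example, keeping nofail but giving the mount unit a generous timeout in fstab, in the same form as the entry above (the 15-minute value is an arbitrary illustration):

/media/backup btrfs defaults,nofail,x-systemd.mount-timeout=15min 0 0

With that, boot is not blocked if the mount is slow, but systemd no longer kills the mount process before it can finish.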
They have implemented a change in btrfs to make mounting faster (the block-group-tree feature); you could consider looking into using that.
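A sketch of converting an existing filesystem in place, assuming kernel 6.1+ and btrfs-progs 6.1+ (/dev/sdX stands in for the array’s device; note that older kernels will not mount the filesystem after conversion):

umount /media/backup
# moves the block group items out of the extent tree into a dedicated tree,
# which is what makes mounting large filesystems much faster
btrfstune --convert-to-block-group-tree /dev/sdX
mount /media/backup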