UrBackup server says 'No space left on device (code: 28)'

UrBackup Server 2.0.36.

I’ve had this set up running for a couple of months without any issues. Yesterday I ran upgrades to the OS and rebooted.

Since then, the message “No space left on device (code: 28)” has appeared at the top-left of the status screen.

I have UrBackup running on a container within Proxmox on a BTRFS volume. I’ve run a number of commands to verify the used and free space on the drive and everything tells me that there isn’t an issue.

There are two important volumes here: /backup (btrfs), which holds the backup storage, and /backupmeta (ext4), which holds the metadata for the backups.

/backup is made up of two disks.

From within the container (Ubuntu 16.04):

$ df -h
Filesystem                                Size  Used Avail Use% Mounted on
/dev/loop0                                 20G  2.3G   17G  12% /
/dev/mapper/backup_meta-vm--100--disk--1  246G   31G  204G  13% /backupmeta
none                                      492K     0  492K   0% /dev
/dev/sdc1                                 7.3T  2.4T  4.9T  33% /backup
tmpfs                                      16G     0   16G   0% /dev/shm
tmpfs                                      16G  8.3M   16G   1% /run
tmpfs                                     5.0M     0  5.0M   0% /run/lock
tmpfs                                      16G     0   16G   0% /sys/fs/cgroup
tmpfs                                     3.2G     0  3.2G   0% /run/user/2001
$ sudo btrfs fi usage /backup
Overall:
    Device size:                   7.28TiB
    Device allocated:              2.44TiB
    Device unallocated:            4.84TiB
    Device missing:                7.28TiB
    Used:                          2.38TiB
    Free (estimated):              4.84TiB      (min: 2.42TiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
Data,RAID0: Size:2.16TiB, Used:2.16TiB
   /dev/sdc1       1.08TiB
   /dev/sdd1       1.08TiB
Metadata,RAID1: Size:139.00GiB, Used:109.75GiB
   /dev/sdc1     139.00GiB
   /dev/sdd1     139.00GiB
System,RAID1: Size:8.00MiB, Used:192.00KiB
   /dev/sdc1       8.00MiB
   /dev/sdd1       8.00MiB
Unallocated:
   /dev/sdc1       2.42TiB
   /dev/sdd1       2.42TiB

The global soft filesystem quota is set to 80%, so that shouldn’t be the issue here either. I’m at a loss - can anyone help?

Maybe your BTRFS is mounted read-only or not mounted at all. This can happen after a reboot if mounting the BTRFS volume takes too long.
You should also check /var/log/syslog or /var/log/kern.log for BTRFS messages.
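
For example, something along these lines (assuming the volume is mounted at /backup as shown above) will tell you whether it has gone read-only and whether the kernel has logged any BTRFS errors:

# check that /backup is still mounted and whether it is read-only (look for "ro" in OPTIONS)
$ findmnt -o TARGET,SOURCE,FSTYPE,OPTIONS /backup

# look for BTRFS messages in the kernel log and syslog
$ dmesg | grep -i btrfs | tail -n 20
$ sudo grep -i btrfs /var/log/syslog | tail -n 20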

Regards,

It’s a btrfs bug. See here for example for others having the same problem: https://www.spinics.net/lists/linux-btrfs/msg51046.html

I guess you have tried rebooting?

Hi Uroni,

Yes that’s the issue. I’ve just rebooted and it has fixed the problem for the time being. That is quite a bug!

I was planning to use BTRFS on our production servers and now I’m not so certain.

I naively jumped onto the BTRFS bandwagon some time ago, then quickly regretted it.

It isn’t that it has bugs. All software has bugs. Bugs can be fixed. The problem is that BTRFS is fundamentally flawed at the concept level, and that is not something that can simply be patched.

The complexity of BTRFS is so extreme that even its own developers have been unable to produce a completely functional fsck utility for it, and indeed don’t even seem to know if it’s possible to do so, after many years of development.

BTRFS also becomes heavily fragmented … by design, then eventually attempts to rebalance the filesystem, thrashing the disk, consuming vast amounts of RAM, and slowing the system to a crawl in the process.

On my machine the problem became so bad that at one point I was effectively locked out, completely unable to do anything for an entire day, while the system corrected itself, and sadly this correction was only fleeting, as the problem returned the next day.

This same “feature” is apparently what is also responsible for your mysterious loss of disk capacity. The mechanism by which BTRFS maintains free space remains largely a mystery, barely understood even by its developers, and completely unmanageable by its users.

I find it hard to conceive of any scenario in which BTRFS would actually be usable, much less useful. It certainly should never be used in any mission critical application like backup.

ZFS is a reasonable alternative, if you can live with its vast RAM overhead and its GPL-incompatible license (although the latter is moot if you’re running FreeBSD). Otherwise I would just stick to ext4.



Hi Homer,

After some further testing, I fear you are correct.

The issue reappeared yesterday and a further two reboots were needed to fix it. It’s highly frustrating. I’m wondering what I will lose by moving away from BTRFS. Am I right in thinking that I will require more storage to run UrBackup?

Is there another filesystem that you would recommend? I was originally planning on rolling BTRFS out to all of our production systems to easily perform snapshots when backing up, but instead I’ve switched to using EXT4 on LVM. I can’t risk these sorts of issues taking down our production systems.
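
(For reference, the snapshot workflow I have in mind with EXT4 on LVM is roughly the following; the volume group, LV and snapshot names are just placeholders:)

# create a temporary snapshot of the logical volume holding the data
$ sudo lvcreate --snapshot --size 5G --name data_snap /dev/vg0/data

# mount it read-only, run the backup against it, then discard it
$ sudo mkdir -p /mnt/snap
$ sudo mount -o ro /dev/vg0/data_snap /mnt/snap
# ... back up /mnt/snap ...
$ sudo umount /mnt/snap
$ sudo lvremove -y /dev/vg0/data_snap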

If you have plenty of RAM and want deduplication, use ZFS. Otherwise use ext4 + LVM.
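
As a rough sketch (pool name, device names and mountpoint are placeholders), a deduplicating ZFS backup store would look something like this:

# create a mirrored pool from two disks
$ sudo zpool create tank mirror /dev/sdc /dev/sdd

# dedup and compression are per-dataset properties
$ sudo zfs create -o dedup=on -o compression=lz4 -o mountpoint=/backup tank/urbackup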

Alternatively you could try ddumbfs, which does deduplication but is less resource hungry.

Consider using XFS instead of EXT4. XFS will get reflink capability soon, which UrBackup will use.
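
Roughly speaking, reflink support has to be enabled when the XFS filesystem is created and can then be used for copy-on-write file copies (the device name is just an example, and the feature is still marked experimental in current xfsprogs):

# create the filesystem with reflink support enabled
$ sudo mkfs.xfs -m reflink=1 /dev/sdc1

# a reflink copy shares data blocks with the original until one side is modified
$ cp --reflink=always bigfile bigfile.copy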


I’ll have to write a blog post about btrfs, but the bottom line is that you need to be aware of the Linux kernel version you run. The main devs run it in production at Facebook, but they use it as an HDFS storage node (so no compression or multi-device features) and they backport the latest btrfs code to the Linux version they are currently using. So what they run is roughly a 4.9 equivalent currently.
Specifically the ENOSPC handling got significantly rewritten with Linux kernel 4.8.
SuSE supports production use of a certain subset of btrfs (as the root fs), which did not include compression last time I checked, and they are, I think, using 4.4.x. They went the route of outright disabling features that they do not support or that do not work, like btrfs RAID5.
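
If in doubt, it only takes a second to check which kernel and btrfs-progs versions you are actually running:

# running kernel (the ENOSPC rework landed in 4.8)
$ uname -r

# userspace btrfs tools
$ btrfs --version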

ZFS also becomes heavily fragmented by design, and, contrary to btrfs, it does not have a defragmentation utility (and never will). On the plus side, ZFS already has built-in cache device options to improve performance with heavily fragmented file systems.
ZFS also doesn’t even have a fsck utility, because the assumption is that it would run so long you might as well restore from a backup (and it breaking is unlikely because of device redundancy and ECC RAM on the machine). The same holds for all larger file systems: once a filesystem becomes large enough, fsck is infeasible. XFS is getting an online fsck/scrub like btrfs as well.
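
(For example, an SSD can be added to an existing pool as a read cache, and removed again, with a single command; the pool and device names here are hypothetical:)

# add an SSD partition as an L2ARC read cache to an existing pool
$ sudo zpool add tank cache /dev/nvme0n1p1

# it can be detached again at any time
$ sudo zpool remove tank /dev/nvme0n1p1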

Regarding BTRFS and kernel versions: I’m running 4.4.16-1. This issue seems quite serious to me though. The server isn’t under any great load and, day to day, only a few GB are written to disk. Experiencing this problem twice in a couple of weeks is quite troubling, and having “reboot until it works” as the only real solution is very poor.

We used to use XFS instead of EXT3, but switched to EXT4 once that was available. I always found XFS very solid, although if I recall correctly it did need defragmenting occasionally.

Perhaps I’ll experiment with ZFS, although the fragmentation issue worries me slightly. We have about 4TB backed up - not sure of the number of files, a few million I expect - but the backup server currently has 32GB of RAM so it should be able to handle it.

Thanks everyone - very useful.

We’ve been running BTRFS for a few months on Debian 8, initially with the backported v4.6 kernel and btrfs-progs / tools.
A month ago, the kernel and btrfs-progs / tools were upgraded to the latest 4.7.

Currently we don’t have any reliability issues using RAID0 and zlib compression, only some high I/O wait, which I think we’re going to resolve with a BTRFS balance.
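
Concretely, we are planning something like the following, which repacks only chunks that are mostly empty and is therefore much less disruptive than a full balance (the usage threshold is just an example):

# rebalance only data chunks that are less than 50% full
$ sudo btrfs balance start -dusage=50 /backup

# progress can be checked with
$ sudo btrfs balance status /backup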

Before using BTRFS, we tried backing up with VHDz on an EXT4 partition. That was working fine, but synthetic full backups took a very long time to rebuild and needed more space. BTRFS has the advantage of incremental-forever backups and inline compression, which saves a lot of disk space.
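
(Inline compression is just a mount option; for example, an /etc/fstab entry along these lines, with a placeholder UUID, enables zlib compression for new writes:)

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /backup  btrfs  defaults,compress=zlib  0  0

# or on an already-mounted filesystem (affects data written from then on)
$ sudo mount -o remount,compress=zlib /backup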

Maybe we’ve made a bad choice with BTRFS, but so far we haven’t encountered any problems.

Regards,

I haven’t been following recent developments in btrfs, mainly because I’d written it off as a lost cause, but according to the patchset and comments I’m looking at, the ENOSPC rewrite seems to be entirely about boosting performance by queueing flushes to an asynchronous handler, without actually addressing the more fundamental problem of the large I/O overhead intrinsic to btrfs, nor its unpredictable and unmanageable space reservation.

In any case, I understand that no filesystem will ever be perfect, but the entire concept of btrfs is far too haphazard for my tastes, and I’m not very keen on gambling with my irreplaceable data.

However, just out of curiosity, I would be interested in updated benchmarks to see if, and by how much, the abysmal performance of btrfs has improved, even if its more fundamental problems haven’t.