BTRFS very high iowait (even when the urbackup service is stopped). Please help!

Hi!

I have a UrBackup server used for internet backups only.
It has 70 clients.

The problem is very high iowait, 25-65%, and write speeds of only around 8-15 MB/s!!! Because of this, backups don't finish in time.
Mounting the btrfs filesystem takes 5 minutes!

Even if I stop the urbackup service, iowait stays at 25-30% and sequential write speed is still around 8-15 MB/s.

I read that when btrfs has a large number of snapshots, the file system starts to slow down a lot. But snapshots are exactly what is used to create the backups.
I have about 80 snapshots for each client, i.e. more than 1000 snapshots in total.
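A rough way to count them (using the mount path from the df output below) would be:

# Count btrfs subvolumes/snapshots on the backup storage
sudo btrfs subvolume list /media/BACKUP/urbackup | wc -l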

I tried everything I could, but I couldn't find a solution. Please help!
Has anyone had a similar problem, and how did you solve it? I will be happy to hear any suggestions.

Backups are created every day. For each client:
Minimal number of incremental file backups: 40, maximal: 100
Minimal number of incremental image backups: 20, maximal: 40

The storage size is 25 TB:
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vgurbackup-lvurbackup1 25T 20T 5.5T 78% /media/BACKUP/urbackup

While I personally use btrfs as my main filesystem on multiple linux machines (including my urbackup server) I have found that it has many degenerate performance issues regarding cpu usage and I/O scheduling under heavy load in combination with large numbers of snapshots. My only generic suggestion to you would be to drastically reduce the number of snapshots per client and let the urbackup server clear those out over time to see if you can recover some of your lost performance.

Depending on your chosen Linux distribution, you might want to follow the latest 5.10 LTS kernel which is supposed to contain numerous btrfs related improvements. However, be warned that the current 5.10 kernel also has several btrfs related performance regression issues so definitely do not rush to upgrade even if you are capable of doing so.

I bet it’s about the same problem as

If you use iotop you’ll probably see btrfs-cleaner doing a lot of io deleting snapshots.
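For example, something like this should show only the processes and kernel threads that are actually doing I/O, with totals accumulated since start (assuming iotop is installed):

# -o: only show tasks doing I/O, -P: per process, -a: accumulated totals
sudo iotop -o -P -a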

UrBackup doesn't use the kind of operations that don't scale well with a (moderate) amount of snapshots (e.g. btrfs send/balance etc.), so that isn't the problem (except perhaps if your storage is more than say 90% full).

Your storage simply isn’t fast enough (w.r.t. random IOPS) to keep up with the amount of backups/data changes + deletion you want it to do.
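If you want to see what the disks can actually sustain, a quick random-write test with fio (if it is installed) would look roughly like this; the scratch file name is just an example and the test writes a few GB of throwaway data:

# Random 4k write test against the backup volume (writes a 4 GiB scratch file)
fio --name=randwrite-test --filename=/media/BACKUP/urbackup/fio-scratch \
    --rw=randwrite --bs=4k --size=4G --ioengine=libaio --iodepth=32 \
    --direct=1 --runtime=60 --time_based --group_reporting
rm /media/BACKUP/urbackup/fio-scratch   # remove the scratch file afterwards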

Thanks for your answers!

so that isn't the problem (except perhaps if your storage is more than say 90% full)

No, it's not more than 85% full.

If you use iotop you’ll probably see btrfs-cleaner doing a lot of io deleting snapshots.

I can't check it right now, but I have screenshots from glances. The most active processes are btrfs-endio.

I agree with @alyandon.
I have also found that it has a lot of performance degradation with large numbers of snapshots.
When I moved UrBackup to a new server with empty storage and started full file backups for 25 clients, the speed was very high (HDD write 60-150 MB/s, 400-500 Mbit/s) and iowait was no more than 10-30%, screenshot below. That was on 9 Nov 2020.

Everything was good and performance was normal. Around 23-25 December I saw that performance was very low.
As I understand it, over these 2 months UrBackup created a great many snapshots, and because of that btrfs started to work slowly.
The above is just my theory, perhaps it is wrong. Please confirm whether I'm right or wrong.

The question is: why did this happen after 2 months of use? Where is the critical point at which btrfs starts to work very slowly? And what can we do?

Screenshot with high performance:

You could try turning down the number of allowed simultaneous backups that run. Other than that I can’t think of anything else.

Edit: Assuming a recent kernel, you could also try the mq-deadline vs. bfq I/O schedulers. I actually see better overall iowait with btrfs under load on spinning rust with the bfq scheduler, which is not the default on Ubuntu.
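If you want to test it on a single disk first, something like this should work on a recent kernel (sda is just an example device, the change only lasts until reboot, and the bfq module may need to be loaded first with modprobe bfq):

# Show the available schedulers; the active one is in brackets
cat /sys/block/sda/queue/scheduler
# Switch this disk to bfq until the next reboot
echo bfq | sudo tee /sys/block/sda/queue/scheduler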

If we're getting into special tuning: make Linux wait longer before writing dirty data to disk. Add vm.dirty_expire_centisecs=8640000 to /etc/sysctl.conf. Maybe also tune vm.dirty_ratio and vm.dirty_background_ratio (e.g. 60 and 40). Mount btrfs with commit=86400. I'd only do this on machines that only run UrBackup, because other applications might need their data flushed to disk even if they don't explicitly say so.
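Roughly, that combination would look like the sketch below (the fstab line just reuses the mount from the df output above as an example):

# Add to /etc/sysctl.conf, then apply with: sudo sysctl -p
vm.dirty_expire_centisecs=8640000
vm.dirty_ratio=60
vm.dirty_background_ratio=40

# /etc/fstab: flush btrfs transactions only once a day (commit is in seconds)
/dev/mapper/vgurbackup-lvurbackup1 /media/BACKUP/urbackup btrfs defaults,commit=86400 0 2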

Is that safe for the backup data (data integrity on the backup storage)?
I mean, if the server lost power, as I understand it the data would not have been written and integrity would be corrupted. Or am I wrong?

@alyandon I use Debian 10. I checked, and by default it uses mq-deadline.

Why is the bfq scheduler better? Could you please explain it in more detail?

Thank you all!

mq-deadline just seemed to perform really poorly on my setups, so it was one of those things where switching to the bfq I/O scheduler appeared to result in fewer severe iowait storms (other processes generally remained responsive).

My setups are commodity setups (non-RAID controller, SATA-connected 7200 rpm HDDs), so I can't guarantee it'll help you, but it doesn't hurt anything to try it out either. I run the following script on startup to set the scheduler for the sd? block devices on my systems:

#!/bin/sh
# Switch every sd* block device to the bfq I/O scheduler
for i in /sys/block/sd*/queue/scheduler ; do echo "bfq" > "$i"; done
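Alternatively, a udev rule should achieve the same thing without a startup script (the file name is arbitrary):

# /etc/udev/rules.d/60-ioscheduler.rules: use bfq for all sd* disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"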

Hello! Same issue here. On some BTRFS servers we added the x-systemd.mount-timeout=600 option to the fstab entry:
UUID=51b34f7e-dfaa-4ae4-bcf1-41d5028f6afc /mnt/storage btrfs defaults,x-systemd.mount-timeout=600 0 2

We also switched several servers to ZFS, and there are no issues yet. We really didn't expect so many performance and stability issues as on the BTRFS servers, no matter how many backup clients there are, 10 or 50.

The 5.x kernels promised dramatic BTRFS performance enhancements starting with 5.9, but with the 5.10 kernel we got the same issues. Maybe the changes will become noticeable in the future.

P.S. I have been working with UrBackup for more than 2 years and have collected some experience with it. In short, it is a better backup solution than software like Acronis/Veeam because it does the same tasks very well, apart from some specifics like the Hyper-V client or CBT (which are not free but work fine as well). I am currently working on a big article about all my experience with UrBackup and plan to publish it next month.

Hello!

As I understand it, very high iowait happens when the file system has many snapshots. I'm interested in your experience. How long have your ZFS servers been running? I mean, did these problems appear only after a month or more (because over time the file system accumulates many snapshots)?

Since you've tried both filesystems with the same parameters (number of clients, backup settings, etc.), which of them is faster and works better?

For example:

● urbackupsrv.service - LSB: Server for doing backups
     Loaded: loaded (/etc/init.d/urbackupsrv; generated)
     Active: active (running) since Wed 2020-09-30 08:15:11 EDT; 3 months 16 days ago
       Docs: man:systemd-sysv-generator(8)
    Process: 1149 ExecStart=/etc/init.d/urbackupsrv start (code=exited, status=0/SUCCESS)
      Tasks: 35 (limit: 9469)
     Memory: 3.4G
     CGroup: /system.slice/urbackupsrv.service
             └─1205 /usr/bin/urbackupsrv run --config /etc/default/urbackupsrv --daemon --pidfile /var/run/urbackupsrv.pid

Sep 30 08:15:11 vbackup....local systemd[1]: Starting LSB: Server for doing backups...
Sep 30 08:15:11 vbackup....local systemd[1]: Started LSB: Server for doing backups.

sudo zpool status

  pool: storage
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0 days 02:55:32 with 0 errors on Sun Jan 10 03:19:34 2021
config:

	NAME        STATE     READ WRITE CKSUM
	storage     ONLINE       0     0     0
	  sdb       ONLINE       0     0     0

errors: No known data errors

free -m

              total        used        free      shared  buff/cache   available
Mem:           7946        7338         125           1         482         353
Swap:          2047        2044           3

uptime

 06:41:09 up 107 days, 23:26,  0 users,  load average: 0.05, 0.05, 0.06

There are 17 clients. I also have a server with 55 clients that worked well for 4 months, but it was restarted 8 days ago due to hypervisor maintenance.

Besides that, you can find my experience with BTRFS here: High memory utilization - Server - UrBackup - Discourse