Steps to perform after BTRFS error on server?

A little while ago, my Debian/BTRFS system (a virtual machine) crashed with a kernel panic that mentioned BTRFS. This is the only time the server has crashed (it has never had a power failure, host crash, etc.). I didn’t think much about it at the time; I just updated Debian to the latest version, and things seemed to work fine after that.

Today I moved the server to a new host and data store (the old and new hosts both have ECC and hardware RAID6 with no errors), and I decided that since the server was offline, maybe it’d be a good time to do a btrfs check on the data drive to make sure there aren’t any issues before bringing it back online (I was still a bit concerned from the original kernel panic crash).

Here’s the error that came from the check:

My questions are:

  1. What would be the recommended steps to repair the file system?
  2. What do I need to do to confirm file & image backups are complete / not corrupted?
  3. What can be done to prevent/mitigate file system damage from a hard crash in the future?
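For question 2, one generic way to sanity-check a file-level backup is to hash every file on the source and compare it with the copy in the backup. Below is a minimal sketch of that idea, assuming both trees are reachable as ordinary directories; the function names `sha256_of` and `compare_trees` are made up for illustration. It only shows that the backup matches what the source returns today, so it cannot tell whether the source itself was already damaged when the backup was taken.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root, backup_root):
    """Compare every regular file under source_root with its backup copy.

    Returns a list of (relative_path, problem) tuples, where problem is
    "missing" or "mismatch". An empty list means every source file has
    an identical copy in the backup.
    """
    source_root, backup_root = Path(source_root), Path(backup_root)
    problems = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue  # skip directories and other non-regular entries
        rel = src.relative_to(source_root)
        dst = backup_root / rel
        if not dst.is_file():
            problems.append((str(rel), "missing"))
        elif sha256_of(src) != sha256_of(dst):
            problems.append((str(rel), "mismatch"))
    return problems
```

Calling `compare_trees("/mnt/data", "/mnt/backup")` (both paths hypothetical) returns an empty list when nothing is missing or different. Files that exist only in the backup are not reported.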

I originally had the “sync” mount option enabled on the drive, but ended up disabling it due to really bad performance… maybe that was a bad idea? At the moment, there isn’t any issue with mounting the drive, so the corruption doesn’t seem that bad. Most of the articles I’ve seen online deal with unmountable filesystems, so I’m hesitant to implement fixes that may or may not be applicable to my situation.

I took a snapshot of this VM, so I’m free to experiment with various repair methods…

So, it turns out that the filesystem errors may have been benign. While running btrfs check, I noticed it was using a lot of memory. I also noticed that while the filesystem was mounted, btrfs was doing some kind of cleanup in the background.

I bumped the RAM allocation up to 64GB, mounted the filesystem, and let the background cleanup finish until there was no more disk I/O. When I reran the check afterwards, it consumed about 40GB of memory but didn’t report any errors. So maybe it’s fine after all.

By the way, the way to check a btrfs filesystem is btrfs scrub. btrfs check is for when you need to repair a broken filesystem, which should not happen as long as there are no btrfs bugs and you use ECC RAM.
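To make the two tools concrete, here is a small sketch that builds the command lines and runs them from Python. The mount point and device path in the usage note are placeholders, and the helper names are made up for illustration. As far as I know, `btrfs scrub start -B -d` runs the scrub in the foreground on a mounted filesystem and prints per-device statistics, while `btrfs check --readonly` inspects an unmounted device without writing to it.

```python
import subprocess


def scrub_command(mount_point):
    """Build a foreground scrub command for a mounted btrfs filesystem."""
    # -B: do not background, wait for the scrub to finish
    # -d: print statistics per device
    return ["btrfs", "scrub", "start", "-B", "-d", mount_point]


def check_command(device):
    """Build a read-only check command for an unmounted btrfs device."""
    # --readonly makes the no-writes behaviour explicit
    return ["btrfs", "check", "--readonly", device]


def run_and_report(cmd):
    """Run a command and return (exit_code, combined_output)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr
```

For example, `run_and_report(scrub_command("/mnt/data"))` would scrub a filesystem mounted at the hypothetical path `/mnt/data`, and `run_and_report(check_command("/dev/sdb1"))` would check the hypothetical device `/dev/sdb1` after it has been unmounted. Both normally need root.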