Server (OS) randomly freezing

Hello,
I have had this issue for a long time, but it has never occurred as often as it does now.

So the issue:
At random times, sometimes several times a day and sometimes days apart, my server freezes completely. SSH and ping stop responding.
I have to hard reset the server through the datacenter’s interface, and after the reboot everything is “fine” again.
The server has been hardware tested by the datacenter (it took a whole day, with lots of load testing) and no failures were found.
I disabled the UrBackup service for two weeks, and the server ran fine during that time (while still doing other FTP backup work). So to me it does seem to be an issue triggered by UrBackup.

Linux host06 5.10.0-32-amd64 #1 SMP Debian 5.10.223-1 (2024-08-10) x86_64 GNU/Linux
UrBackup Server v2.5.33.0
4 x 6 TB HDDs in RAID 5 (mdraid), with LVM on top of that, and Btrfs

What I have tried:

  • OS updates (thinking it was a Btrfs stability issue)
  • UrBackup updates
  • Allowed FTP only from certain hosts (thinking it might be some sort of attack, since I always saw failed logins right before the server crashed)
  • Moved the database to a separate USB stick (not the best for performance, but it rules out concurrency issues on the drives)
  • Looked at the console while the server was frozen, but there was nothing of importance there

The last UrBackup log lines before this morning’s crash are these (logging is in debug mode):

2024-09-22 09:39:04: Established internet connection. Service=0
2024-09-22 09:39:04: Referencing snapshot on "Client x" for path "backup_8" failed: FAILED
2024-09-22 09:39:04: Authed+capa for client 'Client x' (encrypted-v2, compressed-zstd, token auth) - 1 spare connections
2024-09-22 09:39:13: Authed+capa for client 'Client y' (encrypted-v2, compressed-zstd, token auth) - 1 spare connections
2024-09-22 09:39:22: LockForTransaction in CQuery::Execute Stmt: [INSERT INTO files (backupid, fullpath, hashpath, shahash, filesize, rsize, clientid, incremental, next_entry, prev_entry, pointed_to) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]

And the last syslog lines:

Sep 22 09:17:01 host06 CRON[227900]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep 22 09:25:01 host06 CRON[228102]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep 22 09:35:01 host06 CRON[228391]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

Does anybody know what I can try or change to get my server running stably, or how to get more information about the root cause?

Regards,
Stijn

Sounds like an out-of-memory (OOM) problem! Try using EarlyOOM to kill those pesky memory-hungry processes.
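
To check whether memory really is running out before a freeze, one option is to log a few figures from /proc/meminfo to disk every few seconds and look at the last entries after the next hard reset. Below is a minimal sketch in Python, not a finished tool. The log path, the sampling interval and the function names are all made up for illustration, and it assumes a kernel that reports MemAvailable (5.10 does).

```python
import os
import time
from datetime import datetime

# Hypothetical values, adjust to taste.
LOG_PATH = "/var/log/memwatch.log"
INTERVAL_SECONDS = 10

# Fields from /proc/meminfo that are most useful for spotting memory exhaustion.
FIELDS = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")


def parse_meminfo(text):
    """Turn the text of /proc/meminfo into a dict of {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(values, timestamp):
    """Build one log line; a field missing from the input is logged as -1."""
    fields = " ".join(f"{key}={values.get(key, -1)}kB" for key in FIELDS)
    return f"{timestamp} {fields}"


def main():
    """Append one sample per interval and force it to disk each time."""
    with open(LOG_PATH, "a") as log:
        while True:
            with open("/proc/meminfo") as meminfo:
                values = parse_meminfo(meminfo.read())
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(format_sample(values, stamp) + "\n")
            # Flush and fsync so the last samples survive a hard reset.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(INTERVAL_SECONDS)

# To start sampling, call main() as root, e.g. from a screen session.
```

If MemAvailable drops towards zero in the minutes before the log stops, memory pressure is a likely culprit. If it stays comfortably high right up to the last entry, the freeze probably has another cause.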