I'm very happy with the UrBackup solution, and each time I create a topic here I'd like to thank you for the brilliant software.
Description:
Recently I have observed an increased number of crashes in urbackup-server 2.5.31. They occur in different situations, and I'll try to cover them all here, since they might be connected.
Environment: CentOS 7, x86_64, server with 4 GB RAM. The backup folder is on ZFS. Backups are of Windows 10 and Linux CentOS 7 machines, both file and image backups. 11 clients in total, all running 2.5.23.
Here are the situations in which it crashes:
Immediately on startup:
This usually results in errors like these in the log:
Aug 07 21:50:13 apparel urbackupsrv[26867]: New Backupclient: Client1
Aug 07 21:50:13 apparel urbackupsrv[26867]: New Backupclient: Client3
Aug 07 21:50:13 apparel urbackupsrv[26867]: New Backupclient: Client4
Aug 07 21:50:13 apparel urbackupsrv[26867]: Sending Identity to client "Client1" failed. Retrying soon...
Aug 07 21:50:13 apparel urbackupsrv[26867]: Sending Identity to client "Client2" failed. Retrying soon...
Aug 07 21:50:13 apparel urbackupsrv[26867]: Sending Identity to client "Client3" failed. Retrying soon...
Aug 07 21:50:13 apparel urbackupsrv[26867]: urbackupsrv: basecode.cpp:184: static void CryptoPP::BaseN_Decoder::InitializeDecodingLookupArray(int*, const byte*, unsigned int, bool): Assertion `lookup[alphabet[i]] == -1' failed.
Aug 07 21:50:13 apparel urbackupsrv[26867]: urbackupsrv: basecode.cpp:184: static void CryptoPP::BaseN_Decoder::InitializeDecodingLookupArray(int*, const byte*, unsigned int, bool): Assertion `lookup[alphabet[i]] == -1' failed.
Restarting the service usually helps; sometimes I need to restart it twice.
Server crashes during backup execution, i.e. normal operation. Previously this was a very rare case; now it is more frequent. Here is the log output, but it's not very helpful:
Sep 17 11:09:53 apparel urbackupsrv[24566]: Metadata of "api-ms-win-crt-conio-l1-1-0.dll" is missing
Sep 17 11:09:53 apparel urbackupsrv[24566]: Metadata of "api-ms-win-crt-convert-l1-1-0.dll" is missing
Sep 17 11:10:12 apparel kernel: fbackup main[22813]: segfault at 7fd4023f7800 ip 00007fd4023f7800 sp 00007fd377fe93f8 error 15
Sep 17 11:10:12 apparel systemd[1]: urbackup-server.service: main process exited, code=killed, status=11/SEGV
Sep 17 11:10:12 apparel systemd[1]: Unit urbackup-server.service entered failed state.
Sep 17 11:10:12 apparel systemd[1]: urbackup-server.service failed.
The backup server goes to sleep when there is no activity planned or existing backups are stalled (for example, when clients are asleep). In 100% of cases, the server crashes on resume. I tried to reproduce the issue by sending the host to sleep for different amounts of time, but have not been able to find exact steps to reproduce it yet. It seems that for a successful reproduction I need at least one of the backups to be stalled.
The error logs are the same as above; it's an 11/SEGV signal.
What I have tried:
Tested memory on the server with memtest
Tried to roll back to version 2.5.30, but the issues were still there
I would be glad to provide any logs or execute any steps in order to pin this down.
checking for crypto++ version >= 5.1... yes (version 5.6.2, installed)
5.6.2 is the latest available for CentOS 7 out of the box; the fix was introduced in version 6+. I installed version 8.8.0 and restarted a few times over the last few days, and it seems that this issue is gone.
The periodic crashes during backups remain, though.
Core dumps are enabled, but I cannot find how to compile urbackup with debug symbols. The links provided only contain information on how to compile the normal version. Could you please help with the flags?
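For reference, with a standard autotools source build (which, as I understand it, is how the urbackup server tarball builds), debug symbols come from passing `-g` through CFLAGS/CXXFLAGS to configure and not stripping on install. A sketch, with the tarball name assumed:

```shell
# Hypothetical sketch: autotools build that keeps debug info.
tar xf urbackup-server-2.5.31.tar.gz
cd urbackup-server-2.5.31

# -g emits DWARF debug info; a low optimization level such as -O1 (or -O0)
# keeps local variables inspectable in gdb instead of <optimized out>.
./configure CFLAGS="-g -O1" CXXFLAGS="-g -O1"
make -j"$(nproc)"

# Plain `make install` does not strip binaries; avoid `make install-strip`,
# which would remove the symbols again.
sudo make install
```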
Here is some additional data from a recent segfault; I hope it helps.
2023-09-27 04:15:30: Metadata of "file1" is missing
2023-09-27 04:15:30: Metadata of "file2" is missing
2023-09-27 04:15:30: Some metadata was missing
2023-09-27 04:15:30: Number of copied file entries from last backup is 26430
2023-09-27 04:15:31: Looking for old Sessions... 220 sessions
2023-09-27 04:15:31: Session timeout: Session IQnbvqkQ5OmYFFXW3BI1tfivOSY18B
2023-09-27 04:15:33: Client disconnected while backing up. Copying partial file...
2023-09-27 04:15:33: Syncing file system...
2023-09-27 04:15:34: Transferred 59.3753 MB - Average speed: 13.568 KBit/s
2023-09-27 04:15:34: 1.65149 TB of files were already present on the server and did not need to be transferred
2023-09-27 04:15:34: Script does not exist urbackup/post_incr_filebackup
[Thread 0x7fff537f6700 (LWP 18407) exited]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff5effd700 (LWP 16757)]
0x00007fff64157310 in ?? ()
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 systemd-libs-219-78.el7_9.7.x86_64 xz-libs-5.2.2-2.el7_9.x86_64
(gdb) bt full
#0 0x00007fff64157310 in ?? ()
No symbol table info available.
#1 0x00000000005ddf30 in FileClientChunked::~FileClientChunked (this=0x7fff640bbb50, __in_chrg=<optimized out>) at urbackupcommon/fileclient/FileClientChunked.cpp:99
No locals.
#2 0x0000000000807802 in ServerDownloadThreadGroup::~ServerDownloadThreadGroup (this=0x7fff640b8aa0, __in_chrg=<optimized out>) at urbackupserver/ServerDownloadThreadGroup.cpp:90
i = 2
#3 0x000000000085e421 in std::auto_ptr<ServerDownloadThreadGroup>::~auto_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/backward/auto_ptr.h:170
No locals.
#4 0x000000000085229a in IncrFileBackup::doFileBackup (this=0x7fff64089b20) at urbackupserver/IncrFileBackup.cpp:535
incr_backup_starttime = 998625623
changelevel = 3
no_backup_dirs = false
phash_load_offline = true
clientlist = 0x7fff641101f0
resumed_full = <optimized out>
fc_chunked = {_M_ptr = 0x7fff640bbb50}
readd_file_entries_sparse = false
num_copied_file_entries = 26430
r_offline = true
list_parser = { state = FileListParser::ParseState_Type, #
....
This is not a full stack trace, unfortunately, as gdb hung for more than 12 hours at 100% CPU on that list_parser, so I had to abort it.
Also, it seems I have a sequence of steps that leads to this behavior. Going forward I'll try to simplify them, but this is the current state:
Set up 2 concurrent tasks to be runnable for the Internet client. The client is running Windows 10.
Start or resume a full image + full file task. This usually leads to only 1 task actually being executed, though - in my case, the image task. The file task stays at 0% while the image task is running.
In the middle of the process, send the client to sleep. Firewalling it off would probably also work, but in the real-world scenario the client goes to sleep.
After a certain time (~30 mins), the server has no load and goes to sleep.
On wake-up (4 hours later in this case), both client and server are online, so the previous tasks resume.
When the client connects and actually resumes the task, the segfault occurs. It occurs in 100% of cases - I have never been able to resume a backup from exactly this Internet client after a sleep.
It looks like the issue occurs on the destruction of a mutex that doesn't exist.
I will try to narrow those steps down so they are easier to reproduce, and I will also try running 1 task in parallel by default to see if the issue resolves.
Did you configure the number of file download threads to more than one? If yes, I'd guess this won't be the last problem you'll run into if you test this…
In this instance it might be an uninitialized value. So ServerDlThread dl_thread = {}; instead of ServerDlThread dl_thread; in ServerDownloadThreadGroup.cpp should fix it.
Edit: or FileClientChunked* curr_fc_chunked = nullptr;
I made those changes, recompiled, and am testing this; I will post results later. I was not able to set FileClientChunked* curr_fc_chunked = nullptr; since nullptr is a C++11 feature and the build on CentOS 7's gcc 4.8 doesn't enable it, so I set it to NULL instead.
Meanwhile, I am testing something else: I've set "Beta: Number of parallel file download threads per file backup" to 80 for one of the clients. The reason is that this client has high latency to the server (120+ ms), and single-threaded it uploads at about 5 Mbit/s. Changing this setting to 80 resulted in:
~37 threads spawned on the client side (checked with Process Explorer; this is a Windows 10 machine)
13 upload threads spawned, uploading simultaneously; 16 connections in total (according to Process Explorer)
The server has spare capacity (CPU, RAM, bandwidth), and so does the client. Upload averages 50 Mbit/s.
Expected result: 80 upload threads spawned, for an upload of around 400 Mbit/s.
Actual result: 13 upload threads spawned, for an upload of around 50 Mbit/s.
Questions:
Am I understanding this setting correctly? My understanding is that it should spawn 80 upload threads and that it is meant to address exactly my issue: slow single-threaded upload speed. Or is it intended for something else?
If it is, is there any similar setting for image backups?
I do not observe speeds higher than 5 Mbit/s per backup process with default settings. That means when I run only a file backup I get 5 Mbit/s; if I run an image + file backup I observe 10 Mbit/s.
How many download connections does the server open when pipelining those requests?