I'm very happy with the UrBackup solution, and each time I create a topic here I'd like to thank you for the brilliant software.
Description:
Recently I have observed an increased number of crashes in urbackup-server 2.5.31. They occur in different situations, and I'll try to cover them all here, since they might be connected.
Environment: CentOS 7, x86_64, server with 4 GB RAM. The backup folder is on ZFS. Backups are of Windows 10 and Linux CentOS 7 machines, both file and image backups. 11 clients in total, all running 2.5.23.
Here are the situations in which it crashes:
Immediately on startup:
This usually results in errors like these in the log:
Aug 07 21:50:13 apparel urbackupsrv[26867]: New Backupclient: Client1
Aug 07 21:50:13 apparel urbackupsrv[26867]: New Backupclient: Client3
Aug 07 21:50:13 apparel urbackupsrv[26867]: New Backupclient: Client4
Aug 07 21:50:13 apparel urbackupsrv[26867]: Sending Identity to client "Client1" failed. Retrying soon...
Aug 07 21:50:13 apparel urbackupsrv[26867]: Sending Identity to client "Client2" failed. Retrying soon...
Aug 07 21:50:13 apparel urbackupsrv[26867]: Sending Identity to client "Client3" failed. Retrying soon...
Aug 07 21:50:13 apparel urbackupsrv[26867]: urbackupsrv: basecode.cpp:184: static void CryptoPP::BaseN_Decoder::InitializeDecodingLookupArray(int*, const byte*, unsigned int, bool): Assertion `lookup[alphabet[i]] == -1' failed.
Aug 07 21:50:13 apparel urbackupsrv[26867]: urbackupsrv: basecode.cpp:184: static void CryptoPP::BaseN_Decoder::InitializeDecodingLookupArray(int*, const byte*, unsigned int, bool): Assertion `lookup[alphabet[i]] == -1' failed.
Restarting the service usually helps; sometimes I need to restart it twice.
Server crashes during backup execution, i.e. normal operation. Previously this was a very rare case; now it is more frequent. Here is the log output, but it's not very helpful:
Sep 17 11:09:53 apparel urbackupsrv[24566]: Metadata of "api-ms-win-crt-conio-l1-1-0.dll" is missing
Sep 17 11:09:53 apparel urbackupsrv[24566]: Metadata of "api-ms-win-crt-convert-l1-1-0.dll" is missing
Sep 17 11:10:12 apparel kernel: fbackup main[22813]: segfault at 7fd4023f7800 ip 00007fd4023f7800 sp 00007fd377fe93f8 error 15
Sep 17 11:10:12 apparel systemd[1]: urbackup-server.service: main process exited, code=killed, status=11/SEGV
Sep 17 11:10:12 apparel systemd[1]: Unit urbackup-server.service entered failed state.
Sep 17 11:10:12 apparel systemd[1]: urbackup-server.service failed.
The backup server goes to sleep when there is no activity planned or existing backups are stalled (for example, when clients are asleep). In 100% of cases, the server crashes on resume. I tried to reproduce the issue by sending the host to sleep for different amounts of time, but have not been able to find exact steps to reproduce it yet. It seems that for a successful reproduction I need at least one of the backups to be stalled.
The error logs are the same as above; it's an 11/SEGV signal.
What I have tried:
Tested memory on the server with memtest
Tried to roll back to version 2.5.30, but the issues were still there
I would be glad to provide any logs or execute any steps in order to pin this down.
checking for crypto++ version >= 5.1... yes (version 5.6.2, installed)
5.6.2 is the latest available for CentOS 7 out of the box; the fix was introduced in version 6+. I installed version 8.8.0 and restarted a few times over the last few days, and it seems that this issue is gone.
The periodic crashes during backups remain, though.
Core dumps are enabled, but I cannot find how to compile urbackup with debug symbols. The links provided only contain information on how to compile the normal version. Could you please help with the flags?
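For reference, with a standard autotools source build (which, as I understand it, is how the urbackup server tarball builds), debug symbols come from passing `-g` through CFLAGS/CXXFLAGS to configure and not stripping on install. A sketch, with the tarball name assumed:

```shell
# Hypothetical sketch: autotools build that keeps debug info.
tar xf urbackup-server-2.5.31.tar.gz
cd urbackup-server-2.5.31

# -g emits DWARF debug info; a low optimization level such as -O1 (or -O0)
# keeps local variables inspectable in gdb instead of <optimized out>.
./configure CFLAGS="-g -O1" CXXFLAGS="-g -O1"
make -j"$(nproc)"

# Plain `make install` does not strip binaries; avoid `make install-strip`,
# which would remove the symbols again.
sudo make install
```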
Here is some additional data from a recent segfault; I hope it helps.
2023-09-27 04:15:30: Metadata of "file1" is missing
2023-09-27 04:15:30: Metadata of "file2" is missing
2023-09-27 04:15:30: Some metadata was missing
2023-09-27 04:15:30: Number of copied file entries from last backup is 26430
2023-09-27 04:15:31: Looking for old Sessions... 220 sessions
2023-09-27 04:15:31: Session timeout: Session IQnbvqkQ5OmYFFXW3BI1tfivOSY18B
2023-09-27 04:15:33: Client disconnected while backing up. Copying partial file...
2023-09-27 04:15:33: Syncing file system...
2023-09-27 04:15:34: Transferred 59.3753 MB - Average speed: 13.568 KBit/s
2023-09-27 04:15:34: 1.65149 TB of files were already present on the server and did not need to be transferred
2023-09-27 04:15:34: Script does not exist urbackup/post_incr_filebackup
[Thread 0x7fff537f6700 (LWP 18407) exited]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff5effd700 (LWP 16757)]
0x00007fff64157310 in ?? ()
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 systemd-libs-219-78.el7_9.7.x86_64 xz-libs-5.2.2-2.el7_9.x86_64
(gdb) bt full
#0 0x00007fff64157310 in ?? ()
No symbol table info available.
#1 0x00000000005ddf30 in FileClientChunked::~FileClientChunked (this=0x7fff640bbb50, __in_chrg=<optimized out>) at urbackupcommon/fileclient/FileClientChunked.cpp:99
No locals.
#2 0x0000000000807802 in ServerDownloadThreadGroup::~ServerDownloadThreadGroup (this=0x7fff640b8aa0, __in_chrg=<optimized out>) at urbackupserver/ServerDownloadThreadGroup.cpp:90
i = 2
#3 0x000000000085e421 in std::auto_ptr<ServerDownloadThreadGroup>::~auto_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/backward/auto_ptr.h:170
No locals.
#4 0x000000000085229a in IncrFileBackup::doFileBackup (this=0x7fff64089b20) at urbackupserver/IncrFileBackup.cpp:535
incr_backup_starttime = 998625623
changelevel = 3
no_backup_dirs = false
phash_load_offline = true
clientlist = 0x7fff641101f0
resumed_full = <optimized out>
fc_chunked = {_M_ptr = 0x7fff640bbb50}
readd_file_entries_sparse = false
num_copied_file_entries = 26430
r_offline = true
list_parser = { state = FileListParser::ParseState_Type, #
....
This is not a full stack trace, unfortunately, as gdb hung for more than 12 hours at 100% CPU on that list_parser, so I had to abort it.
Also, it seems I have a sequence of steps that leads to this behavior. Going forward I'll try to simplify them, but this is the current state:
Set up 2 concurrent tasks to be runnable for the Internet client. The client is running Windows 10.
Start or resume a full image + full file task. This usually leads to only 1 task actually being executed, though - in my case, the image task. The file task stays at 0% while the image task is running.
In the middle of the process, send the client to sleep. Firewalling it off would probably also work, but in the real-world scenario the client goes to sleep.
After a certain time (~30 mins), the server has no load and goes to sleep.
On wake-up (4 hours later in this case), both client and server are online, so the previous tasks resume.
When the client connects and actually resumes the task, the segfault occurs. It occurs in 100% of cases - I have never been able to resume a backup from exactly this Internet client after a sleep.
It looks like the issue occurs on the destruction of a mutex that doesn't exist.
I will try to narrow those steps down so they are easier to reproduce, and I will also try running 1 task in parallel by default to see if the issue resolves.
Did you configure the number of file download threads to more than one? If yes, I'd guess this won't be the last problem you'll run into if you test this…
In this instance it might be an uninitialized value. So ServerDlThread dl_thread = {}; instead of ServerDlThread dl_thread; in ServerDownloadThreadGroup.cpp should fix it.
Edit: or FileClientChunked* curr_fc_chunked = nullptr;
I made those changes, recompiled, and am testing this; I will post results later. I was not able to set FileClientChunked* curr_fc_chunked = nullptr; since nullptr is a C++11 feature and the build on CentOS 7's gcc 4.8 doesn't enable it, so I set it to NULL instead.
Meanwhile, I am testing something else: I've set "Beta: Number of parallel file download threads per file backup" to 80 for one of the clients. The reason is that this client has high latency to the server (120+ ms), and single-threaded it uploads at about 5 Mbit/s. Changing this setting to 80 resulted in:
~37 threads spawned on the client side (checked with Process Explorer; this is a Windows 10 machine)
13 upload threads spawned, uploading simultaneously; 16 connections in total (according to Process Explorer)
The server has spare capacity (CPU, RAM, bandwidth), and so does the client. Upload averages 50 Mbit/s.
Expected result: 80 upload threads spawned, for an upload of around 400 Mbit/s.
Actual result: 13 upload threads spawned, for an upload of around 50 Mbit/s.
Questions:
Am I understanding this setting correctly? My understanding is that it should spawn 80 upload threads and that it is meant to address exactly my issue: slow single-threaded upload speed. Or is it intended for something else?
If it is, is there any similar setting for image backups?
I do not observe speeds higher than 5 Mbit/s per backup process with default settings. That means when I run only a file backup I get 5 Mbit/s; if I run an image + file backup I observe 10 Mbit/s.
How many download connections does the server open when pipelining those requests?