UrBackup 1G and above network performance not possible?

Ant · September 24, 2025, 9:05pm

Here is the iperf3 output of a test from local Windows 2025box to the Linux Urbackup server.

Connecting to host 10.38.1.12, port 5201

[ 5] local 10.38.1.10 port 59205 connected to 10.38.1.12 port 5201

[ ID] Interval Transfer Bitrate

[ 5] 0.00-1.01 sec 1.16 GBytes 9.93 Gbits/sec

[ 5] 1.01-2.01 sec 1.15 GBytes 9.87 Gbits/sec

[ 5] 2.01-3.01 sec 1.15 GBytes 9.91 Gbits/sec

[ 5] 3.01-4.02 sec 1.17 GBytes 9.91 Gbits/sec

[ 5] 4.02-5.01 sec 1.13 GBytes 9.90 Gbits/sec

[ 5] 5.01-6.01 sec 1.15 GBytes 9.91 Gbits/sec

[ 5] 6.01-7.01 sec 1.14 GBytes 9.78 Gbits/sec

[ 5] 7.01-8.01 sec 1.15 GBytes 9.89 Gbits/sec

[ 5] 8.01-9.01 sec 1.15 GBytes 9.91 Gbits/sec

[ 5] 9.01-10.01 sec 1.15 GBytes 9.89 Gbits/sec

- - - - - - - - - - - - - - - - - - - - - - - - -

[ ID] Interval Transfer Bitrate

[ 5] 0.00-10.01 sec 11.5 GBytes 9.89 Gbits/sec sender

[ 5] 0.00-10.01 sec 11.5 GBytes 9.89 Gbits/sec receiver

As you can see the links are fully saturated. Since the storage is U.2 based and these servers have 20 cores and 256G Ram they are easily capable of these sort of speeds.

Here is a log of a VM backup involving the same two host computers as in the speed test above.

Starting unscheduled full image backup of volume “SCSI:0:0”…

Basing image backup on last incremental or full image backup

Error retrieving last image backup. Doing full image backup instead.

Loading ZIP metadata from urbackup/mbr_data_y ADRoot Dc1–SCSI_0_0.zip

Server[y AdRoot DC1]: Loading MBR zip file…

Server[y AdRoot DC1]: Loaded MBR zip file (9.97168 KB)

Transferred 23.325 GB - Average speed: 813.435 MBit/s

Time taken for backing up client Server[y AdRoot DC1]: 4m 29s

Backup succeeded

I am interested in the 813MB/s statistic since this is 8% of the network capability. It’s a local copy, two machines next to each other. No compression, no encryption. So 92% of my network is unused in a backup scenario it seems. When I look at the disk reads they are very small. The disk is reading around 150Mb/s but only for a second or so and then it stops while presumably that chunk is transmitted to the server. The Disk is capable of reading at 3000Mb/s.

In frustration I pasted the ClientSend code into claude.ai in the context of the above problem and observations.

Looking at this ClientSend code, I can see some potential performance issues:

Key Problems:

Single-threaded sending: This appears to be a single sender thread that processes one buffer at a time from a queue. No parallel sending.
Small buffer management: Uses CBufMgr2(nbufs, bsize) - the buffer size (bsize) and number of buffers (nbufs) are likely too small for high-speed networks.
Synchronous writes with timeout: pipe->Write(item.buf, item.bsize, 60000, false) - the false parameter suggests synchronous writes, waiting for each buffer to complete before sending the next.
Queue-based architecture: Uses a simple queue with condition variables, which adds overhead vs direct streaming.
Frequent flushing: Calls pipe->Flush(-1) multiple times, which forces TCP to send data immediately rather than batching efficiently.

What’s missing for 10G performance:

No TCP_NODELAY/buffering optimization
No async I/O or overlapped operations
No multiple parallel streams
Likely small buffer sizes (typical defaults are 64KB-1MB, but 10G needs much larger)

The architecture is fundamentally designed for slower networks. For 10G speeds, you’d want:

Multiple sender threads
Large buffers (8-32MB)
Async/overlapped I/O
Proper TCP window scaling

This explains perfectly why you’re stuck at 800 Mbps - it’s a classic single-threaded, small-buffer network implementation.

I also had Claude.ai take a look at the Connector code and here is it’s response. . .

Looking at this Connector code, I can see more evidence of urbackup’s networking limitations:

Key networking issues in this code:

Simple socket implementation: Uses basic socket(), connect(), send(), recv() calls without any optimization
No TCP optimization: No evidence of:
- TCP_NODELAY to disable Nagle’s algorithm
- SO_SNDBUF/SO_RCVBUF for larger buffer sizes
- Window scaling options
- Any modern TCP performance tuning
Synchronous I/O: The recv() call in a simple while loop:

cpp

   int rc=recv(p, buffer, 1024, MSG_NOSIGNAL);

Small 1024-byte buffer
Blocking receive calls
No async/overlapped I/O

Basic protocol: Uses a simple text-based protocol with CTCPStack.Send() - likely not optimized for bulk data transfer
Single connection per operation: Creates one socket, sends command, receives response, closes socket - no connection pooling or persistent connections

This explains the 800 Mbps ceiling perfectly:

1024-byte receive buffer is tiny for 10G networks
No TCP tuning means default (conservative) TCP window sizes
Synchronous blocking I/O prevents overlapping network and disk operations
Text-based protocol has overhead vs binary protocols

This is classic “works fine on 100 Mbps networks” code that falls apart at 10G speeds. The architecture is fundamentally designed for low-speed networks and would need a complete rewrite to handle modern high-speed networking properly.

This a very frustrating to have burned well over a week of time into this, only to find this huge limitation in the design architecture. Sigh.

ffsb · September 28, 2025, 3:00am

not sure how easy it is for you to setup, but would it possible to perform the same test with 2 clients backing up at the same time?
Just curious to check if the the limit is 1GB per client and overall your 10GB ethernet server could backup 10 clients in parallel at a 10GB rate…

I am asking because we are about to rollout 10GB switch including on the urbackup server…

Ant · September 28, 2025, 6:44pm

It will take some time as I re-purposed the storage previously assigned to the UrBackupVM for the below backup data. The VM with URbackup server is still built but I will need to steal some storage from somewhere else to do it.

I posess non-formal self taught [read limited] dev skills and I found that I could make this project

operate at near line speed (10G) on a “direct copy” -d switch by replacing 541 - 546 in the BackupManager.cs file from the HVBackup.Engine project with the below code block.

    else if (File.Exists(srcPath))
    {
        var outputName = Path.Combine(vmBackupPath, vmPath.Substring(volumePathLength));
        var folder = Path.GetDirectoryName(outputName);
        if (!Directory.Exists(folder))
        {
            Directory.CreateDirectory(folder);
        }
        File.Copy(srcPath, outputName, true);
    }

What this does is make a direct VSS File.Copy Whereas the original implementation was using streams and topped out at 2.5Gb/s.

File.Copy a direct system call and can use hardware acceleration and large kernel buffers.
Stream.CopyTo is constrained by the buffer size and managed overhead, so throughput is lower.

Also: I now have working (just today) and am refining a “producer → consumer → writer” style LZ4 codec based parallel compressor (just using L00_FAST at the moment) to reduce backup storage by compressing the source. I have it working but need to tune the GC as VHD files end up as thousands of 4MB blocks being compressed across cores in the compressor code.

Even the fastest (least compression) option of this codec reduces the size of a 37GB VHD to 22GB and costs only 30 extra seconds at 137 seconds total versus 67 seconds total for the uncompressed direct copy of the 37G.

The LZ4 codec in ‘fast’ is so efficient that the cores (18 of, core 0 and 1 I avoid using) only run at 30%. However, selecting higher compression levels push all cores to 100% as expected.

I am probably going to go via this private route in our case as I am pretty happy with sub 2 mins for a backup at 50% source size. The commercial options out there are either meggabucks (Veeam) or a bit unclear as to the performance vs cost, but certainly more costly than free.

I am intending to get RCT working but that’s another big step.

I’ll reply again once I have any data to provide to you.

Ant · November 16, 2025, 8:21pm

I didn’t properly read what you asked me for, I am sorry, you wanted to know whether the limitation was per client or server total. I am sorry, I didn’t test that so I don’t know, but I am pretty sure the limitation would be per physical client instance. I don’t think the code (but I do not know) backs up more than one virtual machine at once per physical host. I am reasonably confident that you can run multiple physical hosts to one UrBackup server simultaneously and these would operate concurrently. Your bottleneck would then come up as your CPU speed to deal with the CPU interrupts from the NIC and the associated Deferred Procedure Calls to process the packets once the CPU had finished with the packet interrupt at a higher level; and obviously the backup media write speed.

With a much greater level of complexity than I could have ever imagined at the start, I have finished my private implementation of RCT / CBT using Win32 P/Invoke and WMI. The code implements VSS for Windows Server and standard WMI exports for Win 10 / 11. In the case of the client side the export is then read back and pumped through the compressor before being written back to the output directory. I wanted to make sure I get the configuration files because these are often missed out. On the Win10/11 OS WMI is the only supported way. Consequently, it’s hop skip and a jump compared to the server side but once that initial backup is completed the incremental block level backups are identical in both cases. There is no “server” just a target proprietary compressed file format for each VM which contains all needed elements from each backup to completely restore the VM to a given backup time point.

Obviously this forum isn’t about my code it’s about URBackup which for allot of use cases is frikin awesome and free of course. I just have a specific case where it doesn’t quite fit for me. I may consider publishing my code at some point, but that comes with overhead which I do not have bandwidth for. . . so I may not.