Here is the iperf3 output of a test from the local Windows 2025 box to the Linux UrBackup server.
Connecting to host 10.38.1.12, port 5201
[ 5] local 10.38.1.10 port 59205 connected to 10.38.1.12 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.01 sec 1.16 GBytes 9.93 Gbits/sec
[ 5] 1.01-2.01 sec 1.15 GBytes 9.87 Gbits/sec
[ 5] 2.01-3.01 sec 1.15 GBytes 9.91 Gbits/sec
[ 5] 3.01-4.02 sec 1.17 GBytes 9.91 Gbits/sec
[ 5] 4.02-5.01 sec 1.13 GBytes 9.90 Gbits/sec
[ 5] 5.01-6.01 sec 1.15 GBytes 9.91 Gbits/sec
[ 5] 6.01-7.01 sec 1.14 GBytes 9.78 Gbits/sec
[ 5] 7.01-8.01 sec 1.15 GBytes 9.89 Gbits/sec
[ 5] 8.01-9.01 sec 1.15 GBytes 9.91 Gbits/sec
[ 5] 9.01-10.01 sec 1.15 GBytes 9.89 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.01 sec 11.5 GBytes 9.89 Gbits/sec sender
[ 5] 0.00-10.01 sec 11.5 GBytes 9.89 Gbits/sec receiver
As you can see, the link is fully saturated. Since the storage is U.2 based and these servers have 20 cores and 256 GB of RAM, they are easily capable of these sorts of speeds.
Here is a log of a VM backup involving the same two host computers as in the speed test above.
Starting unscheduled full image backup of volume "SCSI:0:0"…
Basing image backup on last incremental or full image backup
Error retrieving last image backup. Doing full image backup instead.
Loading ZIP metadata from urbackup/mbr_data_y ADRoot Dc1'SCSI_0_0.zip
Server[y AdRoot DC1]: Loading MBR zip file…
Server[y AdRoot DC1]: Loaded MBR zip file (9.97168 KB)
Transferred 23.325 GB - Average speed: 813.435 MBit/s
Time taken for backing up client Server[y AdRoot DC1]: 4m 29s
Backup succeeded
I am interested in the 813 Mbit/s statistic, since this is roughly 8% of the network capability. It's a local copy between two machines sitting next to each other, with no compression and no encryption. So it seems 92% of my network is unused in a backup scenario. When I look at the disk reads they are very small: the disk reads at around 150 MB/s, but only for a second or so, and then it stops while presumably that chunk is transmitted to the server. The disk is capable of reading at 3,000 MB/s.
In frustration I pasted the ClientSend code into claude.ai in the context of the above problem and observations.
Looking at this ClientSend code, I can see some potential performance issues:
Key Problems:

- Single-threaded sending: This appears to be a single sender thread that processes one buffer at a time from a queue. No parallel sending.
- Small buffer management: Uses `CBufMgr2(nbufs, bsize)` - the buffer size (`bsize`) and number of buffers (`nbufs`) are likely too small for high-speed networks.
- Synchronous writes with timeout: `pipe->Write(item.buf, item.bsize, 60000, false)` - the `false` parameter suggests synchronous writes, waiting for each buffer to complete before sending the next.
- Queue-based architecture: Uses a simple queue with condition variables, which adds overhead vs direct streaming.
- Frequent flushing: Calls `pipe->Flush(-1)` multiple times, which forces TCP to send data immediately rather than batching efficiently.
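To picture the pattern Claude is describing (one consumer thread draining a bounded queue of buffers, with nothing overlapping the send), here is a minimal sketch of my own; the class and member names are illustrative, not UrBackup's actual code:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Illustrative stand-in for the queue-plus-single-sender pattern described
// above. Each buffer is fully handed to the sender before the next is
// dequeued, so the disk reader stalls when the queue is full and the
// network stalls when it is empty - the two phases never overlap.
struct SendQueue {
    std::mutex mtx;
    std::condition_variable cv;
    std::queue<std::vector<char>> items;
    bool done = false;

    void push(std::vector<char> buf) {
        std::lock_guard<std::mutex> lk(mtx);
        items.push(std::move(buf));
        cv.notify_one();
    }

    // Single consumer: blocks until a buffer (or shutdown) arrives,
    // then pops exactly one buffer for sending.
    bool pop(std::vector<char>& out) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return done || !items.empty(); });
        if (items.empty()) return false;
        out = std::move(items.front());
        items.pop();
        return true;
    }

    void finish() {
        std::lock_guard<std::mutex> lk(mtx);
        done = true;
        cv.notify_all();
    }
};
```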
What's missing for 10G performance:

- No TCP_NODELAY/buffering optimization
- No async I/O or overlapped operations
- No multiple parallel streams
- Likely small buffer sizes (typical defaults are 64KB-1MB, but 10G needs much larger)
The architecture is fundamentally designed for slower networks. For 10G speeds, you'd want:

- Multiple sender threads
- Large buffers (8-32MB)
- Async/overlapped I/O
- Proper TCP window scaling
This explains perfectly why you're stuck at 800 Mbps - it's a classic single-threaded, small-buffer network implementation.
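Claude's window/buffer point can be sanity-checked with the bandwidth-delay product: sustained TCP throughput is roughly the in-flight window divided by the round-trip time. Assuming a hypothetical 1 ms effective round trip (my assumption, including application turnaround, not a measurement), a 64 KB window tops out around 524 Mbit/s, in the same neighborhood as the observed ~813 Mbit/s, while filling a 10 Gbit/s link would need about 1.25 MB in flight:

```cpp
// Back-of-the-envelope TCP arithmetic. The 1 ms RTT in the examples
// below is an assumed figure, not a measurement.

// Maximum throughput (Mbit/s) achievable with a given window over a given RTT.
double max_mbit_per_s(double window_bytes, double rtt_s) {
    return window_bytes * 8.0 / rtt_s / 1e6;
}

// Bandwidth-delay product: bytes that must be in flight to fill the link.
double bdp_bytes(double link_bit_per_s, double rtt_s) {
    return link_bit_per_s * rtt_s / 8.0;
}

// max_mbit_per_s(64 * 1024, 0.001)  -> ~524 Mbit/s
// bdp_bytes(10e9, 0.001)            -> 1.25e6 bytes (~1.25 MB)
```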
I also had Claude.ai take a look at the Connector code, and here is its response:
Looking at this Connector code, I can see more evidence of UrBackup's networking limitations:
Key networking issues in this code:
- Simple socket implementation: Uses basic `socket()`, `connect()`, `send()`, `recv()` calls without any optimization
- No TCP optimization: No evidence of:
  - `TCP_NODELAY` to disable Nagle's algorithm
  - `SO_SNDBUF`/`SO_RCVBUF` for larger buffer sizes
  - Window scaling options
  - Any modern TCP performance tuning
- Synchronous I/O: The `recv()` call in a simple while loop:

  ```cpp
  int rc=recv(p, buffer, 1024, MSG_NOSIGNAL);
  ```

  - Small 1024-byte buffer
  - Blocking receive calls
  - No async/overlapped I/O
- Basic protocol: Uses a simple text-based protocol with `CTCPStack.Send()` - likely not optimized for bulk data transfer
- Single connection per operation: Creates one socket, sends command, receives response, closes socket - no connection pooling or persistent connections
This explains the 800 Mbps ceiling perfectly:
- 1024-byte receive buffer is tiny for 10G networks
- No TCP tuning means default (conservative) TCP window sizes
- Synchronous blocking I/O prevents overlapping network and disk operations
- Text-based protocol has overhead vs binary protocols
This is classic "works fine on 100 Mbps networks" code that falls apart at 10G speeds. The architecture is fundamentally designed for low-speed networks and would need a complete rewrite to handle modern high-speed networking properly.
It is very frustrating to have burned well over a week of time on this, only to find this huge limitation in the design architecture. Sigh.