Slow image backups with UrBackup on Kubernetes on TrueNAS SCALE

When performing file backups, I get total throughput close to the physical device capability (multiple simultaneous full file backups at 500 Mbps to 1.4 Gbps each). When performing image backups, each backup will not exceed 300 Mbps, and throughput does not degrade until I run 4 or more simultaneous backups (a single backup runs at 250-300 Mbps; multiple backups run at 250-300 Mbps each, not exceeding 1.3 Gbps aggregate). I have tried increasing the priority of the client and clientbackend services, changing from hash to raw transfers, setting high resource limits or removing resource limits entirely in Kubernetes, and other tweaks, all with no impact. I have tried both compressed and uncompressed transfers. It is as if there is a per-client throughput limit on image backups that is not there for file backups.

UrBackup: Server 2.5.31 in Docker (uroni/urbackup-server repo), clients 2.5.23-2.5.24.

MB: ASUS X299 Sage 10G
CPU: Intel 10980XE, 18 cores/36 threads
RAM: 256GB DDR4-3600
NIC: Intel X550-T2 (LACP bonded)
Drives: 8x Ultrastar 8TB 6Gbps drives in a RAID-Z2 ZFS pool (6 data drives and 2 redundancy drives). UrBackup runs on a RAID-Z1 ZFS pool of 8x 2TB Samsung 870 SATA SSDs. All drives are connected to an LSI 9300-16i.

Clients: a mix of Windows and Unix/Linux/Mac (image backups only supported on Windows). The two main Windows clients (i9-13900K and i9-10980XE, both with 10Gbps Ethernet) each have a 2TB C: (~0.75-1.5TB used) and an 8TB D: (~3.5TB used). The other Windows clients have 300GB-500GB C: drives on i9-11900K systems.

Can you somehow find out where the bottleneck is? Client vs. server? Network vs. disk I/O, etc.?

Which image file format do you use?

I have tried both VHDX and VHDZ, but need the Z format since some drives are larger than 2TB. There is no CPU bottleneck on client or server, no I/O bottleneck until I have 4 or more concurrent image backups, and no network bottleneck since everything is 10Gbps and maxing out under 2Gbps. No metric on any client or server is anywhere near maxed. I thought there might be some Kubernetes limitation, but then I am not sure why file backups would be faster. The one possibility I have not found a way to test is whether file backups execute in a more parallel fashion while image backups are single-threaded. I would have assumed image backups would be faster, since they can be a continuous stream of blocks versus a series of small and large files with more lag time between them. Any possibility this could be at the VSS level?
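One way to test the single-threaded hypothesis without UrBackup's help is to sample per-thread CPU time of the server process while an image backup runs: if one thread accumulates nearly all the ticks, the image pipeline is effectively serialized. A hedged, Linux-only sketch using /proc (the PID below is a placeholder for this example; point it at urbackupsrv, or the client backend inside the container):

```python
import os, time

# Linux-only sketch: sample per-thread CPU ticks of a process to see
# whether a single thread is doing all the work during an image backup.
def thread_cpu_ticks(pid):
    ticks = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        with open(f"/proc/{pid}/task/{tid}/stat") as f:
            # Split after the "(comm)" field; index 11 is utime, 12 is stime.
            fields = f.read().rsplit(") ", 1)[1].split()
        ticks[tid] = int(fields[11]) + int(fields[12])
    return ticks

pid = os.getpid()  # placeholder; substitute the backup process PID
before = thread_cpu_ticks(pid)
time.sleep(1.0)
after = thread_cpu_ticks(pid)
for tid, total in sorted(after.items()):
    print(f"tid {tid}: {total - before.get(tid, 0)} ticks during sample")
```

Running this repeatedly during a file backup and then during an image backup would show whether the thread distribution differs between the two paths.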

uroni, are there defaults for the max network speed settings (internet and local) when the fields are not populated, or are they unlimited unless the fields are used to set a limit? I can see some variance when increasing hash threads on the client, but that does not seem to change the max bandwidth used for image backups. I cannot see a pegged core/thread on the server or the client. I set both local/passive and internet/active to 10Gbps with no change. Any single image backup still tops out at 300Mbps, and I can still start 3 additional image backups in parallel without seeing a decrease from 300Mbps each. File backups peak closer to 1.2Gbps for a single backup, which seems to be the disk I/O limit.

Please see my additional comments in this thread. One additional thought: are image backups done with small block reads? If so, an NVMe SSD may be reading a large volume of data for each small read and only passing along the requested blocks. If file backups read file by file, the reads may be significantly better optimized for an NVMe SSD.
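The small-block hypothesis is easy to sanity-check outside UrBackup by timing the same file read at two block sizes. A hedged micro-benchmark (the 32 MiB file and the 4 KiB / 4 MiB block sizes are arbitrary choices, not UrBackup's actual I/O pattern, and OS page caching means this mostly measures per-read overhead rather than raw device behavior):

```python
import os, time, tempfile

# Compare read throughput at a small vs. a large block size on one file.
def read_throughput(path, block_size):
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_size):
            pass
    return os.path.getsize(path) / (time.perf_counter() - start)  # bytes/s

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(32 * 1024 * 1024))  # 32 MiB of test data
    path = tmp.name

small = read_throughput(path, 4 * 1024)         # 4 KiB reads
large = read_throughput(path, 4 * 1024 * 1024)  # 4 MiB reads
print(f"4 KiB reads: {small / 1e6:.0f} MB/s, 4 MiB reads: {large / 1e6:.0f} MB/s")
os.unlink(path)
```

A large gap between the two numbers would at least show that per-read overhead can plausibly cap throughput the way the image backups appear to be capped.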

I finally gave up, installed a new server, and copied the default configs over to my server. Image backups are now running at up to 3Gbps aggregate (1.2Gbps for the highest single image backup), exceeding what I thought the disks could sustain for writes. I wish I knew which config setting was causing the issue.
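For anyone hitting the same wall, one way to find the offending setting after the fact would be to dump the settings from the old and new installs as key=value pairs and diff them. A minimal sketch under that assumption (the keys and values below are made-up placeholders for illustration, not confirmed UrBackup setting names):

```python
# Diff two settings dumps (one "key=value" per line) to spot the change.
def load_settings(text):
    settings = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings

# Placeholder dumps standing in for the old (slow) and new (fast) configs.
old = load_settings("global_local_speed=300\nimage_file_format=vhdz")
new = load_settings("global_local_speed=-1\nimage_file_format=vhdz")
for key in sorted(old.keys() | new.keys()):
    if old.get(key) != new.get(key):
        print(f"{key}: old={old.get(key)} new={new.get(key)}")
```

In practice the two dumps would come from the respective servers' settings exports rather than inline strings.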