Large number of internet clients

vow · September 23, 2020, 2:25pm

We currently back up about 1600 clients over the internet and are adding more daily.

Are there specific changes we should consider with that many internet clients?

What we are seeing is that even when idle (no backups or cleanup), the server’s CPU usage sits at 35-45%. Judging by packet captures, I think this is due to all the internet clients checking in every minute. Can this interval be increased to reduce the load on the server? I would think a check-in every 5 to 15 minutes should be fine.

Also, is there anything we can do to load balance these backups once we eventually outgrow what a single server can do in a night? I can’t see how we could point all clients at a single destination and still load balance behind it. It looks like we would have to manually point new clients to the next server.

Another thing I cannot find is the impact of the “Database cache size during batch processing” setting. Is there a recommended amount based on how much RAM you have, how large the databases are, etc.? We have 32 GB in the server and only do file backups. Data is relatively small per client, but it still adds up to a lot of files.

Any pointers would be helpful as we keep scaling.

Thanks!

Dmitrius7 · September 23, 2020, 9:58pm

Hi!

Sorry, I don’t have answers to your questions.

But I’m very interested in which filesystem you use for storage. How did you optimize iowait performance?

I have a few UrBackup servers.
The biggest one has 40 clients. All of them are internet clients (mostly Windows clients). It creates file and image backups. Data size per backup is 100-700 GB per client.
Total storage size is 10 TB. Allocated size is 8.6 TB.

The problem is high iowait. When 10 clients are creating backups, iowait is very high, at 30%-60%.

You have 1600 clients, and I think on your server more than 50 clients are creating backups at the same time. How did you solve this problem?

Thank you in advance!

orogor · September 24, 2020, 10:27am

Hello

Personally, I would split the clients across multiple servers in manageable chunks, by office location, business unit or whatnot.
If you ever need to run a maintenance operation on one huge server, it can turn into a multi-day downtime.

The database file needs to be on an SSD; it’s too much pain otherwise.

What we do is run 3 groups of 2 servers, and all clients back up to both servers of their group.
Both servers are also configured as internet servers, so the connection works for remote clients as well. But you could use yet another server as the internet server.

From the docs, this option is supposed to help spread the load; I haven’t tried it.
Max number of recently active clients
This option limits the number of clients the server accepts. An active client is a client the server has seen in the last two months. If you have multiple servers in a network you can use this option to balance their load and storage usage. Default: 100.

A client can connect to, I guess, an unlimited number of local servers, but only a single internet server.
Local means reachable via UDP broadcast on the subnet.

Each server can push its own policy when the client connects to it, so you can back up on different days on each server, back up different folders, use different retention and whatnot.

The commercial versions might be worth investigating, because apparently some have federation support if you have a lot of servers. But I haven’t tried them.

patt · September 25, 2020, 6:00pm

I am interested too! I would limit the number of parallel backups!
My old server (Opteron with 12 threads, btrfs filesystem on 48 TB of disk in software RAID 6 using Linux md rather than btrfs RAID, an old HBA, 46 GB RAM) can handle 2 parallel backups safely. I had approx. 50 internet-only clients, but some of them had TBs to back up. When I increased the number of parallel backups, the slowdown got bigger, and backing up took much more time than with 2 threads or with backing up everything sequentially.
Check how long it takes to back up an average client, divide that by the number of parallel tasks and multiply by the number of clients, and you will get an approximate time to back up everything.
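That rule of thumb can be written down as a small Python sketch. The function name and the example numbers are made up for illustration, and the estimate assumes parallel backups scale perfectly, which the btrfs experience above suggests they often do not.

```python
def estimate_total_backup_hours(avg_minutes_per_client, parallel_backups, num_clients):
    """Rough wall-clock estimate: (average time / parallel tasks) * clients."""
    if parallel_backups < 1:
        raise ValueError("parallel_backups must be at least 1")
    total_minutes = avg_minutes_per_client / parallel_backups * num_clients
    return total_minutes / 60.0


# Hypothetical inputs: 3 minutes per client on average,
# 10 backups running in parallel, 1600 clients.
hours = estimate_total_backup_hours(
    avg_minutes_per_client=3, parallel_backups=10, num_clients=1600
)
print(f"Estimated time to back up everything: {hours:.1f} h")
```

With those hypothetical inputs the estimate comes out at about 8 hours. If parallel backups slow each other down, the measured average per client has to be taken at the parallelism level you actually run.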
We are testing a new server with 40 threads, 128 GB RAM, 128 TB disk with ZFS, and a 10 Gbps network interface. The parallel backup limit is 5. My first impression is that the new ZFS on Linux performs great! But we just started testing, so I don’t know yet.

vow · September 28, 2020, 12:46pm

We limit to 10 backups at a time, but they run very quickly as we are only doing file backups and the daily diff is pretty small. We are fortunate that we have customers split all across the country so we stagger them by time zone to get them to complete.

I would like to try to change the check-in interval somehow, because I think a lot of that overhead is for really nothing. If they checked in every 5 minutes, that would be a 5x reduction in chatter. Maybe there is a reason for checking in so often, but I don’t see it.
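To put rough numbers on that, here is a small Python sketch of the aggregate check-in rate for 1600 clients at different intervals. It assumes check-ins are spread evenly over the interval and that the once-per-minute traffic is in fact a check-in, which is not confirmed; the function name is only for illustration.

```python
def checkins_per_second(num_clients, interval_seconds):
    """Average check-ins per second if clients are spread evenly over the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return num_clients / interval_seconds


clients = 1600
for minutes in (1, 5, 15):
    rate = checkins_per_second(clients, minutes * 60)
    print(f"every {minutes:>2} min: {rate:6.2f} check-ins/s")
```

Under those assumptions a 1-minute interval is roughly 27 check-ins per second and a 5-minute interval roughly 5 per second. This only counts messages; it says nothing about how much CPU each one costs on the server.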

I get splitting the customers onto another server, but that just doesn’t seem efficient for support and long-term growth. I was thinking of some type of shared storage with a couple of servers in front to handle the requests and buffer the data that is to be written to disk. We have the database on SSDs and that helped greatly, but the database is getting really large, so I would like to know: what is the impact of the “Database cache size during batch processing” setting? What should it be set to? Some ratio of database size, amount of memory, etc.?

We can manage without the clustering, but I think freeing up the resources spent just on client check-ins would help, as it seems like a waste.

uroni · September 28, 2020, 3:00pm

You didn’t say which operating system you are using yet. If it’s Linux I’d advise compiling it from source (or installing debug symbols), then looking at perf top -g or attaching/sending the output of perf record -g. I’m sure the same is possible on Windows somehow with ETW tracing…
That would narrow down where it spends the CPU time.

It checks every 5min if it should run a backup ( urbackup_backend/urbackupserver/ClientMain.cpp at dev · uroni/urbackup_backend · GitHub ), but, as said, it should really be confirmed whether this is causing the CPU usage. For example, the connections also have regular keep-alives (ping/pong), and there is encryption overhead for starting new connections. Or it could be the thread updating the list of clients being online (looking at htop with thread names would be helpful as well!).
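If htop is not at hand, a per-thread view can also be approximated on Linux by reading /proc directly. The following is a minimal Python sketch, not part of UrBackup: it sums user and system CPU ticks per thread name for a given process ID. The helper names are made up for illustration, and you have to find the server's PID yourself.

```python
import os


def parse_task_stat(stat_line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The name sits in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    name = stat_line[open_paren + 1:close_paren]
    fields = stat_line[close_paren + 1:].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are fields 14 and 15.
    utime, stime = int(fields[11]), int(fields[12])
    return name, utime + stime


def thread_cpu_ticks(pid):
    """Sum cumulative CPU ticks per thread name for one process (Linux only)."""
    totals = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as stat_file:
                name, ticks = parse_task_stat(stat_file.read())
        except OSError:
            continue  # the thread exited while we were reading
        totals[name] = totals.get(name, 0) + ticks
    return totals


def report(pid, top=10):
    """Print the thread names with the most accumulated CPU ticks."""
    totals = thread_cpu_ticks(pid)
    for name, ticks in sorted(totals.items(), key=lambda item: -item[1])[:top]:
        print(f"{ticks:>12}  {name}")
```

Calling `report(pid)` with the server's process ID prints cumulative ticks since each thread started. Taking two snapshots with `thread_cpu_ticks` a minute apart while the server is idle and subtracting them shows which threads are busy right now. This is much coarser than perf and only useful if the threads carry distinct names.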

The main fix would, I guess, be to implement a feature that takes the clients offline while they are not running backups. It wouldn’t be possible to initiate backups/restores from the server anymore, so this comes with disadvantages. Would this solve the problem or is e.g. the backup interval too small?

The “Database cache size during batch processing” setting has next to no impact currently.