Server 2.5.23/Client 2.5.17 - Backups hang on server on client standby

As always, thank you for the great tool which works great. This forum is usually used to report issues or ask for help, however happy customers are not coming here and they are the majority to my understanding. I’m one of the happy customers.

Description:
Server has multiple clients which run backups. If client is getting switched off (standby and hibernate) during the backup procedure it hangs on server. Even when client is resumed after few hours - server is unable to pick it up and either cancel or resume backup. Those backups hangs forever on server. Client reports no connection to server.

Steps to reproduce:

  1. Setup client and server to run backups.
  2. Start backup. For me, hanging type was image backup. For file backups I cannot see such behavior yet.
  3. After backup has started - set client to standby mode.
  4. Wait for few hours.
  5. Resume client.

Expected behaviour:
Backup which is in progress is either picked up where it left on client resume or considered failed after some timeout or when client is back online.

Actual behaviour:
Server hangs in endless loop in trying to reconnect to the client. Backups never fail or resume - server restart is required. Other clients are also not being backed up.

Logs on server:

Mar 11 10:40:59 apparel urbackupsrv[3560]: Socket has error: 113
Mar 11 10:40:59 apparel urbackupsrv[3560]: Socket has error: 113
Mar 11 10:40:59 apparel urbackupsrv[3560]: Connecting Channel to CLIENT1 failed - CONNECT error -55
Mar 11 10:40:59 apparel urbackupsrv[3560]: Connecting to ClientService of "CLIENT1" failed: Error sending 'running' (2) ping to client
Mar 11 10:40:59 apparel urbackupsrv[3560]: Error sending 'running' (3) ping to client
Mar 11 10:41:04 apparel urbackupsrv[3560]: Connecting Channel to CLIENT2 failed - CONNECT error -55
Mar 11 10:41:04 apparel urbackupsrv[3560]: Connecting Channel to CLIENT3 failed - CONNECT error -55
Mar 11 10:41:04 apparel urbackupsrv[3560]: Connecting Channel to CLIENT4 failed - CONNECT error -55
Mar 11 10:41:04 apparel urbackupsrv[3560]: Connecting to ClientService of "CLIENT4" failed: Error sending 'running' (2) ping to client
Mar 11 10:41:04 apparel urbackupsrv[3560]: Error sending 'running' (3) ping to client
Mar 11 10:41:06 apparel urbackupsrv[3560]: Socket has error: 113
Mar 11 10:41:06 apparel urbackupsrv[3560]: Connecting to ClientService of "CLIENT2" failed: Error sending 'running' (2) ping to client
Mar 11 10:41:06 apparel urbackupsrv[3560]: Socket has error: 113
Mar 11 10:41:06 apparel urbackupsrv[3560]: Connecting Channel to CLIENT2 failed - CONNECT error -55
Mar 11 10:41:06 apparel urbackupsrv[3560]: Error sending 'running' (3) ping to client

On client:

Debug logs on the client:

2022-03-10 17:42:49: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:50: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:51: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:52: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:53: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:54: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:55: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:56: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:57: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:58: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:42:59: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:43:00: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:43:01: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:43:02: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:43:03: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:43:04: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED
2022-03-10 17:43:05: ClientService cmd: STATUS DETAIL#pw=PASSWORD_REMOVED

Sry, cannot reproduce this issue.

It should debug log “CLIENTNAME: Trying to reconnect in doImage” every minute. I guess the log excerpt is too small to see that. When the client goes online again it should also discover this client (and log something about that) and show that it is online again on e.g. the status page on the server.

It doesn’t hang, it tries to resume the imaging process. Yes, that takes away a backup slot and if you have only a limited amount of backup slots that may prevent other backups from starting. After 10 days of this it fails the image backup.

You’re right and those messages are there, I didn’t know that they are meaningful, but please, let me insert bigger log chunk here:

2022-03-11 10:41:06: Error sending 'running' (3) ping to client
2022-03-11 10:41:12: Socket has error: 113
2022-03-11 10:41:12: Socket has error: 113
2022-03-11 10:41:12: Connecting Channel to CLIENT1 failed - CONNECT error -55
2022-03-11 10:41:12: Connecting to ClientService of "CLIENT1" failed: Error sending 'running' (2) ping to client
2022-03-11 10:41:12: Error sending 'running' (3) ping to client
2022-03-11 10:41:12: CLIENT2: Trying to reconnect in doImage
2022-03-11 10:41:14: Connecting Channel to CLIENT3 failed - CONNECT error -55
2022-03-11 10:41:14: Connecting Channel to CLIENT4 failed - CONNECT error -55
2022-03-11 10:41:14: Connecting Channel to CLIENT2 failed - CONNECT error -55
2022-03-11 10:41:14: Connecting to ClientService of "CLIENT2 " failed: Error sending 'running' (2) ping to client
2022-03-11 10:41:14: Error sending 'running' (3) ping to client
2022-03-11 10:41:19: Socket has error: 113
2022-03-11 10:41:19: Connecting Channel to CLIENT5 failed - CONNECT error -55
2022-03-11 10:41:19: Socket has error: 113
2022-03-11 10:41:19: Connecting to ClientService of "CLIENT5 " failed: Error sending 'running' (2) ping to client
2022-03-11 10:41:19: Error sending 'running' (3) ping to client
2022-03-11 10:41:20: CLIENT1 : Trying to reconnect in doImage
2022-03-11 10:41:23: Socket has error: 113
2022-03-11 10:41:23: Socket has error: 113
2022-03-11 10:41:23: Connecting to ClientService of "CLIENT1" failed: Error sending 'running' (2) ping to client
2022-03-11 10:41:23: Connecting Channel to CLIENT1 failed - CONNECT error -55
2022-03-11 10:41:23: Socket has error: 113
2022-03-11 10:41:23: Error sending 'running' (3) ping to client
2022-03-11 10:41:24: Connecting Channel to CLIENT3 failed - CONNECT error -55
2022-03-11 10:41:24: Connecting Channel to CLIENT4 failed - CONNECT error -55
2022-03-11 10:41:24: Connecting Channel to CLIENT2 failed - CONNECT error -55
2022-03-11 10:41:24: Connecting to ClientService of "CLIENT2 " failed: Error sending 'running' (2) ping to client
2022-03-11 10:41:24: Error sending 'running' (3) ping to client
2022-03-11 10:41:27: CLIENT5: Trying to reconnect in doImage

They are there, but this has no effect. Client appears as online from the server prospective (I cannot see relevant messages on the sever side, even in debug mode), i.e. server detects the client, however from the client prospective there is no server.

As steps to reproduce, please try to have client switched off for, let’s say 8 hours? Because in reality it was sleep-over-night.

Perhaps you can check if the server can ping the client when this happens? One method is to put the client into client discovery hints. The other is to run wireshark on the client and look if it receives the broadcasts from the server.

It seems that I was able to pin this down. Context:

  • I have FQDN for UrBackup server in LAN
  • Default settings were to connect via that FQDN for all clients
  • In LAN that FQDN resolved in local IP, in internet it was resolved in external IP.
  • My logic was the following: sometimes it takes time for server-client connection via broadcasting, probably enabling internet server on local clients would speed up that process. And it did.

So what happened was:

  • Client/Server connect via broadcast or via internet active connection.
  • Backup starts
  • Client goes to sleep
  • Client wakes up, but now it’s connected via internet connection using FQDN, instead of broadcast group. Or vice versa. For server it’s the same client, however using new channel to continue backups is not possible - it leads to above errors.

I’ve removed active internet connections for local clients and this seems to have fixed the issue so far, i.e. after sleep client successfully continues backup and server is able to reconnect at the last state.

Likely a warning is needed on the client/server side, should such situation occur, i.e. when broadcasted IP and active connection IP is the same.

But probably you have a better idea to resolve this. The issue was floating and drove me nuts, as I was unable to catch it.

Unfortunately issue has returned even after disabling internet mode for local clients.

I’ve made a dump for UDP port 35623 from the client side when issue occurs. PFA PCAP file. I had to change it’s extension as forum doesn’t allow me to attach PCAP file. So, please, change it’s extension to pcap and you should be able to open it with wireshark.

Client in CentOS 7. Same issue is observed for Windows clients as well.

failed_urbackup.txt (19.0 KB)

Restarting the client - fixes the issue.

I hope this would be helpful.