borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.

Home Page: https://www.borgbackup.org/

connection closed by remote

ThomasWaldmann opened this issue · comments

commented

sometimes a borg backup gets interrupted with "connection closed by remote".

it is unclear why this happens or whether this is a problem in borg, in the ssh configuration or on the network layer.

Suspicions:

  • ssh / tcp keepalive configuration (this was pointed out on the atticmatic site, but it did not help in an experiment a borg user did), so it either does not help or is sometimes not sufficient (from IRC: Azelphur$ FYI, TCPKeepAlive yes, ServerAliveInterval 60, ServerAliveCountMax 5 didn't work :( )
    See also: http://www.cyberciti.biz/tips/open-ssh-server-connection-drops-out-after-few-or-n-minutes-of-inactivity.html
  • borg: EINTR handling and exception handling in general could maybe be improved in remote.py (though it is unclear what would cause EINTR in our case, and it would not terminate the process anyway). See there for details: http://250bpm.com/blog:12
  • there might be longer times with no or only little data transfer (e.g. if borg is reading over a lot of zeros in a sparse file or if all the chunks are already stored in the repo - e.g. when backing up VMs)
  • a router might be configured in its security/DoS settings to do something special in case a lot of packets arrive - could this be triggered by borg?
  • also (trivially), it might just be an unstable network connection that breaks down
commented

Idea for maybe getting more information from the server side: instead of relying on the log messages that come over the wire (or rather, don't come if the connection gets killed), configure the borg on the server to log to a local file.

logging.conf: https://paste.thinkmo.de/Nu7rdXvz#logging.conf

export BORG_LOGGING_CONF=/path/to/that/logging.conf
^ that needs to be put into the environment that processes started via ssh (like the borg executable) get.
(not sure whether we need a borg wrapper script here or an sshd config change)
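Since the linked paste may no longer be available, here is a rough sketch of what such a logging.conf (Python logging fileConfig format) could look like; the log file path and the handler/formatter names are placeholders of mine, not the original file:

[loggers]
keys=root

[handlers]
keys=logfile

[formatters]
keys=logfile

[logger_root]
level=DEBUG
handlers=logfile

[handler_logfile]
class=FileHandler
level=DEBUG
formatter=logfile
args=('/path/to/borg-server.log',)

[formatter_logfile]
format=%(asctime)s %(name)s %(levelname)s %(message)s

And if a wrapper script turns out to be the easier route, a minimal sketch (the path to the real borg is an assumption) would be a small script placed ahead of the real borg in the login shell's PATH:

#!/bin/sh
# hypothetical wrapper: set the logging config, then run the real borg with all arguments
export BORG_LOGGING_CONF=/path/to/that/logging.conf
exec /usr/bin/borg "$@"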

The borg log sadly doesn't actually contain any useful information; it just says

2016-02-08 19:14:54,477 borg.logger DEBUG using logging configuration read from "/home/azelphur/logging.conf"

Here is the log from the client side, running with BORG_RSH="ssh -v"

# BORG_RSH="ssh -v" /home/azelphur/.borg-backup
Backing up to ssh://azelphur@backup.azelphur.com:23/home/azelphur/darth-vader.borg
Remote: OpenSSH_6.9p1 Ubuntu-2ubuntu0.1, OpenSSL 1.0.2d 9 Jul 2015
Remote: debug1: Reading configuration data /etc/ssh/ssh_config
Remote: debug1: /etc/ssh/ssh_config line 19: Applying options for *
Remote: debug1: Connecting to backup.azelphur.com [176.9.24.240] port 23.
Remote: debug1: Connection established.
Remote: debug1: permanently_set_uid: 0/0
Remote: debug1: identity file /root/.ssh/id_rsa type 1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_rsa-cert type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_dsa type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_dsa-cert type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ecdsa type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ecdsa-cert type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ed25519 type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ed25519-cert type -1
Remote: debug1: Enabling compatibility mode for protocol 2.0
Remote: debug1: Local version string SSH-2.0-OpenSSH_6.9p1 Ubuntu-2ubuntu0.1
Remote: debug1: Remote protocol version 2.0, remote software version OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4
Remote: debug1: match: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 pat OpenSSH_6.6.1* compat 0x04000000
Remote: debug1: Authenticating to backup.azelphur.com:23 as 'azelphur'
Remote: debug1: SSH2_MSG_KEXINIT sent
Remote: debug1: SSH2_MSG_KEXINIT received
Remote: debug1: kex: server->client chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: kex: client->server chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
Remote: debug1: Server host key: ecdsa-sha2-nistp256 SHA256:TZNuG4BKyQfl8EO5Jv+Ld7D0CWeS5Jzr5P/16f4svxI
Remote: debug1: Host '[backup.azelphur.com]:23' is known and matches the ECDSA host key.
Remote: debug1: Found key in /root/.ssh/known_hosts:1
Remote: debug1: SSH2_MSG_NEWKEYS sent
Remote: debug1: expecting SSH2_MSG_NEWKEYS
Remote: debug1: SSH2_MSG_NEWKEYS received
Remote: debug1: SSH2_MSG_SERVICE_REQUEST sent
Remote: debug1: SSH2_MSG_SERVICE_ACCEPT received
Remote: debug1: Authentications that can continue: publickey,password
Remote: debug1: Next authentication method: publickey
Remote: debug1: Offering RSA public key: /root/.ssh/id_rsa
Remote: debug1: Server accepts key: pkalg ssh-rsa blen 279
Remote: debug1: Authentication succeeded (publickey).
Remote: Authenticated to backup.azelphur.com ([176.9.24.240]:23).
Remote: debug1: channel 0: new [client-session]
Remote: debug1: Requesting no-more-sessions@openssh.com
Remote: debug1: Entering interactive session.
Remote: debug1: Sending environment.
Remote: debug1: Sending env LANG = en_GB.UTF-8
Remote: debug1: Sending command: borg serve --umask=077 --info
reading files cache
processing files
Remote: debug1: need rekeying
Remote: debug1: SSH2_MSG_KEXINIT sent
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: SSH2_MSG_KEXINIT received
Remote: debug1: kex: server->client chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: kex: client->server chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: Server host key: ecdsa-sha2-nistp256 SHA256:TZNuG4BKyQfl8EO5Jv+Ld7D0CWeS5Jzr5P/16f4svxI
Remote: debug1: set_newkeys: rekeying
Remote: debug1: SSH2_MSG_NEWKEYS sent
Remote: debug1: expecting SSH2_MSG_NEWKEYS
Remote: debug1: set_newkeys: rekeying
Remote: debug1: SSH2_MSG_NEWKEYS received
Remote: packet_write_wait: Connection to 176.9.24.240: Broken pipe
Connection closed by remote host
commented

ok, if that one line is all you have in the server log, we at least know it is not the borg process on the server side that encounters some exception or other error - that would show up in that log.

so, i guess you have to go through the list of suspicions from my first post, especially the ones relating to ssh itself and the network layer.

another idea is to enhance logging on both sides with a logging.conf and include the timestamps, so we get timing information.

also interesting would be to record another piece of information when doing these experiments:

  • if you do a first backup, there is a lot of traffic going over the network.
  • if you do a 2nd+ backup of same or similar data, there is way less data going over the network
    That difference might be important for debugging this problem. It would be interesting to know whether the issue happens in both cases or not.

ssh / tcp keepalive configuration (this was pointed out on the atticmatic site, but it did not help in an experiment a borg user did), so it either does not help or is sometimes not sufficient (from IRC: Azelphur$ FYI, TCPKeepAlive yes, ServerAliveInterval 60, ServerAliveCountMax 5 didn't work :( )
See also: http://www.cyberciti.biz/tips/open-ssh-server-connection-drops-out-after-few-or-n-minutes-of-inactivity.html

To confirm, I do indeed have

Host *
    ServerAliveInterval 60
    ServerAliveCountMax 5
    TCPKeepAlive yes

in my /etc/ssh/ssh_config file; however, the issue still occurs.

there might be longer times with no or only little data transfer (e.g. if borg is reading over a lot of zeros in a sparse file or if all the chunks are already stored in the repo - e.g. when backing up VMs)

also (trivially), it might just be an unstable network connection that breaks down

if you do a 2nd+ backup of same or similar data, there is way less data going over the network. That difference might be important for debugging this problem. It would be interesting to know whether the issue happens in both cases or not.

To address these questions, I created 10 random files and then tried to back up these files (twice) while running mtr; here are the results:

azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample2.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample3.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample4.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample5.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample6.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample7.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample8.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample9.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample0.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ du -sh *
1.1G    sample0.txt
1.1G    sample2.txt
1.1G    sample3.txt
1.1G    sample4.txt
1.1G    sample5.txt
1.1G    sample6.txt
1.1G    sample7.txt
1.1G    sample8.txt
1.1G    sample9.txt
1.1G    sample.txt
azelphur@darth-vader:~/tmp/pointless-file$ borg create --compression lz4 -sv --exclude-caches ssh://azelphur@backup.azelphur.com:23/home/azelphur/test.borg::`hostname`-`date "+%Y-%m-%d_%H:%M:%S"` /home/azelphur/tmp/pointless-file
reading files cache
processing files
Remote: packet_write_wait: Connection to 176.9.24.240: Broken pipe
Connection closed by remote host
azelphur@darth-vader:~/tmp/pointless-file$ borg create --compression lz4 -sv --exclude-caches ssh://azelphur@backup.azelphur.com:23/home/azelphur/test.borg::`hostname`-`date "+%Y-%m-%d_%H:%M:%S"` /home/azelphur/tmp/pointless-file
reading files cache
processing files
Remote: packet_write_wait: Connection to 176.9.24.240: Broken pipe
Connection closed by remote host

and the output from mtr

                             My traceroute  [v0.85]
darth-vader (0.0.0.0)                                  Fri Feb 12 16:40:12 2016
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. router.asus.com                   0.0%   411    0.3   0.3   0.2   0.9   0.0
 2. lns12.the.dsl.enta.net            0.0%   411   10.9  13.3   9.0  58.5   7.0
 3. te3-3.the.dist.dsl.enta.net       0.0%   411   11.3  13.7   9.2  91.9   7.9
 4. te2-2.telehouse-east3.dsl.enta.n  0.0%   411   10.0  13.7   9.3 116.0   8.6
 5. te5-6.telehouse-east2.core.enta.  1.2%   411   11.4  13.9   9.7  56.0   6.9
 6. te0-1-0-1.telehouse-east4.core.e  0.0%   411   11.5  13.8  10.1  38.1   6.2
 7. te5-3.amsterdam.core.enta.net     0.0%   411   16.5  19.3  15.3  54.8   6.7
 8. amsix-gw.hetzner.de               0.0%   411   24.7  28.3  24.1  72.4   7.0
 9. core1.hetzner.de                  0.2%   411   24.5  28.0  23.9  65.2   6.9
10. core22.hetzner.de                 0.0%   411   29.4  32.6  28.9  56.3   6.4
11. juniper1.rz15.hetzner.de          0.0%   411   30.2  32.9  28.7  82.8   7.4
12. hos-tr1.ex3k2.rz15.hetzner.de     0.0%   410   30.2  33.7  29.2  66.2   6.4
13. backup.azelphur.com               0.0%   410   29.1  32.2  28.7  62.9   6.1

a router might be configured in its security/DoS settings to do something special in case a lot of packets arrive - could this be triggered by borg?

My router is an Asus RT-AC87R running AsusWRT-Merlin, DoS protection is disabled and QoS is also off.

commented

@Azelphur just as a weird idea: could you remove these ssh options (both on server and on client side, restart sshd afterwards) and run your random-files test again?

Yes, those settings can help, but if there's no response the session can still be terminated. I would also suggest removing the settings, leaving an ssh session connected outside of borg, and seeing what the results are.
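One way to run that test, as a sketch (the sleep duration is arbitrary): keep an otherwise idle ssh session to the backup host open for a few hours and see whether it survives without those options:

ssh -p 23 azelphur@backup.azelphur.com 'sleep 14400; echo still alive'

If that idle session also gets dropped, the problem is in ssh or on the network path rather than in borg.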

Ok...here I am, 6 months later. I finally solved it.

There was an issue with my router: I was running an app on the router which does speedtests every 15 minutes. After disabling this app, the problem no longer occurs. Not sure exactly why the issue occurred beyond that, but if anyone else gets this issue and happens to be running some services on their router - disable those services and see if you still get it. :)

commented

@Azelphur thanks for the feedback.

Adding to this topic (unless you prefer a new issue): killing the ssh/gpg-agent will make a borg 1.0.7 create hang indefinitely (strace shows "select() Timeout") with no warning (at least in non-verbose mode).
Reproducer: run a borg create, then kill the ssh-agent and wait a bit.

That's strange. We tested this by killing the ssh client, which works as expected (i.e. it terminates with an error message).

commented

@drzraf please give the precise command how you do the kill.

Sorry for the false alarm. The timeouts seen were not a sign of hanging (it was indeed uploading): the process was still ongoing. In case these "select() timeout" messages are of interest anyway, here is my command:

BORG_PASSPHRASE=xxx borg create --remote-path=/usr/local/bin/borg1/borg1 --exclude-from xxx --exclude-caches --one-file-system --compression zlib --stats --debug "remote:main-backup::2016-10-04 04:24:03.438923" mydir

(I use such a backup name in order to "resume" from the previous borg create where I initially used {now})

What I observed was that the remote repository size was not growing after multiple resumed borg create runs.
But the real reason was that the run was always interrupted (screenlocker/killed ssh-agent) before the next file (~4 GB in size) could finish successfully, and since borg can't resume a file upload (as of 1.0.7), it was always "removing" this partial file and restarting.
My hypothesis about borg <-> ssh-agent was wrong. Sorry for the noise.

commented

@drzraf you do not need to use a special name to "resume" a past backup. just create a new backup with any name you like and it will use what is already in the repo (so this feels like a resume operation, but technically it is not).
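As a sketch of what that looks like (repository name reused from the command above, archive name is just the {now} placeholder, other options omitted for brevity):

# chunks that are already in the repo are deduplicated and not re-sent,
# so an interrupted backup effectively picks up where it left off on the next run
borg create --compression zlib --stats remote:main-backup::'{now}' mydir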

commented

IMHO you cannot prevent this type of error. As one user pointed out, it may have to do with the router or, e.g., with VPNs/proxies or whatever...

In any case, I would much prefer it if borg were able to circumvent this issue by automatically retrying the backup (maybe 3, or a configurable number of, times).
That way this error would be much less serious and backups should still work.

I am getting this error consistently while directly connected to a gigabit switch. It doesn't seem to be happening on the same file though. (I just reran it and it redid a bunch of files and went past the one that it hung on before.)

commented

BTW I actually made a small shell wrapper script for borg, which circumvents this issue by just retrying the backup (as I suggested before). See: https://github.com/rugk/borg-cron-helper

Thank you @rugk. I am also affected by randomly dropped connections, so your script is very helpful. If anyone wants to automate their backups without an additional dependency, the most barebones approach would be:
$ until borg create <repository> <files>; do :; done
which restarts the command until borg exits successfully with exit code 0 (note: no command substitution $(...) around the command, since that would try to execute borg's output). This makes it impossible to use Ctrl-C to cancel a running backup, so use Ctrl-Z followed by $ kill %1 (or whatever job id the loop has).

If you want to control how often to restart the backup or include some exponential backoff time between attempts, look at https://github.com/kadwanev/retry, which is a very small bash retry script that is generally useful.
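For completeness, a self-contained sketch of such a retry loop without the external script (repository, archive name, and the retry/backoff values are placeholders of mine; the exit-code handling follows borg's convention of 0 = success, 1 = warning, 2 = error):

#!/bin/sh
attempt=0
delay=60
while true; do
    borg create --stats "ssh://user@host/path/to/repo::{hostname}-{now}" /data
    rc=$?
    # only retry on a real error (rc 2); a warning (rc 1) is not a dropped connection
    [ "$rc" -ne 2 ] && exit "$rc"
    attempt=$((attempt + 1))
    # give up after 5 attempts
    [ "$attempt" -ge 5 ] && exit "$rc"
    sleep "$delay"
    # exponential backoff: 60s, 120s, 240s, ...
    delay=$((delay * 2))
done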

My comment on issue #3988 might also help. Specifically:

In the server's (the machine running borg serve) /etc/ssh/sshd_config file:

ClientAliveInterval 10
ClientAliveCountMax 30

This will cause the server to send a keepalive to the client every 10 seconds. If 30 consecutive keepalives are sent without a response, the connection will be terminated.

If you then run your borg commands with --lock-wait 600, this gives the borg serve process sufficient time to exit after the SSH connection is torn down following the 300-second (5-minute) wait for the keepalives to fail.
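As a sketch, the matching client-side invocation (repository path and source directory are placeholders) would be something like:

borg create --lock-wait 600 --stats ssh://user@backupserver/path/to/repo::'{hostname}-{now}' /home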

The documentation has been updated; I think this issue can now be closed?

commented

yup, thanks!

I am seeing this issue on BorgBase when syncing the indexes of an existing repo - could it be that it isn't configured correctly?

commented

Thanks for the pointer, @ThomasWaldmann . I did NOT have those values in my configs, but the defaults, which are:

ClientAliveInterval 0
ClientAliveCountMax 3

Just updated this for current and future setups. Hope this solves the issue @akvadrako is observing.
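If in doubt, the effective values can be checked on the server by dumping sshd's runtime configuration (run as root); the output shown here is what you would expect after the change:

sshd -T | grep -i clientalive
# clientaliveinterval 10
# clientalivecountmax 30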

Thanks @m3nu - it turns out I was seeing these errors while over a VPN. If I see them again I'll let you know.

I think the actual documentation of the ssh fix referred to here is
https://borgbackup.readthedocs.io/en/stable/usage/serve.html?highlight=lock-wait#examples
which involves

  1. editing the server's /etc/ssh/sshd_config and restarting the ssh service
  2. editing the client side's ~/.ssh/config (sketched just after this list)
  3. modifying the authorized_keys of the server's backup user (e.g. a dedicated borg-backup user) to include the --lock-wait 600 parameter in the borg serve command therein, presumably something like: command="/usr/local/bin/borg serve --lock-wait 600 --restrict-to-path /volume1/BorgBackup/" ssh-ed25519 AAAACxxxxxx...etc
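For reference, the client-side settings from step 2 presumably mirror the server-side values, along these lines (the host alias is a placeholder):

Host backupserver
    ServerAliveInterval 10
    ServerAliveCountMax 30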

I'm trying out these steps now...

commented

Not sure if enforcing --lock-wait via authorized_keys is a) working and b) helpful - usually the client should say how long it wants to wait for a locked repo to get unlocked.

Yeah, turns out the ssh tweaks above didn't help me.

Sigh... not sure what to try next - have raised #7313

Yeah, turns out the ssh tweaks above didn't help me.

Sigh... not sure what to try next - have raised #7313

Yeah, same for me; I tried all the above tweaks, but it still raises the "Connection closed by remote host" error after a while, no matter what.

commented

Borgmatic has a retry setting that will just retry the backup. Decrease the checkpoint interval if you want.

I keep a long FAQ on the issue here: https://docs.borgbase.com/faq/#my-ssh-connection-breaks-after-a-long-backup-or-prune-operation

May give some new ideas.
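Regarding shorter checkpoints, a sketch of what that could look like (repository and paths are placeholders; the interval is in seconds, and borg's default is 1800):

# write a checkpoint every 5 minutes so a retried run loses at most that much progress
borg create --checkpoint-interval 300 --stats ssh://user@host/path/to/repo::'{hostname}-{now}' /data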

commented

@m3nu That retry script, does it retry on rc==1? That would be bad...

commented

Good point. Borgmatic doesn't, but the general retry script treats it as an error and retries 10x. I've added a note on this in the FAQ. Good to mention it.

Borgmatic has a retry setting that will just retry the backup. Decrease checkpoint time if you want.

I keep a long FAQ on the issue here: https://docs.borgbase.com/faq/#my-ssh-connection-breaks-after-a-long-backup-or-prune-operation

May give some new ideas.

Thanks a lot! To be honest it's a bit janky, but it's working, I guess.

I actually ended up setting TCPKeepAlive to no because it gave me better results, but my tests were not that precise, so I may be wrong.
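For anyone wanting to try the same thing, that change amounts to something like this in the client's ssh config (keeping the application-level ServerAlive probes alongside it is my assumption, not something tested above):

Host backupserver
    TCPKeepAlive no
    ServerAliveInterval 60
    ServerAliveCountMax 5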