wtsi-npg / baton

iRODS client programs and API

Home Page: http://wtsi-npg.github.io/baton


Connect failure: was it an error, and should we retry?

mcast opened this issue · comments

What happened

While running a 16-element LSF array job, each element doing (baton-metasuper; imv; imeta ls -d) per file at about 0.4 files per second per job, against an iCAT on a VM with 4 GiB RAM and 4 CPUs, backed by a 4-node (RAC cluster?) Oracle database, I saw:

: ---
: query:
:   - avus:
[...]
:       - attribute: test-dev1-prelive.id_file
:         value: 975811
:       - attribute: test-dev1-prelive.id_ifile
:         value: 2034712
:       - attribute: test-dev1-prelive.version
:         value: 1
:     collection: /cgp/sandbox/DEV1/prelive/ireg-tmp/id_p_u=241
:     data_object: 975811.v1.caveman_c.tar.gz
ERROR: readMsgHeader:header read- read 0 bytes, expect 4, status = -4115
ERROR: connectToRhost: readVersion to irods-cgp-g1-vm.[...] failed, status = -4115 status = -4115 SYS_HEADER_READ_LEN_ERR, Operation now in progress
ERROR: _rcConnect: connectToRhost error, server on irods-cgp-g1-vm.[...] is probably down status = -4115 SYS_HEADER_READ_LEN_ERR, Operation now in progress
2016-05-13T14:44:47 ERROR Failed to connect to irods-cgp-g1-vm.[...]:1247 zone 'cgp' as 'mca'.
2016-05-13T14:44:47 ERROR Processed 0 items with 0 errors.
icommand () failed, exit code 1280 at [...~]/gitwk-cgp/cgpDataOut/lib/Sanger/CGP/DataOut/Irods.pm line 767.
[E] Shutdown (exit 5) at Fri May 13 15:44:47 2016 after 84744 sec (chunk is 3%16)

My logging is slightly broken: "icommand () failed" should have named the baton-metasuper --file ... command that failed.

Possible changes

  • The message "ERROR Processed 0 items with 0 errors" seems to assume a definition of error that excludes connection problems. Maybe it should include them?
  • In this case, failure to connect was a transient error and would have been safe to retry.
    • This may or may not be something you want baton to get into.
    • I will probably have to deal with it for the icommands I'm calling anyway.
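A caller-side retry along these lines is straightforward to bolt on around the icommands or baton invocations. This is only a sketch: the command line in the usage comment and the choice of backoff parameters are assumptions, not anything baton itself provides.

```python
import subprocess
import time

def run_with_retries(cmd, retries=0, delay=5.0):
    """Run an external command, retrying on a non-zero exit status.

    retries is the number of additional attempts after the first
    failure (default 0, i.e. fail immediately); delay doubles after
    each failed attempt as a simple backoff.
    """
    attempt = 0
    while True:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        attempt += 1
        if attempt > retries:
            raise RuntimeError(
                "%s failed %d time(s), last exit code %d: %s"
                % (cmd[0], attempt, result.returncode,
                   result.stderr.strip()))
        time.sleep(delay)
        delay *= 2

# Hypothetical usage, with flags as in the report above:
# run_with_retries(["baton-metasuper", "--file", "avus.json"], retries=2)
```

Retrying on any non-zero exit is the crude version; a real wrapper would want to retry only on connection failures, not on genuine per-item errors.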

I've seen two such transient connect failures today, separated by approximately 470k successful connections. The first was for imeta ls -d. Previously the workflow completed many hundreds of thousands of connections without this failure mode, but since then I both upgraded the iCAT VM (from 1 GiB RAM and 1 CPU to 4 and 4) and switched from imeta set -d to baton-metasuper.
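To distinguish a connection failure from a genuine per-item error, a caller could match on the error markers visible in the output above. A minimal sketch; the marker list is an assumption drawn from this one log, and treating these as transient and safe to retry is a judgment about this particular deployment:

```python
# Markers taken from the iRODS error output quoted above; treating any
# of them as "transient, safe to retry" is an assumption, not something
# iRODS or baton guarantees.
CONNECT_ERROR_MARKERS = (
    "SYS_HEADER_READ_LEN_ERR",
    "-4115",
    "Failed to connect",
    "_rcConnect: connectToRhost error",
)

def is_transient_connect_failure(stderr_text):
    """Return True if the stderr text looks like an iRODS connection
    failure rather than a genuine per-item error."""
    return any(marker in stderr_text for marker in CONNECT_ERROR_MARKERS)
```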

  1. The error count refers to the number of JSON documents processed and the number of those where errors occurred. Given that the client never managed to connect, I think that 0 for 0 is correct. If anything, I think this requires documentation.
  2. Retry behaviour is outside the responsibility of these clients. This is partly to keep things simple and partly because connection failures are normally (for us) not transient, so failing early is preferable.

Having said that, I would accept a patch that added the number of connection retries as a CLI option, with the default being no retries.

Closed as historical, having had no activity.