Client Timeout

tobiasschweizer opened this issue 2 years ago

Tobias Schweizer commented 2 years ago

Hi,

Is there a way to increase the client timeout?
I did a quick search for "timeout", but the only thing I found was this sentence:

"I wanted to crawl Arxiv but found that existing tools would timeout."

:-)

We are harvesting quite a big collection using metha-sync and got:

FATA[5443] Get "https://xyz.ch/request?resumptionToken=2022-05-01T00:00:00Z@2022-05-31T23:59:59Z@set_name@marc21@111111111&verb=ListRecords": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

For now, I have just started the process again.

Thanks for any hint!

Martin Czygan · Answer 1 · Thu Jun 23 2022 07:04:16 GMT+0800 (China Standard Time)

Yes, we have run into similar issues. The timeout was already an option internally, but it was not yet exposed as a flag. I added that in v0.2.32; maybe you could try it with something like:

$ metha-sync -T 5m https://xyz.ch

Tobias Schweizer · Answer 2 · Thu Jun 23 2022 16:01:09 GMT+0800 (China Standard Time)

Hi @miku,

Great, thanks a lot for the quick reaction and releasing this so fast! :-)

I updated to 0.2.32 and am retrying with the new flags -T and -r. I'll keep you posted!

Tobias Schweizer · Answer 3 · Fri Jun 24 2022 16:27:18 GMT+0800 (China Standard Time)

I did not get a timeout anymore. However, I got this:

FATA[36306] read tcp xyz->abc: read: connection reset by peer

I think that, since the collection is very large and the harvesting takes a long time, the server is under some stress.

Are there best practices to automatically check whether the metha-sync harvesting process is still running and restart it if necessary?

Martin Czygan · Answer 4 · Fri Jun 24 2022 17:03:52 GMT+0800 (China Standard Time)

We had similar experiences where, for example, the result set got too large for the server and it bailed out (a timeout on the server side).

One workaround may be to shrink the time window for the harvest, e.g. from monthly (which is the default and seems to work for 99.5% of the cases) to daily (hopefully resulting in smaller result sets):

$ metha-sync -daily -T 5m https://abc.de

Note that this will harvest into the same directory, and metha-cat just picks up any file in the harvesting directory (there is currently no -daily flag on metha-cat). It may be best to stash the existing directory somewhere (if you do not want to lose the previous harvest):

$ mv $(metha-sync -dir https://abc.de) /tmp

[...] and then to start anew with -daily slices.
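If moving the directory into /tmp feels too fragile (a second stash of a directory with the same name would collide with the first, and /tmp is often cleaned on reboot), one alternative is to rename the directory in place with a timestamp suffix. The following is only a sketch: stash_dir is a hypothetical helper, not part of metha, and the suffix format is an arbitrary choice.

```shell
# stash_dir DIR
# Hypothetical helper (not part of metha): renames DIR to
# DIR.<timestamp>.bak next to the original and prints the new path.
stash_dir() {
    local src="${1%/}"    # strip a trailing slash, if any
    local dest="${src}.$(date +%Y%m%dT%H%M%S).bak"
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    # Print the destination only if the rename succeeded.
    mv -- "$src" "$dest" && printf '%s\n' "$dest"
}
```

It could then be called as stash_dir "$(metha-sync -dir https://abc.de)", using the same -dir call as in the mv example above.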

For automatic restarts, I would try a shell wrapper first, something like:

$ until metha-sync -T 5m https://abc.de; do echo "retrying..."; sleep 3; done

I believe metha returns non-zero on failure ("FATA"), so that should work (also, no half-harvested files should remain).
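The until loop above retries forever with a fixed pause, which could keep hammering a server that is already under stress. A possible variation is a wrapper that gives up after a fixed number of attempts and doubles the pause between them. This is a minimal sketch: retry is a hypothetical helper, the 600-second cap is an arbitrary choice, and it relies on the assumption stated above that metha-sync exits non-zero on failure.

```shell
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits with status 0 or
# MAX_ATTEMPTS is reached, doubling the pause after each failure
# (capped at 600 seconds). Returns the last exit status of COMMAND.
retry() {
    local max_attempts="$1"
    local delay="$2"
    local attempt=1
    local status
    shift 2
    while :; do
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (last exit status $status)" >&2
            return "$status"
        fi
        echo "attempt $attempt failed (exit status $status), retrying in ${delay}s..." >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
        if [ "$delay" -gt 600 ]; then
            delay=600
        fi
    done
}
```

For example, retry 20 30 metha-sync -T 5m https://abc.de would make up to 20 attempts, starting with a 30-second pause. Run from cron or a similar scheduler, this would also cover the case where the wrapper itself is no longer running.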