miku / metha

Command line OAI-PMH harvester and client with built-in cache.

Home Page: https://lab.ub.uni-leipzig.de/metha/


Request Entity Too Large

zazi opened this issue · comments

While trying to harvest authority data from the DNB OAI endpoint, I'm getting the following error:

INFO[0000] https://services.dnb.de/oai/repository?from=2008-04-01T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-30T23:59:59Z&verb=ListRecords 
FATA[0001] failed with Request Entity Too Large on https://services.dnb.de/oai/repository?from=2008-04-01T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-30T23:59:59Z&verb=ListRecords: <nil>

Any chance to fix this?

The metha-sync call in question is the following:

metha-sync -format MARC21-xml -set authorities:person https://services.dnb.de/oai/repository
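
For context, metha-sync here issues a plain OAI-PMH ListRecords request. A minimal Go sketch (illustrative only, not metha's actual code) that reconstructs the URL seen in the log above:

package main

import (
    "fmt"
    "net/url"
)

// buildListRecords assembles an OAI-PMH ListRecords URL; url.Values
// encodes the parameters in sorted order, as in the log line above
// (though colons end up percent-encoded here).
func buildListRecords(endpoint, prefix, set, from, until string) string {
    v := url.Values{}
    v.Set("verb", "ListRecords")
    v.Set("metadataPrefix", prefix)
    v.Set("set", set)
    v.Set("from", from)
    v.Set("until", until)
    return endpoint + "?" + v.Encode()
}

func main() {
    fmt.Println(buildListRecords(
        "https://services.dnb.de/oai/repository",
        "MARC21-xml", "authorities:person",
        "2008-04-01T00:00:00Z", "2008-04-30T23:59:59Z"))
}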

@zazi, thanks for the bug report. I could reproduce it. The DNB endpoint is, in general, relatively broken. I believe I have seen this error before:

Your request matches to many records (>100000). The result size is 353017. Please try to restrict the request-period.

$ curl -vL "https://services.dnb.de/oai/repository?from=2008-04-05T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-05T23:59:59Z&verb=ListRecords"
<html><head><title>Error</title></head><body>Your request matches to many records (&gt;100000). The result size is 353017. Please try to restrict the request-period.</body></html>

It's really odd, because even a daily slice (using the -daily flag) is too much. If, in theory, all records carried a single timestamp, there would be no way at all to retrieve the records in a windowed fashion - which in turn means the endpoint would not be fully OAI compliant.

The next thing I would try is:

$ oaicrawl -verbose -f MARC21-xml https://services.dnb.de/oai/repository

We wrote oaicrawl for the zvdd.de OAI endpoint, which calls itself OAI despite being broken. oaicrawl is a much blunter tool: it fetches all identifiers (ListIdentifiers) and then requests each record one by one (GetRecord). Let's see what happens with DNB:

$ oaicrawl -verbose -f MARC21-xml https://services.dnb.de/oai/repository
FATA[2018-07-30T14:15:52+02:00] expected element type <OAI-PMH> but have <html> 
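
For illustration, the two-phase strategy boils down to something like the following (a minimal sketch, not oaicrawl's actual code; the XML struct is trimmed to the one field we need):

package main

import (
    "encoding/xml"
    "fmt"
    "net/http"
)

// Trimmed OAI-PMH envelope, just enough to pull identifiers. Decoding
// fails on DNB exactly as above, because the body is an HTML error
// page instead of an <OAI-PMH> document.
type oaipmh struct {
    XMLName xml.Name `xml:"OAI-PMH"`
    Headers []struct {
        Identifier string `xml:"identifier"`
    } `xml:"ListIdentifiers>header"`
}

func main() {
    endpoint := "https://services.dnb.de/oai/repository"
    resp, err := http.Get(endpoint + "?verb=ListIdentifiers&metadataPrefix=MARC21-xml")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    var doc oaipmh
    if err := xml.NewDecoder(resp.Body).Decode(&doc); err != nil {
        panic(err) // expected element type <OAI-PMH> but have <html>
    }
    // Phase two: fetch each record individually via GetRecord.
    for _, h := range doc.Headers {
        fmt.Println(endpoint + "?verb=GetRecord&metadataPrefix=MARC21-xml&identifier=" + h.Identifier)
    }
}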

Digging into it a bit more:

<title>Error</title>Your request matches to many records (&gt;100000). The result size is 13413063. Please try to restrict the request-period.
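
One way to surface the server's message instead of the opaque decoding error would be to sniff the body before parsing. A crude heuristic sketch (the helper errorMessage is made up for illustration):

package main

import (
    "bytes"
    "fmt"
    "regexp"
)

// errorMessage extracts the human-readable text from an HTML error
// page, or reports false if the body does not start like HTML.
func errorMessage(body []byte) (string, bool) {
    t := bytes.TrimSpace(body)
    if !bytes.HasPrefix(t, []byte("<html")) {
        return "", false
    }
    tags := regexp.MustCompile(`<[^>]+>`)
    return string(bytes.TrimSpace(tags.ReplaceAll(t, []byte(" ")))), true
}

func main() {
    page := []byte(`<html><head><title>Error</title></head><body>Your request matches to many records.</body></html>`)
    if msg, ok := errorMessage(page); ok {
        fmt.Println(msg) // prints the error text with tags stripped
    }
}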

Now, let me rant a bit. Why does OAI have so-called resumption tokens at all, if endpoints refuse to page through large result sets? DataCite, BASE (Bielefeld) and other huge repositories work just fine, paging through the data (tens of millions of records) for days. This is a DNB problem; it would be best if they used their own resources to solve it.
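
For comparison, the normal resumption-token loop looks roughly like this (a sketch; the DataCite endpoint URL and oai_dc prefix are illustrative):

package main

import (
    "encoding/xml"
    "fmt"
    "net/http"
    "net/url"
)

// page is a trimmed ListRecords response: all we need here is the
// resumption token that links to the next page.
type page struct {
    XMLName xml.Name `xml:"OAI-PMH"`
    Token   string   `xml:"ListRecords>resumptionToken"`
}

func main() {
    endpoint := "https://oai.datacite.org/oai"
    params := url.Values{"verb": {"ListRecords"}, "metadataPrefix": {"oai_dc"}}
    for {
        resp, err := http.Get(endpoint + "?" + params.Encode())
        if err != nil {
            panic(err)
        }
        var p page
        err = xml.NewDecoder(resp.Body).Decode(&p)
        resp.Body.Close()
        if err != nil {
            panic(err)
        }
        if p.Token == "" {
            break // last page reached
        }
        // Follow-up requests carry only the verb and the token.
        params = url.Values{"verb": {"ListRecords"}, "resumptionToken": {p.Token}}
        fmt.Println("next page:", p.Token)
    }
}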

Thanks a lot @miku for your very fast reply. I was also about to try oaicrawl for this, but then I thought it might be a bit too much to fetch this rather large authorities set one by one from DNB - so I skipped that approach. Furthermore, as far as I understood oaicrawl's arguments, I cannot specify a concrete set there, right?
Thanks a lot for your feedback, I'll forward it to DNB somehow.
For our concrete use case it might even be enough to get the data excerpt from "Sächsische Bibliographie" via SRU. Then I "only" need to be able to define the appropriate CQL query (which is a bit outside my knowledge so far).

While writing the draft of an answer to DNB and reading their OAI docs again, I came across a possible solution:
since the request returns a 413, which is a standard HTTP status code from RFC 7231, one can make use of this information and reduce the standard interval from daily to e.g. hourly in such cases (which requires setting both parameters, from and until, in the request).
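
A sketch of what that fallback could look like (hypothetical code, not metha's implementation): on a 413, re-harvest the same window in hourly slices.

package main

import (
    "fmt"
    "net/http"
    "time"
)

// harvestWindow requests ListRecords for [from, until). On a 413 it
// retries the window in hourly slices, as proposed above.
func harvestWindow(endpoint string, from, until time.Time) error {
    u := fmt.Sprintf("%s?verb=ListRecords&metadataPrefix=MARC21-xml&set=authorities:person&from=%s&until=%s",
        endpoint, from.Format(time.RFC3339), until.Format(time.RFC3339))
    resp, err := http.Get(u)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode == http.StatusRequestEntityTooLarge {
        if until.Sub(from) <= time.Hour {
            return fmt.Errorf("413 even on an hourly slice: %s", u)
        }
        for t := from; t.Before(until); t = t.Add(time.Hour) {
            end := t.Add(time.Hour)
            if end.After(until) {
                end = until
            }
            if err := harvestWindow(endpoint, t, end); err != nil {
                return err
            }
        }
        return nil
    }
    // ... parse records, follow resumption tokens, persist ...
    return nil
}

func main() {
    from, _ := time.Parse(time.RFC3339, "2008-04-05T00:00:00Z")
    if err := harvestWindow("https://services.dnb.de/oai/repository", from, from.Add(24*time.Hour)); err != nil {
        fmt.Println(err)
    }
}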

Does this sound like a solution to you, @miku?

PS: the DNB OAI docs also say "Depending on the OAI repository these can be either defined to the day (YYYY-MM-DD) or to the second (YYYY-MM-DDThh:mm:ssZ)" - so working with hourly slices might be possible.

curl -vL "https://services.dnb.de/oai/repository?from=2008-04-05T13:00:00Z&until=2008-04-05T14:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&verb=ListRecords"

delivers at least some results (including a resumption token).

I cannot specify a concrete set there, right?

Yes, oaicrawl was more of a one-shot for a particular endpoint and has a minimal feature set.

Thanks a lot for your feedback, I'll forward it to DNB somehow.

I can try to do the same.

Does this sound like a solution to you, @miku?

Yes, sure, this is an option. This is also a limitation of metha which I would like to get rid of one day (it was not essential for the use cases so far, so it is not implemented): it only supports monthly and daily slices, not arbitrary precision.
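
For illustration, arbitrary-precision slicing is essentially a window generator like the following (a sketch, not metha's code):

package main

import (
    "fmt"
    "time"
)

// windows splits [from, until) into consecutive slices of the given
// step, e.g. time.Hour for hourly precision.
func windows(from, until time.Time, step time.Duration) [][2]time.Time {
    var out [][2]time.Time
    for t := from; t.Before(until); t = t.Add(step) {
        end := t.Add(step)
        if end.After(until) {
            end = until
        }
        out = append(out, [2]time.Time{t, end})
    }
    return out
}

func main() {
    from, _ := time.Parse(time.RFC3339, "2008-04-05T00:00:00Z")
    for _, w := range windows(from, from.Add(6*time.Hour), time.Hour) {
        fmt.Printf("from=%s until=%s\n", w[0].Format(time.RFC3339), w[1].Format(time.RFC3339))
    }
}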

OK, we've sent a request to DNB asking whether they can increase the result size limit. On the other hand, we would appreciate it if you could implement the proposed fallback functionality for when a 413 is thrown, i.e., temporarily decrease the interval to hourly (and then go back to daily).