miku / metha

Command line OAI-PMH harvester and client with built-in cache.

Home Page:https://lab.ub.uni-leipzig.de/metha/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Why is data only harvested up to the last day?

gunnihinn opened this issue · comments

The readme says

Currently, there is a limitation which only allows to harvest data up to the last day. Example: If the current date would be Thu Apr 21 14:28:10 CEST 2016, the harvester would request all data since the repositories earliest date and 2016-04-20 23:59:59.

which is indeed the current behavior. Do you remember what the reason for this limitation is? Is it something inherent in the OAI protocol, or does it come from somewhere else?

I'm using metha to harvest the arXiv and am curious about this one-day delay.

It is not a limitation of the protocol, but a implementation tradeoff - that I'd like to fix in some future version.

Basically: OAI allows two date granularities, day and second. In order to have a single filename type on disk (e.g. 2018-04-30-00000000.xml.gz), we used the coarser granularity. Also, we wanted to avoid having to check for duplicates (e.g. when requesting an endpoint, that only supports day-granularity every hour).

It's not ideal, and I have some prototypes for more seamless handling, already - just need to weave it into metha.

Thanks, that's fair enough. If you'd like some help with or review of any of those prototypes or their design or implementation, I'd be happy to be of assistance.

If you'd like some help with or review of any of those prototypes or their design or implementation, I'd be happy to be of assistance.

Thanks.