Implement various harvesting strategies properly.
miku opened this issue · comments
Martin Czygan commented
metha should implement various harvesting strategies:
- normal/default (for standard conform endpoints), harvest windows, daily, monthly, yearly, all
- single records, so individual records may fail or servers are not overloaded
- other modes: all at once
Implementation ideas:
Instead of relying only on files, introduce a small manifest.json describing the harvested content (ids, dates, harvesting dates, files).
Martin Czygan commented
metha was meant be a very simple program (no database, only files and not even metadata about files). In order to keep it simple, a more resilient harvesting approach has been implemented in a separate program: oaicrawl.