alephdata / aleph

Search and browse documents and data; find the people and companies you look for.

Home Page: http://docs.aleph.occrp.org

FEATURE: More flexible ingestion command

PlainSite opened this issue

The alephclient ingestion tool has a "crawldir" command to ingest every file in a folder, but in many cases I just want to ingest one file at a time. I often add PDFs to a folder where some of the files (the old ones) have already been ingested and OCRed while the new ones have not, and it's a waste of CPU resources to repeatedly re-ingest documents that are already in Aleph.
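For reference, the folder-level ingestion mentioned above is invoked roughly like this (the foreign-id and path are placeholders, and ALEPHCLIENT_HOST / ALEPHCLIENT_API_KEY are assumed to be set in the environment):

    alephclient crawldir --foreign-id my-collection ./documents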

By default, alephclient should skip files whose SHA-1 checksum matches a checksum already stored in Elasticsearch (isn't that the point of storing the checksum?), with something like a --force option for when re-running ingestion on those files is genuinely necessary; and/or alephclient should allow more precise targeting of specific files so the re-ingestion problem can be avoided.
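A minimal sketch of the skip-by-checksum idea, written against the Python alephclient library rather than the CLI. This assumes AlephAPI.ingest_upload() and load_collection_by_foreign_id() behave as they do in recent alephclient releases; the local seen-hashes file is a hypothetical stand-in for querying Aleph itself for its stored checksums:

    import hashlib
    import pathlib

    from alephclient.api import AlephAPI

    SEEN_FILE = pathlib.Path("ingested-sha1.txt")  # hypothetical local cache

    def sha1_of(path: pathlib.Path) -> str:
        # Aleph identifies file content by SHA-1, so the same digest
        # lets us detect files it has already seen.
        digest = hashlib.sha1()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def ingest_new_files(api: AlephAPI, collection_id: str,
                         folder: pathlib.Path, force: bool = False) -> None:
        seen = set(SEEN_FILE.read_text().split()) if SEEN_FILE.exists() else set()
        for path in sorted(folder.glob("*.pdf")):
            checksum = sha1_of(path)
            if checksum in seen and not force:
                continue  # already ingested; don't re-OCR it
            api.ingest_upload(collection_id, path,
                              metadata={"file_name": path.name})
            seen.add(checksum)
        SEEN_FILE.write_text("\n".join(sorted(seen)))

    api = AlephAPI(host="https://aleph.example.org", api_key="...")
    collection = api.load_collection_by_foreign_id("my-collection")
    ingest_new_files(api, str(collection["id"]), pathlib.Path("./documents"))

A real --force flag would presumably live in alephclient itself; this wrapper only demonstrates the checksum comparison the request is asking for.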

The alternative of moving files to a random folder in /tmp and ingesting them from there isn't ideal, because alephclient absorbs some of the file-structure metadata while figuring out what it's ingesting.

It would also be nice to be able to define metadata for an ingestion batch by passing some information on the command line. Right now I'm not sure how to do that, and such metadata could be useful later for filtering the ingested data in a particular way.
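Continuing the sketch above, batch metadata might look like the following via the Python API. The keys here are just plausible Aleph document fields; whether arbitrary extra keys are preserved and filterable later is exactly what this request is about:

    # Illustrative only: attach per-batch metadata to each upload.
    metadata = {
        "languages": ["en"],
        "source_url": "https://example.org/filings/2023",
    }
    api.ingest_upload(collection_id, path, metadata=metadata)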

Hey @PlainSite

Thanks for raising this feature request. This is something we are aware of and would like to address in time. Right now the team is focused on a significant backlog of other work, so it may be a while before we can get to this. If you feel like putting together a pull request, we'd love to review it.

Thanks