davidar/oai-sync

This is a quick and dirty script for harvesting metadata from an OAI-PMH enabled repository.

It relies on the Net::OAI::Harvester perl module, and is derived from the oai-listrecords script provided in the examples for that library.

Example for harvesting arXiv metadata (honouring flow control directives):

mkdir arxiv{0,1,2}

# initial sync
./oai-sync.pl --baseURL=http://export.arxiv.org/oai2 --metadataPrefix=arXiv \
              --dumpDir=./arxiv0/

# resume an interrupted sync with the given token
./oai-sync.pl --baseURL=http://export.arxiv.org/oai2 --metadataPrefix=arXiv \
              --dumpDir=./arxiv1/ --resumptionToken='XXXXXX|XXXXXX'

# sync any new changes since the date of the last sync
./oai-sync.pl --baseURL=http://export.arxiv.org/oai2 --metadataPrefix=arXiv \
              --dumpDir=./arxiv2/ --from='2015-03-14'

# split into individual records
for F in ./arxiv*/*.xml; do [ -s $F ] && ./oai-split.sh < $F; done

davidar / oai-sync

About

Languages