Docker container for converting articles in the PubMed Central XML format to plain text
The container makes use of code from the CCP_NLP doc2txt module. Note that a version of this container has been published on DockerHub.
- create a local directory that contains the nxml files you want to convert to text: /path/to/nxml
- create a local directory where the resulting text files will be stored: /path/to/txt
- then run the following:
docker run --rm -v /path/to/nxml:/home/dev/input -v /path/to/txt:/home/dev/output ucdenverccp/nxml2txt:0.1
- After running, the converted text files should be in /path/to/txt/. The text files will be compressed and will have a .utf8.gz file suffix. There will also be an accompanying annotation file for each processed XML file (denoted by the .ann.gz file suffix) that catalogs character offsets within the document for document sections, e.g. ARTICLE_TITLE|0|82 indicates the article title is located between characters 0 and 82.
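
The annotation offsets can be used to pull individual sections back out of the converted text. The following is a minimal sketch, not part of this project, and it rests on assumptions that should be checked against real output: that each annotation line starts with three pipe-delimited fields (label, start, end), that offsets count characters of the decoded UTF-8 text, and that the end offset is exclusive. The names Section, read_text, parse_annotations, and section_text are hypothetical.

```python
import gzip
from typing import List, NamedTuple


class Section(NamedTuple):
    label: str
    start: int
    end: int


def read_text(path: str) -> str:
    """Decompress a gzip file and decode it as UTF-8."""
    # newline="" disables newline translation so character offsets are not shifted.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return handle.read()


def parse_annotations(ann_text: str) -> List[Section]:
    """Parse lines that begin with LABEL|START|END, skipping blank lines."""
    sections = []
    for line in ann_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: the first three pipe-delimited fields are label, start, end.
        # A line with fewer than three fields raises ValueError.
        label, start, end = line.split("|")[:3]
        sections.append(Section(label, int(start), int(end)))
    return sections


def section_text(document: str, section: Section) -> str:
    """Return the slice of the document covered by one annotation."""
    # Assumption: offsets index characters of the decoded text, end exclusive.
    return document[section.start:section.end]
```

For a converted article, one would call read_text on the .utf8.gz file and on the matching .ann.gz file, pass the second result to parse_annotations, and then call section_text for each section of interest. The exact output file names depend on the input files and are not specified here.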
To build the container, run the following command from the root directory of this project:
docker build -t ucdenverccp/nxml2txt:[VERSION] .
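
When the build and the conversion are scripted together, it can help to compose both commands from a single version string so the tag that gets built is the tag that gets run. The sketch below is one way to do that, not something shipped with this project. The helper names build_command and run_command are hypothetical; the image name and the /home/dev/input and /home/dev/output mount points are taken from the commands shown in this README. Host paths are made absolute because Docker treats a non-absolute -v source as a named volume rather than a directory.

```python
import os
from typing import List

IMAGE = "ucdenverccp/nxml2txt"


def build_command(version: str) -> List[str]:
    """Return the argv that builds the image from the current directory."""
    return ["docker", "build", "-t", f"{IMAGE}:{version}", "."]


def run_command(nxml_dir: str, txt_dir: str, version: str) -> List[str]:
    """Return the argv that converts the files in nxml_dir into txt_dir."""
    return [
        "docker", "run", "--rm",
        # Bind mounts need absolute host paths.
        "-v", f"{os.path.abspath(nxml_dir)}:/home/dev/input",
        "-v", f"{os.path.abspath(txt_dir)}:/home/dev/output",
        f"{IMAGE}:{version}",
    ]
```

Either list can be passed to subprocess.run with check=True; build_command must be run from the project root, since it uses "." as the build context.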