UCDenver-ccp / pmc-xml-to-txt-docker

Docker container for converting articles in the PubMed Central XML format to plain text

The container makes use of code from the CCP_NLP doc2txt module. Note that a version of this container has been published on DockerHub.

To convert nxml files to text:

  1. Create a local directory that contains the nxml files you want to convert to text: /path/to/nxml
  2. Create a local directory where the resulting text files will be stored: /path/to/txt
  3. Run the following:
docker run --rm -v /path/to/nxml:/home/dev/input -v /path/to/txt:/home/dev/output ucdenverccp/nxml2txt:0.1
  4. After the run completes, the converted text files should be in /path/to/txt/. The text files are compressed and have a .utf8.gz file suffix. Each processed XML file also gets an accompanying annotation file (denoted by the .ann.gz file suffix) that catalogs the character offsets of document sections, e.g. ARTICLE_TITLE|0|82 indicates that the article title is located between characters 0 and 82.
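
To illustrate how the annotation offsets relate to the converted text, here is a minimal Python sketch that reads one .utf8.gz / .ann.gz pair and extracts each section's text. It is not part of this project. It assumes that each annotation line holds pipe-delimited fields beginning with LABEL|START|END (as in the ARTICLE_TITLE|0|82 example), that offsets count characters in the decoded UTF-8 text, and that END is exclusive; none of this is confirmed beyond the single example above. The names Section, parse_annotation_line, and iter_sections are hypothetical.

```python
import gzip
from pathlib import Path
from typing import Iterator, NamedTuple, Tuple, Union


class Section(NamedTuple):
    label: str
    start: int
    end: int


def parse_annotation_line(line: str) -> Section:
    """Parse one annotation record.

    Assumed layout: LABEL|START|END, e.g. "ARTICLE_TITLE|0|82".
    Any additional trailing fields are ignored.
    """
    fields = line.strip().split("|")
    if len(fields) < 3:
        raise ValueError(f"unexpected annotation line: {line!r}")
    return Section(fields[0], int(fields[1]), int(fields[2]))


def iter_sections(
    txt_gz: Union[str, Path], ann_gz: Union[str, Path]
) -> Iterator[Tuple[Section, str]]:
    """Yield (section, section_text) pairs for one converted document."""
    # newline="" keeps line endings untouched so character offsets stay aligned.
    with gzip.open(txt_gz, "rt", encoding="utf-8", newline="") as fh:
        text = fh.read()
    with gzip.open(ann_gz, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                section = parse_annotation_line(line)
                # Assumption: END is exclusive, like a Python slice bound.
                yield section, text[section.start:section.end]
```

A call such as iter_sections("/path/to/txt/article.utf8.gz", "/path/to/txt/article.ann.gz") would then yield one pair per annotated section; the file names here are placeholders, so check the actual names in your output directory.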

To build the image (optional, since it is available on DockerHub):

From the root directory of this project, run the following command:

docker build -t ucdenverccp/nxml2txt:[VERSION] .

License: MIT License