This repository experiments with using the BioC format to hold annotations of scientific articles, and with tools such as simple regular-expression taggers and conditional random fields (CRFs) to annotate documents.
The BioC annotation format is described in:
Donald C. Comeau, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, Alfonso Valencia, Karin Verspoor, Thomas C. Wiegers, Cathy H. Wu, W. John Wilbur, BioC: a minimalist approach to interoperability for biomedical text processing, Database, Volume 2013, 2013, bat064, doi:10.1093/database/bat064
The JSON version is used by PubTator.
bioc2html.php takes a BioC JSON file and outputs an HTML file with the annotations displayed as coloured boxes on the text (inspired by PubTator).
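The idea of wrapping each annotated stretch of a passage in a coloured element can be illustrated with a minimal Python sketch. This is not the code in bioc2html.php; the function name passage_to_html and the choice of a span whose class is the annotation type are assumptions made for illustration, and overlapping annotations are simply skipped.

```python
import html


def passage_to_html(passage):
    """Render one BioC passage as HTML, wrapping annotations in <span> tags.

    Hypothetical helper: the CSS class is taken from infons["type"].
    """
    text = passage["text"]
    base = passage.get("offset", 0)

    # BioC locations are document-level offsets; convert to passage-local ones.
    spans = sorted(
        (loc["offset"] - base, loc["offset"] - base + loc["length"],
         ann.get("infons", {}).get("type", "entity"))
        for ann in passage.get("annotations", [])
        for loc in ann.get("locations", [])
    )

    out = []
    cursor = 0
    for start, end, label in spans:
        if start < cursor:
            continue  # skip annotations that overlap one already rendered
        out.append(html.escape(text[cursor:start]))
        out.append('<span class="{}">{}</span>'.format(
            html.escape(label), html.escape(text[start:end])))
        cursor = end
    out.append(html.escape(text[cursor:]))
    return "".join(out)
```

A stylesheet could then assign a background colour to each class, for example one colour per entity type.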
jats2bioc.php takes a JATS XML file for an article and converts it to BioC JSON. It is very crude and incomplete, but its key feature is extracting marked-up entities from the XML and outputting them as annotations, which can be visualised using bioc2html.php. One use of jats2bioc.php is to generate training data from content such as Pensoft articles, where entities such as taxonomic names are often already marked up.
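Turning inline markup into offset-based annotations can be sketched in Python as follows. This is an illustration rather than the logic of jats2bioc.php: it assumes entities are tagged with the JATS named-content element and that its content-type attribute gives the entity type, and the function name paragraph_to_passage is hypothetical. Publisher-specific markup may use other elements.

```python
import xml.etree.ElementTree as ET


def paragraph_to_passage(p, offset=0):
    """Flatten one JATS <p> element into a BioC-style passage dict.

    Assumption: entities are marked with <named-content content-type="...">.
    """
    parts = []
    annotations = []
    length = 0  # number of characters of plain text emitted so far

    def walk(node):
        nonlocal length
        start = length
        if node.text:
            parts.append(node.text)
            length += len(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)
                length += len(child.tail)
        if node.tag == "named-content":
            annotations.append({
                "id": str(len(annotations) + 1),
                "infons": {"type": node.get("content-type", "entity")},
                "text": "".join(parts)[start:length],
                # BioC offsets are relative to the document, not the passage.
                "locations": [{"offset": offset + start,
                               "length": length - start}],
            })

    walk(p)
    return {"offset": offset, "text": "".join(parts),
            "annotations": annotations}
```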
bioctagger.php reads a BioC file and adds annotations for entities that it finds in the text passages, for example by matching simple regular expressions. This tool is intended as a quick-and-dirty way of generating training data.
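A regular-expression tagger of this kind can be sketched in a few lines of Python. The patterns below (a rough DOI matcher and a two-letter, six-digit accession-like pattern) and the function name tag_passage are hypothetical; they are not the expressions used by bioctagger.php.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s,;]+"),
    "accession": re.compile(r"\b[A-Z]{2}\d{6}\b"),
}


def tag_passage(passage):
    """Append one BioC annotation per regex match to the passage (in place)."""
    annotations = passage.setdefault("annotations", [])
    next_id = len(annotations) + 1
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(passage["text"]):
            annotations.append({
                "id": str(next_id),
                "infons": {"type": entity_type},
                "text": match.group(0),
                # Match positions are passage-local; BioC wants document offsets.
                "locations": [{
                    "offset": passage.get("offset", 0) + match.start(),
                    "length": len(match.group(0)),
                }],
            })
            next_id += 1
    return passage
```

Because the matches are only as good as the patterns, output like this is best treated as noisy training data rather than as a gold standard.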
bioc2crf.php takes a BioC JSON file and exports it as a data file and a template file that can be used by a CRF tool (in progress). Annotations are exported in IOB (inside, outside, beginning) format.
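The IOB encoding can be illustrated with a short Python sketch: each token is tagged B-type if it starts an annotation, I-type if it continues one, and O otherwise. The tokenizer (runs of word characters, or single punctuation marks) and the function name passage_to_iob are assumptions; bioc2crf.php may tokenize differently and emit additional feature columns.

```python
import re

# Assumed tokenizer: word-character runs or single punctuation characters.
TOKEN = re.compile(r"\w+|[^\w\s]")


def passage_to_iob(passage):
    """Return (token, tag) pairs for one BioC passage."""
    base = passage.get("offset", 0)
    spans = []
    for ann in passage.get("annotations", []):
        for loc in ann["locations"]:
            start = loc["offset"] - base
            spans.append((start, start + loc["length"], ann["infons"]["type"]))

    rows = []
    for tok in TOKEN.finditer(passage["text"]):
        tag = "O"
        for start, end, label in spans:
            # A token is labelled only if it lies entirely inside a span.
            if start <= tok.start() and tok.end() <= end:
                tag = ("B-" if tok.start() == start else "I-") + label
                break
        rows.append((tok.group(0), tag))
    return rows
```

Writing each pair as a tab-separated line, with a blank line between passages, gives the column layout that many CRF tools accept as training data.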
crf2bioc.php takes the results of a CRF run and generates a BioC JSON file, which can then be visualised using bioc2html.php.
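The reverse step, turning predicted IOB tags back into BioC annotations, can be sketched in Python as below. It assumes the tags line up one-to-one with tokens produced by a known tokenizer (here the same hypothetical word-or-punctuation tokenizer), which may not match how crf2bioc.php recovers offsets; the function name iob_to_annotations is likewise illustrative.

```python
import re

# Assumed tokenizer; it must match the one used when the CRF data was written.
TOKEN = re.compile(r"\w+|[^\w\s]")


def iob_to_annotations(text, tags, passage_offset=0):
    """Merge B-/I- tagged tokens into BioC annotation dicts."""
    tokens = list(TOKEN.finditer(text))
    if len(tokens) != len(tags):
        raise ValueError("number of tags does not match number of tokens")

    spans = []       # each entry is [start, end, label]
    is_open = False  # True while the previous token belonged to an entity
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            is_open = False
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and is_open and spans[-1][2] == label:
            spans[-1][1] = tok.end()  # extend the current entity
        else:
            spans.append([tok.start(), tok.end(), label])
            is_open = True

    return [
        {
            "id": str(i),
            "infons": {"type": label},
            "text": text[start:end],
            "locations": [{"offset": passage_offset + start,
                           "length": end - start}],
        }
        for i, (start, end, label) in enumerate(spans, start=1)
    ]
```

An I- tag that does not follow a matching B- or I- tag is treated here as the start of a new entity, which is one common way to tolerate inconsistent predictions.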