BioC annotation

This repository experiments with using the BioC format to hold annotations of scientific articles, and with tools such as simple regular-expression taggers and conditional random fields (CRFs) to annotate documents.

BioC format

The BioC annotation format is described in:

Donald C. Comeau, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, Alfonso Valencia, Karin Verspoor, Thomas C. Wiegers, Cathy H. Wu, W. John Wilbur, BioC: a minimalist approach to interoperability for biomedical text processing, Database, Volume 2013, 2013, bat064, doi:10.1093/database/bat064

The JSON version is used by PubTator.
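As a rough illustration of the JSON layout, the sketch below builds a tiny BioC-style collection as a Python dict and reads the annotated strings back out. It is written in Python rather than the PHP used by the tools here, and the document id, entity type, and example text are made up; the key names (documents, passages, annotations, locations, offset, length) follow the usual BioC JSON layout, with annotation offsets counted from the start of the document rather than the start of the passage.

```python
# A hypothetical two-passage document; names and text are invented for illustration.
collection = {
    "source": "example",
    "date": "",
    "key": "",
    "infons": {},
    "documents": [
        {
            "id": "doc1",
            "infons": {},
            "passages": [
                {
                    "offset": 0,
                    "infons": {"type": "title"},
                    "text": "A new species of Aus from Borneo",
                    "annotations": [
                        {
                            "id": "1",
                            "infons": {"type": "taxon"},
                            "text": "Aus",
                            "locations": [{"offset": 17, "length": 3}],
                        }
                    ],
                },
                {
                    # The second passage starts after the title plus one separator character.
                    "offset": 33,
                    "infons": {"type": "abstract"},
                    "text": "Aus bus is described.",
                    "annotations": [
                        {
                            "id": "2",
                            "infons": {"type": "taxon"},
                            "text": "Aus bus",
                            "locations": [{"offset": 33, "length": 7}],
                        }
                    ],
                },
            ],
        }
    ],
}


def annotated_strings(collection):
    """Return the text covered by every annotation location, in document order."""
    found = []
    for document in collection["documents"]:
        for passage in document["passages"]:
            for annotation in passage.get("annotations", []):
                for location in annotation["locations"]:
                    # Convert the document-level offset to a passage-level index.
                    start = location["offset"] - passage["offset"]
                    found.append(passage["text"][start:start + location["length"]])
    return found
```

Slicing the passage text with each location is a cheap consistency check: the result should equal the annotation's own text field.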

Tools

bioc2html

bioc2html.php takes a BioC JSON file and outputs an HTML file with the annotations displayed as coloured boxes on the text (inspired by PubTator).
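The general idea can be sketched as follows. This is not the code in bioc2html.php; it is a minimal Python illustration that assumes a passage dict in the usual BioC JSON shape and wraps each annotated span in a span element whose class is the annotation type, so that a stylesheet can colour it. The function name passage_to_html is hypothetical.

```python
from html import escape


def passage_to_html(passage):
    """Render one BioC passage as a paragraph with annotated spans marked up."""
    text = passage["text"]
    base = passage.get("offset", 0)

    # Collect (start, end, type) tuples relative to the passage text.
    spans = []
    for annotation in passage.get("annotations", []):
        kind = annotation.get("infons", {}).get("type", "")
        for location in annotation["locations"]:
            start = location["offset"] - base
            spans.append((start, start + location["length"], kind))
    spans.sort()

    out = []
    pos = 0
    for start, end, kind in spans:
        if start < pos:
            continue  # simplification: drop annotations that overlap an earlier one
        out.append(escape(text[pos:start]))
        out.append('<span class="%s">%s</span>' % (escape(kind), escape(text[start:end])))
        pos = end
    out.append(escape(text[pos:]))
    return "<p>" + "".join(out) + "</p>"
```

Escaping each text segment separately keeps characters such as & and < in the article text from breaking the generated HTML.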

jats2bioc

jats2bioc.php takes a JATS XML file for an article and converts it to BioC JSON. The conversion is very crude and incomplete, but its key feature is that it extracts entities that are marked up in the XML and outputs them as annotations. These can be visualised using bioc2html.php. One use of jats2bioc.php is to generate training data from content such as Pensoft articles, where entities such as taxonomic names are often already marked up.
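A minimal sketch of the entity-extraction step, in Python rather than PHP, might look like this. It assumes entities are tagged with the standard JATS named-content element and that its content-type attribute gives the entity type; real articles may use other elements (and XML namespaces), which this sketch does not handle. The function name paragraph_to_passage is hypothetical, and offsets are counted in Python string characters.

```python
import xml.etree.ElementTree as ET


def paragraph_to_passage(p, entity_tag="named-content", offset=0):
    """Flatten one paragraph element to text, recording tagged entities as annotations."""
    parts = []
    annotations = []
    length = 0  # number of characters emitted so far

    def walk(elem):
        nonlocal length
        start = length
        if elem.text:
            parts.append(elem.text)
            length += len(elem.text)
        for child in elem:
            walk(child)
            # Tail text belongs to the parent, so it is added after the child is closed.
            if child.tail:
                parts.append(child.tail)
                length += len(child.tail)
        if elem.tag == entity_tag:
            annotations.append({
                "id": str(len(annotations) + 1),
                "infons": {"type": elem.get("content-type", "entity")},
                "text": "".join(parts)[start:length],
                "locations": [{"offset": offset + start, "length": length - start}],
            })

    walk(p)
    return {
        "offset": offset,
        "infons": {"type": "paragraph"},
        "text": "".join(parts),
        "annotations": annotations,
    }
```

For example, parsing a paragraph with ET.fromstring and passing it to paragraph_to_passage yields a passage whose annotations point at the text inside each named-content element, including text nested in formatting elements such as italic.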

bioctagger

bioctagger.php reads a BioC file and adds annotations for the entities that it finds in the text passages. Entities are found using simple methods such as regular expressions. This tool is intended as a quick-and-dirty way of generating training data.
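The regular-expression approach can be sketched as below. The two patterns (an accession-number-like code and a four-digit year) are illustrative assumptions, not the patterns used by bioctagger.php, and the function name tag_passage is hypothetical. The sketch is in Python and assumes a passage dict in the usual BioC JSON shape.

```python
import re

# Illustrative patterns only; a real tagger would need patterns tuned to its entities.
PATTERNS = {
    "accession": re.compile(r"\b[A-Z]{2}\d{6}\b"),
    "year": re.compile(r"\b(?:18|19|20)\d{2}\b"),
}


def tag_passage(passage):
    """Append one annotation per regex match to the passage and return it."""
    base = passage.get("offset", 0)
    annotations = passage.setdefault("annotations", [])
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(passage["text"]):
            annotations.append({
                "id": str(len(annotations) + 1),
                "infons": {"type": kind},
                "text": match.group(),
                # BioC locations are document-level, so add the passage offset.
                "locations": [{"offset": base + match.start(), "length": len(match.group())}],
            })
    return passage
```

Because matches from different patterns are added independently, overlapping annotations are possible; whether to keep or resolve them is left to the caller.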

bioc2crf

bioc2crf.php takes a BioC JSON file and exports a data file and a template file that can be used by a CRF tool (work in progress). Annotations are exported in IOB format.
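To illustrate the IOB labelling, here is a minimal Python sketch (not the bioc2crf.php code). It assumes a simple tokenizer that splits on word characters and punctuation, and labels the first token of an annotation B-type, later tokens inside it I-type, and everything else O. The function name passage_to_iob is hypothetical, and the sketch assumes annotations start and end on token boundaries.

```python
import re

# Assumed tokenizer: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def passage_to_iob(passage):
    """Return a list of (token, label) pairs for one BioC passage."""
    base = passage.get("offset", 0)
    spans = []
    for annotation in passage.get("annotations", []):
        kind = annotation["infons"]["type"]
        for location in annotation["locations"]:
            start = location["offset"] - base
            spans.append((start, start + location["length"], kind))

    rows = []
    for match in TOKEN.finditer(passage["text"]):
        label = "O"
        for start, end, kind in spans:
            if match.start() >= start and match.end() <= end:
                # B- marks the token that begins the annotation, I- any later token in it.
                label = ("B-" if match.start() == start else "I-") + kind
                break
        rows.append((match.group(), label))
    return rows
```

Writing each pair as a line of the form token, whitespace, label, with a blank line between passages, gives the column layout that CRF tools commonly expect for training data.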

crf2bioc

crf2bioc.php takes the output of a CRF tool and generates a BioC JSON file, which can then be visualised using bioc2html.php.
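The reverse step, turning IOB-labelled tokens back into annotations, can be sketched in Python as follows. This is an illustration rather than the crf2bioc.php code: it assumes the original passage text is available and that the tokens occur in it in order, so each token's position can be recovered by searching forward. The function name iob_to_annotations is hypothetical.

```python
def iob_to_annotations(text, rows, offset=0):
    """Merge (token, label) pairs into BioC-style annotations over text."""
    spans = []     # each span is [start, end, type]
    pos = 0        # search position in text
    inside = None  # type of the span currently being extended, if any

    for token, label in rows:
        start = text.index(token, pos)  # raises ValueError if a token is missing
        pos = start + len(token)
        if label == "O":
            inside = None
            continue
        prefix, kind = label.split("-", 1)
        if prefix == "I" and inside == kind:
            spans[-1][1] = pos  # extend the open span to cover this token
        else:
            # B- labels, and stray I- labels with no open span, start a new span.
            spans.append([start, pos, kind])
        inside = kind

    return [
        {
            "id": str(i),
            "infons": {"type": kind},
            "text": text[start:end],
            "locations": [{"offset": offset + start, "length": end - start}],
        }
        for i, (start, end, kind) in enumerate(spans, 1)
    ]
```

Treating a stray I- label as the start of a new span is a design choice that tolerates imperfect CRF output instead of discarding it.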
