webanno_tsv

A Python library for parsing TSV files as produced by the WebAnno software and described in its documentation.

The following features are supported:

WebAnno's UTF-16 indices for text positions
WebAnno's escape sequences
Multiple span annotation layers with multiple fields
Span annotations over multiple tokens and sentences
Multiple annotations per field (stacked annotations)
Disambiguation IDs (here called label_id)

The following is not supported:

Relations
Chain annotations
Sub-token annotations (ignored on reading)
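
The UTF-16 indices in the feature list count UTF-16 code units, which differ from Python string indices whenever a character lies outside the Basic Multilingual Plane. The following is a minimal sketch of that difference using only the standard library. The helper names utf16_len and utf16_offsets are hypothetical and not part of this library's API, and the sketch assumes tokens are separated by exactly one character.

```python
def utf16_len(text):
    """Return the number of UTF-16 code units needed to encode text."""
    return len(text.encode('utf-16-le')) // 2


def utf16_offsets(tokens):
    """Return (start, end, token) tuples with offsets in UTF-16 code units.

    Assumption: tokens are separated by exactly one character (a space).
    """
    result = []
    pos = 0
    for tok in tokens:
        end = pos + utf16_len(tok)
        result.append((pos, end, tok))
        pos = end + 1  # skip the single separator
    return result


# The emoji is one Python character but two UTF-16 code units.
print(len('😀'), utf16_len('😀'))
# Prints:
# 1 2

print(utf16_offsets(['Hi', '😀', 'there']))
# Prints:
# [(0, 2, 'Hi'), (3, 5, '😀'), (6, 11, 'there')]
```

For plain ASCII text both ways of counting agree, so the difference only shows up with characters such as emoji or some rarer scripts.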

Installation

pip install git+https://github.com/neuged/webanno_tsv

Examples

To construct a Document with annotations you could do:

from webanno_tsv import Document, Annotation
from dataclasses import replace

sentences = [
    ['First', 'sentence'],
    ['Second', 'sentence']
]
doc = Document.from_token_lists(sentences)

layer_defs = [('Layer1', ['Field1']), ('Layer2', ['Field2', 'Field3'])]
annotations = [
    Annotation(tokens=doc.tokens[1:2], layer='Layer1', field='Field1', label='ABC'),
    Annotation(tokens=doc.tokens[1:3], layer='Layer2', field='Field3', label='XYZ', label_id=1)
]
doc = replace(doc, annotations=annotations, layer_defs=layer_defs)
doc.tsv()

The call to doc.tsv() then returns a string:

#FORMAT=WebAnno TSV 3.3
#T_SP=Layer1|Field1
#T_SP=Layer2|Field2|Field3


#Text=First sentence
1-1	0-5	First	_	_	_
1-2	6-14	sentence	ABC	*[1]	XYZ[1]

#Text=Second sentence
2-1	15-21	Second	_	*[1]	XYZ[1]
2-2	22-30	sentence	_	_	_
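
To make the column layout of the output above more concrete, here is a minimal standalone sketch that splits a single token line into its parts. It is not the parser used by this library: parse_token_line, parse_field and LABEL_RE are hypothetical names, and the sketch ignores stacked annotations and escape sequences.

```python
import re

# A label with an optional trailing disambiguation id, e.g. 'XYZ[1]'.
LABEL_RE = re.compile(r'^(?P<label>.*?)(?:\[(?P<label_id>\d+)\])?$')


def parse_field(value):
    """Return (label, label_id) for one column value, or None for '_'."""
    if value == '_':
        return None
    match = LABEL_RE.match(value)
    label_id = match.group('label_id')
    return match.group('label'), int(label_id) if label_id else None


def parse_token_line(line):
    """Split one tab-separated token line into position, text and fields."""
    columns = line.rstrip('\n').split('\t')
    sentence_idx, token_idx = (int(n) for n in columns[0].split('-'))
    start, end = (int(n) for n in columns[1].split('-'))
    return {
        'sentence_idx': sentence_idx,
        'token_idx': token_idx,
        'start': start,
        'end': end,
        'text': columns[2],
        # One entry per field column, in the order of the #T_SP header lines.
        'fields': [parse_field(value) for value in columns[3:]],
    }


parsed = parse_token_line('1-2\t6-14\tsentence\tABC\t*[1]\tXYZ[1]')
print(parsed['sentence_idx'], parsed['token_idx'], parsed['text'])
# Prints:
# 1 2 sentence

print(parsed['fields'])
# Prints:
# [('ABC', None), ('*', 1), ('XYZ', 1)]
```

The three field entries correspond to Field1, Field2 and Field3. The sketch keeps the value in the Field2 column as the literal label '*' and does not interpret it further.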

Given a file that contains the output above, you could read it back and query it like this:

from webanno_tsv import webanno_tsv_read_file

doc = webanno_tsv_read_file('/tmp/input.tsv')

for token in doc.tokens:
    if token.text == 'sentence':
        print(token.sentence_idx, token.idx)

# Prints:
# 1 2
# 2 2

for annotation in doc.match_annotations(layer='Layer2'):
    print(annotation.layer, annotation.field, annotation.label)

# Prints:
# Layer2 Field3 XYZ

for annotation in doc.match_annotations(sentence=doc.sentences[0]):
    print(annotation.layer, annotation.field, annotation.label)

# Prints:
# Layer1 Field1 ABC
# Layer2 Field3 XYZ

# Some convenience lookup functions are available on the Document instance
doc.token_sentence(doc.tokens[0])
doc.sentence_tokens(doc.sentences[0])
doc.annotation_sentences(doc.annotations[0])

Possible gotcha: the classes in this library are read-only dataclasses (dataclasses with frozen=True).

This means that their fields cannot be assigned to. You can, however, create modified copies with dataclasses.replace():

from dataclasses import replace
from webanno_tsv import Token

t1 = Token(sentence_idx=1, idx=0, start=0, end=3, text='Foo')
t2 = replace(t1, text='Bar')
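
The behaviour described above can be tried out without the library installed. The sketch below uses a hypothetical stand-in class named Tok (not part of this library) to show that assigning to a field of a frozen dataclass raises an error, while dataclasses.replace() returns a new instance and leaves the original unchanged.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in for a frozen dataclass such as Token."""
    idx: int
    text: str


t1 = Tok(idx=0, text='Foo')

try:
    t1.text = 'Bar'  # direct assignment is rejected on frozen dataclasses
    assignment_failed = False
except FrozenInstanceError:
    assignment_failed = True

t2 = replace(t1, text='Bar')  # builds a new instance, t1 stays untouched

print(assignment_failed, t1.text, t2.text)
# Prints:
# True Foo Bar
```

Because frozen dataclasses with eq=True (the default) are hashable, instances like these can also be used in sets and as dictionary keys.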

Development

Run the tests with:

python -m unittest test/*.py

PRs always welcome!
