Janus

Janus stores XML documents and annotations on them in Elasticsearch.

Usage

You need Elasticsearch to use Janus. Copy config-template.yml to config.yml and edit it to reflect your Elasticsearch settings. The default settings should work if you want to do a quick test using the Docker Hub image (docker run -p 9200:9200 elasticsearch:5.4-alpine). Now compile and run Janus:

mvn clean package &&
    ./target/appassembler/bin/janus server config.yml

or use the Dockerfile, docker run --net=host $(docker build -q .).

Under elasticsearch.fields in the config file is a mapping from XML elements (selected using XPath) to fields in Elasticsearch. To see what this does, upload an XML file to Janus. It is indexed as a document with one annotation per XML element:

curl -X PUT -H "Content-Type: application/xml"  \
    http://localhost:8080/documents/some_id --data-binary @example.xml
curl http://localhost:8080/documents/some_id

Add an annotation:

curl -X POST -H 'Content-Type: application/json' \
    http://localhost:8080/documents/some_id/annotations -d '{
        "target": "some_id", "start": 4, "end": 10,
        "type": "note", "source": "user"
    }'

This reports the (autogenerated) id of the annotation.
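
The same two requests can also be issued from a script. What follows is a minimal Python sketch, assuming Janus listens on http://localhost:8080 as in the curl examples above. The helper names put_document_request and post_annotation_request are hypothetical and not part of Janus. The helpers only build the requests and nothing is sent until a request is passed to urllib.request.urlopen.

```python
import json
import urllib.request

# Assumption: same address as in the curl examples.
BASE_URL = "http://localhost:8080"


def put_document_request(doc_id, xml_bytes, base_url=BASE_URL):
    """Build the PUT request that uploads an XML document under doc_id."""
    return urllib.request.Request(
        f"{base_url}/documents/{doc_id}",
        data=xml_bytes,
        headers={"Content-Type": "application/xml"},
        method="PUT",
    )


def post_annotation_request(doc_id, start, end, type_, source, base_url=BASE_URL):
    """Build the POST request that adds an annotation to document doc_id."""
    payload = {
        "target": doc_id,
        "start": start,
        "end": end,
        "type": type_,
        "source": source,
    }
    return urllib.request.Request(
        f"{base_url}/documents/{doc_id}/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send a request, call urllib.request.urlopen(request).read(). For the annotation request, the response should contain the id that Janus generated.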

Conceptual overview

Janus stores documents, which are strings of (Unicode) text, and annotations. Each document and each annotation has an identifier, which uniquely addresses it within a running Janus instance.

The documents go into an Elasticsearch index to allow full-text search.

An annotation is a span (start, end) within a document, called the target, together with some metadata. The target is denoted in the API by its identifier. In addition, an annotation has a source and a type. The source denotes where the annotation came from, and could be "user" for a GUI tool or "ner" for a named-entity recognizer. Janus itself uses only the source "xml", in its XML uploader. The type field is free for the source to fill in. An annotation also has attributes (attrib), a key-value (string-string) dictionary that corresponds to the attributes on XML tags.

Finally, an annotation may have a body, which is (the identifier of) a different document. An annotation without a body can be thought of as a highlighted portion of text. The annotation

{"start": 0, "end": 4, "target": "draft-paper", "source": "user",
 "type": "green"}

might indicate that the user used a green highlighter on the first four characters (Unicode codepoints) of the document draft-paper. By contrast,

{"start": 4, "end": 10, "target": "draft-paper", "source": "user",
 "type": "yellow", "body": "note-1"}

would indicate that the user put a yellow sticky note on draft-paper, to annotate the next six characters. The sticky note is itself a document, stored under the identifier note-1.
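
The model described above can be sketched in a few lines of Python. This is an illustration only: the Annotation class, its field names, the covered_text helper and the sample texts are all hypothetical and are not Janus's own types or data. Offsets are counted in Unicode codepoints, which is also how Python indexes strings, so a plain slice recovers the annotated text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Annotation:
    """Illustrative stand-in for an annotation; not Janus's own class."""
    start: int                  # offset of the first annotated codepoint
    end: int                    # offset just past the last annotated codepoint
    target: str                 # identifier of the annotated document
    source: str                 # where the annotation came from, e.g. "user"
    type: str                   # free-form, chosen by the source
    body: Optional[str] = None  # identifier of another document, if any


def covered_text(annotation: Annotation, documents: Dict[str, str]) -> str:
    """Return the part of the target document that the annotation spans."""
    text = documents[annotation.target]
    return text[annotation.start:annotation.end]


# Made-up sample data: two documents keyed by identifier.
documents = {
    "draft-paper": "Some draft text here",
    "note-1": "Check this claim",
}

highlight = Annotation(0, 4, "draft-paper", "user", "green")
sticky = Annotation(4, 10, "draft-paper", "user", "yellow", body="note-1")

print(covered_text(highlight, documents))  # the first four characters
print(documents[sticky.body])              # the text of the sticky note
```

An annotation without a body, like highlight, only marks a span. One with a body, like sticky, also links that span to a second document.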

REST API

Janus communicates with the outside world using a web API. For documentation, fire up Janus and visit http://localhost:8080/swagger/.

GraphQL endpoint

As an alternative to the REST API, we are developing a GraphQL endpoint for communicating with Janus. It can be reached at /graphql and currently supports only read queries, not mutations.

The GraphQL schema can be found in the source code at src/main/resources/schema.graphqls. It can also be obtained by introspection on the GraphQL endpoint, e.g.,

curl -H 'Content-Type: application/graphql' \
    http://localhost:8080/graphql -d '{
        __schema {
            queryType {
                name
                fields {
                    name
                    type {
                        name
                        kind
                    }
                }
            }
        }
    }
' | jq .data

Special support for XML

Janus has special support for XML documents, which are parsed and turned into a flat text document and one annotation per XML element. The configuration file has more details on the way the XML is parsed.

The text document corresponds to the text in between the tags; in XPath terminology, it's the string() of the whole document. Each element is turned into an annotation with the following properties:

The source of the annotation is "xml".
The type is its XML tag.
The target is the document's id.
The start and end denote where the start and end tags were found in the text document.
The attributes of the tag are stored in the "attrib" field, as strings.
The body is empty (null).
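
The transformation described above can be approximated with Python's standard xml.etree.ElementTree module. This is a sketch of the idea, not Janus's actual parser: the function name flatten is hypothetical, and it ignores whatever the configuration file changes about parsing. It returns the flat text and one annotation per element, in document order, with offsets counted in codepoints.

```python
import xml.etree.ElementTree as ET


def flatten(xml_source, doc_id):
    """Return (text, annotations) for an XML string.

    text is the concatenation of all text nodes; each element becomes a
    dict with the properties listed above.
    """
    root = ET.fromstring(xml_source)
    parts = []        # pieces of the flat text, in document order
    annotations = []  # one entry per element, in document order
    offset = 0        # number of codepoints emitted so far

    def emit(piece):
        nonlocal offset
        if piece:
            parts.append(piece)
            offset += len(piece)

    def walk(elem):
        annotation = {
            "source": "xml",
            "type": elem.tag,
            "target": doc_id,
            "start": offset,       # position of the start tag
            "end": None,           # filled in once the end tag is reached
            "attrib": dict(elem.attrib),
            "body": None,
        }
        annotations.append(annotation)
        emit(elem.text)            # text before the first child
        for child in elem:
            walk(child)
            emit(child.tail)       # text between this child and the next
        annotation["end"] = offset  # position of the end tag

    walk(root)
    return "".join(parts), annotations


text, annotations = flatten('<p>Hello <b n="1">big</b> world</p>', "some_id")
print(text)
for a in annotations:
    print(a["type"], a["start"], a["end"], a["attrib"])
```

For the sample input, the p element spans the whole text and the b element spans only the word "big".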

To get the XML document as you uploaded it, use the /orig path on the document, e.g.:

curl http://localhost:8080/documents/some_id/orig

Example: bulk indexing

To upload XML files in bulk for indexing, use something like:

find some_dir -name '*.xml' -print0 |
    xargs -0 -n 1 -P "$(nproc)" sh -c '
        curl -s -X PUT -H "Content-Type: application/xml"  \
            "http://localhost:8080/documents/$(uuidgen)" --data-binary "@$1"
        echo " " "$1"
    ' sh

This indexes all XML files below some_dir, assigning to each a UUID. It prints to stdout a list of UUID/path pairs.
