Janus stores XML documents and annotations on them in Elasticsearch.
You need Elasticsearch to use Janus. Copy config-template.yml to config.yml
and edit it to reflect your Elasticsearch settings. The default settings
should work if you want to do a quick test using the Docker Hub image
(docker run -p 9200:9200 elasticsearch:5.4-alpine).
Now compile and run Janus:
mvn clean package && ./target/appassembler/bin/janus server config-template.yml
or use the Dockerfile: docker run --net=host $(docker build -q .)
Under elasticsearch.fields in the config file is a mapping from XML
elements (using XPath) to fields in Elasticsearch. To see what this does,
upload an XML file to Janus to have it indexed as a document with one
annotation per XML element, then fetch it back:
curl -X PUT -H "Content-Type: application/xml" \
    http://localhost:8080/documents/some_id --data-binary @example.xml
curl http://localhost:8080/documents/some_id
Add an annotation:
curl -X POST -H 'Content-Type: application/json' \
    http://localhost:8080/documents/some_id/annotations -d '{
      "target": "some_id",
      "start": 4,
      "end": 10,
      "type": "note",
      "source": "user"
    }'
This reports the (autogenerated) id of the annotation.
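The same request can be prepared from a script. The following is a minimal Python sketch, assuming the endpoint and field names shown in the curl command above and a Janus instance on localhost:8080. The helper name annotation_request is hypothetical and not part of Janus. It only builds the request; sending it requires a running instance.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: Janus on its default local port


def annotation_request(doc_id, start, end, type_, source):
    """Build (but do not send) a POST request that adds an annotation.

    Hypothetical helper: it mirrors the curl command above, nothing more.
    """
    payload = {
        "target": doc_id,
        "start": start,
        "end": end,
        "type": type_,
        "source": source,
    }
    return urllib.request.Request(
        f"{BASE_URL}/documents/{doc_id}/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it against a running Janus:
#     with urllib.request.urlopen(annotation_request("some_id", 4, 10, "note", "user")) as resp:
#         print(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect before anything is written to the index.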
Janus stores documents, which are strings of (Unicode) text, and annotations. Each document and each annotation has an identifier, which uniquely addresses it within a running Janus instance.
The documents go into an Elasticsearch index to allow full-text search.
Annotations are spans of text with some metadata. They can be thought of as
a span (start, end) within a document, called the target. The target is
denoted in the API by its identifier. In addition, an annotation has a
source and a type. The source denotes where the annotation came from, and
could be "user" for a GUI tool or "ner" for a named-entity recognizer. Only
the source "xml" is used by Janus itself, in its XML uploader. The type
field is free for the source to fill in.
An annotation also has attributes (attrib), a key-value (string-string)
dictionary that corresponds to the attributes on XML tags.
Finally, an annotation may have a body, which is (the identifier of) a
different document. An annotation without a body can be thought of as a
highlighted portion of text. The annotation
{"start": 0, "end": 4, "target": "draft-paper", "source": "user", "type": "green"}
might indicate that the user used a green highlighter on the first four
characters (Unicode codepoints) of the document draft-paper. By contrast,
{"start": 4, "end": 10, "target": "draft-paper", "source": "user", "type": "yellow", "body": "note-1"}
would indicate that the user put a yellow sticky note on draft-paper,
to annotate the next six characters. The sticky note is itself a document,
its contents being the document note-1.
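The data model described above can be sketched in a few lines of Python. This is an illustration only, not Janus's own code: the Annotation class and the helpers covered_text and is_highlight are hypothetical names, the fields follow the start/end, target, source, type, attrib and body properties named in the text, and offsets are assumed to be half-open (end exclusive), which matches the "first four" and "next six" characters in the two examples. Python strings are indexed by Unicode code point, so plain slicing lines up with the offsets.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Annotation:
    """Hypothetical mirror of the annotation fields described in the text."""
    start: int
    end: int
    target: str                      # identifier of the annotated document
    source: str                      # e.g. "user", "ner", or "xml"
    type: str                        # free for the source to fill in
    attrib: Dict[str, str] = field(default_factory=dict)
    body: Optional[str] = None       # identifier of another document, if any


def covered_text(document_text: str, ann: Annotation) -> str:
    """Return the part of the target's text that the annotation spans.

    Assumes offsets count code points and that `end` is exclusive.
    """
    return document_text[ann.start:ann.end]


def is_highlight(ann: Annotation) -> bool:
    """An annotation without a body acts as a plain highlight."""
    return ann.body is None


highlight = Annotation(0, 4, "draft-paper", "user", "green")
sticky = Annotation(4, 10, "draft-paper", "user", "yellow", body="note-1")
```

Given the text of draft-paper as a string, covered_text returns the four highlighted characters for highlight and the following six for sticky.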
Janus communicates with the outside world using a web API. For documentation, fire up Janus and visit http://localhost:8080/swagger/.
As an alternative to the REST API, we are developing a GraphQL endpoint for
communicating with Janus. It can be reached at /graphql and currently
supports only read queries, not mutations.
The GraphQL schema can be found in the source code at
src/main/resources/schema.graphqls. It can also be obtained by
introspection on the GraphQL endpoint, e.g.,
curl -H 'Content-Type: application/graphql' \
    http://localhost:8080/graphql -d '{
      __schema {
        queryType {
          name
          fields {
            name
            type { name kind }
          }
        }
      }
    }' | jq .data
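For scripted use, the introspection call above might look as follows in Python. This is a sketch under assumptions: the helper names graphql_request and query_fields are hypothetical, the endpoint and content type are taken from the curl command, and the response is assumed to have the standard GraphQL introspection shape (data, __schema, queryType, fields).

```python
import json
import urllib.request

# The same introspection query as in the curl example.
INTROSPECTION_QUERY = (
    "{ __schema { queryType { name fields { name type { name kind } } } } }"
)


def graphql_request(query, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request to the /graphql endpoint."""
    return urllib.request.Request(
        f"{base_url}/graphql",
        data=query.encode("utf-8"),
        headers={"Content-Type": "application/graphql"},
        method="POST",
    )


def query_fields(response_body):
    """Map each field of the query type to its (type name, kind) pair.

    `response_body` is the JSON text returned for INTROSPECTION_QUERY.
    """
    query_type = json.loads(response_body)["data"]["__schema"]["queryType"]
    return {
        f["name"]: (f["type"]["name"], f["type"]["kind"])
        for f in query_type["fields"]
    }


# Against a running Janus:
#     with urllib.request.urlopen(graphql_request(INTROSPECTION_QUERY)) as resp:
#         print(query_fields(resp.read().decode("utf-8")))
```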
Janus has special support for XML documents, which are parsed and turned into a flat text document and one annotation per XML element. The configuration file has more details on the way the XML is parsed.
The text document corresponds to the text in between the tags; in XPath
terminology, it's the string() of the whole document. Each element is
turned into an annotation with the following properties:
- The source of the annotation is "xml".
- The type is its XML tag.
- The target is the document's id.
- The start and end denote where the start and end tags were found in the
  text document.
- The attributes of the tag are stored in the "attrib" field, as strings.
- The body is empty (null).
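The properties listed above can be illustrated with a small Python sketch built on the standard library's xml.etree.ElementTree. It is not Janus's actual parser and may differ from it in details such as whitespace, namespaces, or configuration-driven behaviour; the function name flatten is hypothetical. It concatenates the text between the tags and records, for each element, the offsets at which its start and end tags fall in that text.

```python
import xml.etree.ElementTree as ET


def flatten(xml_string, doc_id):
    """Return (text, annotations) for an XML string.

    `text` is the concatenated character data (like XPath's string()),
    and there is one annotation dict per element, in document order.
    """
    root = ET.fromstring(xml_string)
    parts = []        # text chunks in document order
    annotations = []
    offset = 0        # number of code points emitted so far

    def visit(elem):
        nonlocal offset
        ann = {
            "source": "xml",
            "type": elem.tag,
            "target": doc_id,
            "start": offset,           # position of the start tag
            "end": None,               # filled in once the end tag is reached
            "attrib": dict(elem.attrib),
            "body": None,
        }
        annotations.append(ann)
        if elem.text:
            parts.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            visit(child)
            if child.tail:             # text between a child's end tag and the next tag
                parts.append(child.tail)
                offset += len(child.tail)
        ann["end"] = offset

    visit(root)
    return "".join(parts), annotations


if __name__ == "__main__":
    text, anns = flatten('<p>Hello <b class="x">big</b> world</p>', "some_id")
    print(repr(text))
    for a in anns:
        print(a["type"], a["start"], a["end"], a["attrib"])
```

For the sample input, the text is "Hello big world"; the p element spans 0 to 15 and the b element spans 6 to 9, with attrib {"class": "x"}.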
To get the XML document as you uploaded it, use the /orig path on the
document, e.g.:
curl http://localhost:8080/documents/some_id/orig
To upload XML files in bulk for indexing, use something like:
find some_dir -name '*.xml' -print0 |
    xargs -0 -n 1 -P "$(nproc)" sh -c '
        curl -s -X PUT -H "Content-Type: application/xml" \
            http://localhost:8080/documents/$(uuidgen) --data-binary @"$0"
        echo " " "$0"
    '
This indexes all XML files below some_dir, assigning to each a UUID.
It prints to stdout a list of UUID/path pairs.
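A rough Python equivalent of the bulk upload is sketched below, assuming the same PUT endpoint as in the shell command. The helper names plan_uploads and upload_request are hypothetical. Unlike the xargs -P pipeline, this version would upload sequentially, and it generates the UUIDs itself rather than calling uuidgen.

```python
import urllib.request
import uuid
from pathlib import Path


def plan_uploads(root_dir):
    """Yield a (uuid, path) pair for every *.xml file below root_dir."""
    for path in sorted(Path(root_dir).rglob("*.xml")):
        yield str(uuid.uuid4()), path


def upload_request(doc_id, path, base_url="http://localhost:8080"):
    """Build (but do not send) the PUT request that indexes one XML file."""
    return urllib.request.Request(
        f"{base_url}/documents/{doc_id}",
        data=Path(path).read_bytes(),
        headers={"Content-Type": "application/xml"},
        method="PUT",
    )


# Against a running Janus:
#     for doc_id, path in plan_uploads("some_dir"):
#         urllib.request.urlopen(upload_request(doc_id, path)).close()
#         print(doc_id, path)
```

Printing the pairs as they are uploaded keeps a record of which UUID was assigned to which file, as the shell version does.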