ASKEM Extractions Data Model

This module contains the implementation of the TA1 extractions data model, used for interoperability within TA1 and with TA4.

The Entity-Relationship diagram with the specification of the model can be found here.

Installation

Clone this repository to your workstation, then install it in a virtual environment:

pip install ".[all]"

If you want to install the package in development (editable) mode, use instead:

pip install -e ".[all]"

Installation without cloning the git repository

If you want to install this module directly, without having to clone the repository locally, you can do so by running:

pip install git+https://github.com/ml4ai/ASKEM-TA1-DataModel
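
Whichever installation method you used, a quick way to verify the installation is to import the package. This is a minimal sketch; it assumes the importable package is named askem_extractions, matching the distribution name:

# Smoke test: the package should import without errors after installation
import askem_extractions

print("askem_extractions is installed")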

Uninstalling

To remove the package from a virtual environment, use:

pip uninstall askem-extractions

Usage examples

The script in examples/usage.py contains an example of how to import extractions from Arizona's text reading pipeline into the data model.

from pathlib import Path
from askem_extractions.importers import import_arizona  # import path assumed here; see examples/usage.py for the canonical imports

# Import Arizona extractions into our data model
path_to_json = Path(__file__).parent / Path("arizona_output_example.json")
collection = import_arizona(path_to_json)

The AttributeCollection model can serialize the extractions to JSON:

# Save the collection of Arizona extractions in the standard JSON format
collection.save_json("temp.json")

It can also load previously serialized JSON files:

# Reloads the collection from the json file
deserialized = AttributeCollection.from_json("temp.json")

The model loaded from disk will be equivalent to the one imported from the performer-specific output format:

# Both collections should be equal. Since we are using pydantic, it will do a deep comparison
assert collection == deserialized, "Deserialization didn't work"

Generate JSON Schema

To generate the JSON schema, we can leverage pydantic to produce it automatically using the following code snippet:

print(AttributeCollection.schema_json(indent=2))
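
If the schema should be written to disk rather than printed, a small variation works. This is a minimal sketch; the import path and the file name askem_schema.json are assumptions:

from pathlib import Path

from askem_extractions.data_model import AttributeCollection  # import path is an assumption

# Write the generated JSON schema to a file instead of printing it
Path("askem_schema.json").write_text(AttributeCollection.schema_json(indent=2))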

Dockerization

A Docker container built from this project's Dockerfile will normalize the TA-1 performers' proprietary files and merge the results into a single output file.

To build the image, use the following command:

docker build -f Dockerfile -t askem_ta1_datamodel .

A container created from this image expects to find the files /data/arizona_extractions.json and /data/mit_extractions.json, and will store the output in /data/ta1_extractions.json.

The simplest way to run it is by mapping a directory containing both input files to /data, for example, the current directory:

docker run -it --rm -v $(pwd):/data askem_ta1_datamodel

Alternatively, finer-grained control of the path names can be achieved by passing the command-line parameters explicitly:

docker run -it --rm -v $(pwd):/data askem_ta1_datamodel ./normalize_extractions.sh -a /data/arizona_extractions.json -m /data/mit_extractions.json -o /data/ta1_extractions.json
