sujitpal / nerds

nerds

How to set up a DEV environment

Required Python version >= 3.6

Setting up the environment with pipenv

pipenv is a utility that manages virtual environments and pip dependencies at the same time. To install it, navigate to the project's root directory and run:

pip3 install pipenv

Installing through pip3 makes pipenv use your Python 3 installation, which needs to be version 3.6 or higher. Please refer to the official website for more information on pipenv.
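To confirm that the active interpreter meets the version requirement before installing anything, a quick check along these lines can help. It is a standalone sketch and not part of the project:

```python
import sys

# Minimum version stated in this README.
required = (3, 6)

# sys.version_info compares element-wise against a plain tuple.
ok = sys.version_info >= required
print("Python", sys.version.split()[0], "OK" if ok else "too old")
```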

A Makefile has been created for convenience, so that you can easily install the project dependencies, download the required models, and test and build the tool. This is the preferred environment setup approach: the Pipfile and Pipfile.lock files ensure that you automatically have access to the packages listed in requirements.txt after you run make install (see below).

Setting up the environment using conda

Alternatively, if you are using the Anaconda distribution of Python, you can also use conda to create an environment using the following command:

conda create -n nerds python=3.6 anaconda

You can then enter the newly created conda environment using the following command:

conda activate nerds

After you run the various make ... commands, the packages listed in requirements.txt and the downloaded models will only be visible inside the nerds environment. Working inside a dedicated environment like this is usually preferred, since it helps prevent version collisions between different environments, at the cost of more disk space.

To exit the environment, run:

conda deactivate

Makefile specifications

To install all of the required packages for development and testing run:

make install

The tool will not run without an English language model and a tagger. To download spaCy's English language model and NLTK's default tagger, run:

make download_models

To execute the unit tests run:

make test

Code quality checks can be run with:

make lint

A wheel distribution of this tool can be created with:

make dist

How to write your own NER model

NERDS is a framework that provides some NER capabilities, including the option of creating ensembles of NER models, but it is primarily made to be extended. In the following sections we take a look at the basic data exchange format and how you can use it to create your own models.

Understanding the main data exchange classes

The upstream NERDS project (elsevierlabs-os/nerds) uses a set of custom data exchange classes: Document, Annotation, and AnnotatedDocument. It provides conversion utilities to convert datasets into this format, and to convert instances of these classes back into whatever format the underlying wrapped NER model needs. This fork (sujitpal/nerds) eliminates that requirement: the internal format is just a list of lists of tokens (the words in each sentence) or of BIO tags. The utility function nerds.utils.load_data_and_labels reads a file in CoNLL BIO format and converts it to this internal format. This decision was made because 3 of the 5 provided models consume the list-of-lists format natively, and the result is fewer lines of extra code and less potential for error.

In general, when the input is not in CoNLL BIO format, the main effort in using NERDS is converting it to CoNLL BIO format. Once that is done, it is relatively easy to ingest it into a data and label structure, as shown below.

from nerds.utils import load_data_and_labels

data, labels = load_data_and_labels("nerds/test/data/example.iob")
print("data:", data)
print("labels:", labels)

yields the following output.

data: [
  ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov', '.', '29', '.'], 
  ['Mr', '.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N', '.', 'V', '.', ',', 'the', 'Dutch', 'publishing', 'group', '.']
]
labels: [
  ['B-PER', 'I-PER', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'O'], 
  ['B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'B-NORP', 'O', 'O', 'O']
]
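When the source data is not already in CoNLL BIO format, the conversion step mentioned above often comes down to turning entity spans into one tag per token. The following is a minimal sketch of that idea. It assumes annotations arrive as (start, end, label) triples over token indices with an exclusive end, which is a hypothetical input format for illustration and not one defined by NERDS; spans_to_bio is likewise a hypothetical helper.

```python
def spans_to_bio(tokens, spans):
    """Convert token-index entity spans into one BIO tag per token.

    spans: iterable of (start, end, label) with end exclusive.
    This span format is an assumption for illustration only.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        # The first token of an entity gets B-, the rest get I-.
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


tokens = ["Mr", ".", "Vinken", "is", "chairman", "of",
          "Elsevier", "N", ".", "V", "."]
spans = [(0, 3, "PER"), (6, 11, "ORG")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

The resulting tags match the corresponding labels in the example output above. The exact file layout that load_data_and_labels expects (column order and separator) should be checked against nerds/test/data/example.iob before writing such pairs to disk.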

Extending the base model class

The basic class that every model needs to extend is the NERModel class in the nerds.models package. The model class implements a fit / predict API, similar to scikit-learn. To implement a new model, one must override the following methods at minimum:

  • fit(X, y): Trains a model given a list of lists of tokens X and a list of lists of BIO tags y.
  • predict(X): Returns a list of lists of BIO tags, given a list of lists of tokens X.
  • save(dirpath): Saves model to directory given by dirpath.
  • load(dirpath): Retrieves model from directory given by dirpath.
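To make the four methods concrete, here is a minimal sketch of a toy model that tags each token with the tag it carried most often during training. It is written without importing NERDS so that it runs on its own; in the project, the class would extend nerds.models.NERModel. The class name LookupNER, the file name lookup.json, and the choice to return self from fit and load are assumptions made for this example, not details taken from the NERDS code.

```python
import json
import os
import tempfile
from collections import Counter, defaultdict


class LookupNER:
    """Toy tagger: each token gets the tag it had most often in training.

    Hypothetical example. In NERDS this would subclass
    nerds.models.NERModel; the base class is omitted so the sketch
    runs without NERDS installed.
    """

    def __init__(self):
        self.tag_by_token = {}

    def fit(self, X, y):
        # Count how often each token appears with each tag.
        counts = defaultdict(Counter)
        for tokens, tags in zip(X, y):
            for token, tag in zip(tokens, tags):
                counts[token][tag] += 1
        self.tag_by_token = {
            token: tag_counts.most_common(1)[0][0]
            for token, tag_counts in counts.items()
        }
        return self

    def predict(self, X):
        # Tokens never seen in training fall back to "O".
        return [[self.tag_by_token.get(token, "O") for token in tokens]
                for tokens in X]

    def save(self, dirpath):
        os.makedirs(dirpath, exist_ok=True)
        path = os.path.join(dirpath, "lookup.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.tag_by_token, f)

    def load(self, dirpath):
        path = os.path.join(dirpath, "lookup.json")
        with open(path, encoding="utf-8") as f:
            self.tag_by_token = json.load(f)
        return self


X = [["Mr", ".", "Vinken", "is", "chairman"]]
y = [["B-PER", "I-PER", "I-PER", "O", "O"]]

model = LookupNER().fit(X, y)

# Round-trip through save/load to show the state survives.
with tempfile.TemporaryDirectory() as dirpath:
    model.save(dirpath)
    restored = LookupNER().load(dirpath)

predicted = restored.predict([["Mr", ".", "Vinken", "resigned"]])
print(predicted)
```

A per-token lookup ignores context and can emit invalid BIO sequences (for example an I- tag with no preceding B- tag), so it is only useful for showing the shape of the API, not as a real model.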

As a best practice, I like to implement a single NER model (or group of related NER models) as a single file in the models folder, but have it be accessible from client code directly as nerds.models.CustomNER. You can set this redirection up in nerds/models/__init__.py.

Running experiments

There are two examples of running experiments using NERDS. We will continue to update these examples as more functionality becomes available.

Contributing to the project

New models and input adapters are always welcome. Please make sure your code is well-documented and readable. Before creating a pull request make sure:

  • make test shows that all the unit tests pass.
  • make lint shows no Python code violations.

The CONTRIBUTING.md file lists the contributors to the upstream NERDS (elsevierlabs-os/nerds) project.

Changes / Improvements in this Fork

The CHANGES.md file lists the changes and improvements that were made in this fork.

Talks and Blogs

About

License: BSD 3-Clause "New" or "Revised" License

