bioNER

In this repository, the files for bioNER project are stored.

bioNER Project Description

Anastasia Nikiforova and Marina Pozhidaeva

Clinical trials often require detailed patient information, documented in clinical descriptions. Named object recognition (NER) is a fundamental natural language processing (NLP) task for extracting objects of interest (for example, disease names, drug names, and laboratory tests) from clinical notes, to further support and develop clinical and applied research.

The dataset consists of clinical notes annotated with BIO markup (beginning-inside-outside). It is available for download from Kaggle: https://www.kaggle.com/rsnayak/hackathon-disease-extraction-saving-lives-with-ai.

The task is to extract all the names of diseases from a given set of 20,000 paragraphs / documents in a test set, training the model on a labeled training dataset of 30,000 documents.

Data Description

Variable Definition

id Unique ID for a token/word

Doc_ID Unique ID for a Document/Paragraph

Sent_ID Unique ID for a Sentence

Word Exact word/token

tag (Target) Named Entity Tag

Models description

The word embedding (pre-trained fasttext in the base line) will be fed to the input, and at the output we will get its tag (BIO).

One of the most common models for NER, bi-LSTM + CRF, shows the f1-score = 0.86 on the general-domain package. As part of this task, we will strive for this mark, however, since our data is limited by bio-domain, the result may vary.

Models for download

LSTM + CRF model: https://drive.google.com/open?id=1Ij-mMy8EWQ0aphCFZpVUnmGcfHckZQb6

LSTM on pretrained embeddings: https://drive.google.com/open?id=1BwRoTNKqTkLAp3NF-wDeHX-esu-zYYLY

Filtered pretrained embeddings: https://drive.google.com/open?id=1Pm7IolxWa5juZM5RcUvN9R6SC6GKq7Ma

Sentences tagged with UDPipe in CONLLu format: https://drive.google.com/open?id=1-68F34nTyLz-SetqBMcsdqtn0n7NYjBl

steysie / bioNER

bioNER

bioNER Project Description

Data Description

Models description

Models for download

About

Languages