spallas / wsd

A 78.5% word sense disambiguator based on Transformers and RoBERTa (PyTorch)

Neural Word Sense Disambiguation integrating synonyms in WordNet synsets

Intro

In this work we present a Word Sense Disambiguation (WSD) engine that integrates a Transformer-based neural architecture with the knowledge present in WordNet, the resource from which the sense inventory is taken.

Model

The architecture is composed of contextualized embeddings plus a Transformer on top with a final dense layer.

All available models use RoBERTa embeddings as their base:

  • rdense, with an encoder made of only two dense layers.
  • rtransform, with a Transformer encoder.
  • wsddense, with a two-dense-layer encoder plus an advanced lemma prediction net.
  • wsdnetx, same as wsddense but with a Transformer encoder.
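The encoder choices listed above can be sketched in PyTorch as follows. This is a minimal illustration and not the repository's actual code: the class name `WSDHead`, its parameters, the layer sizes, and the use of random tensors in place of RoBERTa embeddings are all assumptions made for the example.

```python
import torch
from torch import nn


class WSDHead(nn.Module):
    """Hypothetical sketch: encoder over contextual embeddings + final dense layer."""

    def __init__(self, embed_dim, num_senses, encoder="transformer",
                 hidden_dim=64, num_heads=4, num_layers=2):
        super().__init__()
        if encoder == "transformer":
            # Transformer encoder on top of the embeddings (rtransform-like).
            layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads,
                dim_feedforward=hidden_dim, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            out_dim = embed_dim
        elif encoder == "dense":
            # Two dense layers applied to each token (rdense-like).
            self.encoder = nn.Sequential(
                nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
            out_dim = hidden_dim
        else:
            raise ValueError(f"unknown encoder type: {encoder}")
        # Final dense layer producing one score per sense for every token.
        self.classifier = nn.Linear(out_dim, num_senses)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim), e.g. features from RoBERTa
        h = self.encoder(embeddings)   # final hidden states
        return self.classifier(h)      # (batch, seq_len, num_senses)


# Random stand-in for contextualized embeddings (assumption: 32-dim, 5 tokens).
embeddings = torch.randn(2, 5, 32)
transformer_logits = WSDHead(embed_dim=32, num_senses=10)(embeddings)
dense_logits = WSDHead(embed_dim=32, num_senses=10, encoder="dense")(embeddings)
```

Both variants return one score per candidate sense for every token, so the encoder can be swapped without changing the rest of the pipeline.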

The advanced net can be represented as:

[Figure: arch]

where h is the final hidden state of the encoder. The |S|x|V| matrix is built as follows:

[Figure: sv-matrix]
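Since the figure is not reproduced here, the following is only a guess at what an |S|x|V| matrix could look like: a binary matrix with one row per synset and one column per lemma, where an entry is 1 when the lemma is a synonym listed in that synset. The function name `build_sv_matrix`, the toy synset identifiers, and the binary encoding are assumptions for illustration and may differ from the actual construction.

```python
def build_sv_matrix(synset_lemmas, vocab):
    """Return an |S| x |V| list-of-lists; entry [i][j] is 1 if lemma j is in synset i."""
    lemma_index = {lemma: j for j, lemma in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in synset_lemmas]
    for i, lemmas in enumerate(synset_lemmas.values()):
        for lemma in lemmas:
            j = lemma_index.get(lemma)
            if j is not None:  # ignore lemmas outside the vocabulary
                matrix[i][j] = 1
    return matrix


# Toy sense inventory (illustrative only, not checked against WordNet).
synset_lemmas = {
    "car.n.01": ["car", "auto", "automobile"],
    "cable_car.n.01": ["cable_car", "car"],
    "bank.n.01": ["bank"],
}
vocab = sorted({lemma for lemmas in synset_lemmas.values() for lemma in lemmas})
sv_matrix = build_sv_matrix(synset_lemmas, vocab)

for synset, row in zip(synset_lemmas, sv_matrix):
    print(synset, row)
```

In this toy setup the column for "car" is set in two rows, which shows how a single lemma can link several candidate senses.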

Training data

As training data we use both SemCor and the WordNet Gloss Corpus.

Environment setup

git clone http://github.com/spallas/wsd.git

cd wsd/ || return 1

tmux new -s train

python -c "import torch; print(torch.__version__)"

source setup.sh

Unzip into the res/ folder the pre-processed training and test data that you can download here. Also unzip into res/ the dictionaries data that you can download here.

Further Details

Please refer to the wiki page in this repository for further details about the implementation.

Notes: RoBERTa installation

# Download roberta.large model
cd res/
wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
tar -xzvf roberta.large.tar.gz
# Load the model in fairseq (Python; the path below is relative to the repository root)
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('res/roberta.large', checkpoint_file='model.pt')
roberta.eval()  # disable dropout (or leave in train mode to finetune)

License: MIT License

