This project is about our work on Hypernym Discovery.
The task and dataset details can be found here: https://competitions.codalab.org/competitions/17119.
The code for box embedding has been adapted from: https://github.com/ralphabb/BoxE.
Clone this repo to your machine.
Run the following command:
cd HyperBox
Install dependencies from requirements.txt.
Copy the contents of the "dataset" folder from here to the dataset folder of the HyperBox repo.
Unzip both the files and rename them to "2A_med_pubmed_tokenized.txt" and "2B_music_bioreviews_tokenized.txt" respectively.
Edit the configuration file hparams.ini to indicate if you want to train for music or medical dataset.
You can also edit hyper-parameters for training your Gensim model and Box-Embedding model in hparams.ini.
You can download the pre-trained Gensim word embeddings from the above link. Download "gensim_model" and paste it into the HyperBox repo. Download the "word_vectors_processed" folder and paste it into the HyperBox repo.
Run the following command from the root of the HyperBox repo:
mkdir gensim_model
python script/preprocess/gensim_word2vec.py
mkdir word_vectors_processed
python script/preprocess/save_embeddings.py
After training the word embeddings model, you can train the HyperBox model by running the following command:
python script/model/train.py
You can also directly download the pre-trained model from here.
Copy the contents of "pretrained_model" to "script/model/".
If you have trained from scratch, run the following command:
python script/model/validate.py [epoch]
E.g If you want to predict for model saved after epoch 1200 run:
python script/model/validate.py 1200
For pre-trained models, the epoch value should be 0.
python script/model/validate.py 0
If you have trained from scratch, run the following command:
python script/model/predict.py [epoch]
where epoch value can be found out from the best performing model obtained by validate.py.
For pre-trained models, the epoch value should be 0.
python script/model/predict.py 0
The code to find metrics is already provided by the organizers of SemEval2018-Task9.
For music corpus, Run:
python script/model/task9-scorer.py script/model/output/music_gold.txt script/model/output/music_pred.txt
For medical corpus, Run:
python script/model/task9-scorer.py script/model/output/medical_gold.txt script/model/output/medical_pred.txt
If you make use of this code, or its accompanying paper, please cite this work as follows:
Accepted at he 13th International Conference on Language Resources and Evaluation (LREC 2022).
To be added later on.
Licensed under the MIT License.
Copyright (C) 2021 Maulik Parmar