qiaochloe / polylm

PolyLM

Setting up the repository

  1. Set up virtual environment.

    python3.10 -m venv .venv 
  2. Activate your environment

    .\.venv\Scripts\activate   # Windows
    source .venv/bin/activate  # macOS/Linux
  3. Install PyTorch using the instructions on the PyTorch website. On the install selector, choose the Stable build, the Pip package, the Python language, and the default compute platform. (A quick sanity check for this step appears after this list.)

  4. Install other packages.

    pip install -r requirements.txt
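
To confirm the environment from steps 3 and 4 is working, you can check that PyTorch imports and whether a GPU is visible (this prints False on CPU-only machines):

    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"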

Training

For training, download the corpus dataset (the commands in this section assume BookCorpus) and place it in the data/ folder. Then run the preprocessing.py script, for example as sketched below.
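
For example, assuming the BookCorpus text file has been obtained separately (the exact arguments preprocessing.py expects are defined in the script itself, so check it before running):

mkdir -p data/bookcorpus
# copy books_large_p1.txt into data/bookcorpus/ first, then build the vocabulary
python preprocessing.py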

The following command can be used to train a model with the same parameters as PolyLM-BASE:

python train.py --model_dir=models/ --corpus_path=data/bookcorpus/books_large_p1.txt --vocab_path=data/bookcorpus/vocab.txt --embedding_size=256 --bert_intermediate_size=1024 --n_disambiguation_layers=4 --n_prediction_layers=12 --max_senses_per_word=8 --min_occurrences_for_vocab=500 --min_occurrences_for_polysemy=20000 --max_seq_len=128 --gpus=0 --batch_size=32 --n_batches=6000000 --dl_warmup_steps=2000000 --ml_warmup_steps=1000000 --dl_r=1.5 --ml_coeff=0.1 --learning_rate=0.00003 --print_every=100 --save_every=10000

Using OSCAR

  1. Connect to OSCAR through SSH.

    ssh <username>@ssh.ccv.brown.edu
    

    Note that Windows users need an SSH client such as PuTTY; see the OSCAR documentation for more details.

  2. Drag the polylm folder into the OSCAR filesystem using an SMB mount.

  3. cd into the polylm folder and activate the virtual environment.

    source .venv/bin/activate 
    
  4. Request resources from OSCAR using either interact or batch mode.

    Interact

    This requests an interactive session with 20 cores at 10GB per core for 1 hour. Note that you must stay connected to the login node.

    interact -n 20 -t 01:00:00 -m 10g
    python train.py --model_dir=models/ --corpus_path=data/bookcorpus/books_large_p1.txt --vocab_path=data/bookcorpus/vocab.txt --embedding_size=256 --bert_intermediate_size=1024 --n_disambiguation_layers=4 --n_prediction_layers=12 --max_senses_per_word=8 --min_occurrences_for_vocab=500 --min_occurrences_for_polysemy=20000 --max_seq_len=128 --gpus=0 --batch_size=32 --n_batches=6000000 --dl_warmup_steps=2000000 --ml_warmup_steps=1000000 --dl_r=1.5 --ml_coeff=0.1 --learning_rate=0.00003 --print_every=100 --save_every=100

    Batch

    This requests a batch job with 1 core and 4GB of memory for 1 hour (a sketch of such a script is shown further below).

    sbatch oscar_batch.sh

View your active jobs by running myq. In batch mode, you can view the output of the job in the file slurm-<jobid>.out in the directory where you invoked the sbatch command.
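
The batch script itself is provided in the repository as oscar_batch.sh; if you need to adapt it, a minimal sketch of a SLURM script along these lines (illustrative only, not the repository's actual file) would match the resources described above:

    #!/bin/bash
    # Illustrative SLURM batch script: 1 core, 4GB of memory, 1-hour time limit.
    #SBATCH -n 1
    #SBATCH --mem=4G
    #SBATCH -t 01:00:00

    source .venv/bin/activate
    python train.py --model_dir=models/ --corpus_path=data/bookcorpus/books_large_p1.txt \
        --vocab_path=data/bookcorpus/vocab.txt --embedding_size=256 --bert_intermediate_size=1024 \
        --n_disambiguation_layers=4 --n_prediction_layers=12 --max_senses_per_word=8 \
        --min_occurrences_for_vocab=500 --min_occurrences_for_polysemy=20000 --max_seq_len=128 \
        --gpus=0 --batch_size=32 --n_batches=6000000 --dl_warmup_steps=2000000 \
        --ml_warmup_steps=1000000 --dl_r=1.5 --ml_coeff=0.1 --learning_rate=0.00003 \
        --print_every=100 --save_every=10000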

Testing

Pretrained models can be downloaded using the scripts provided in the models folder:

cd models
./download-lemmatized-large.sh
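
After the script finishes, return to the repository root and check that the downloaded model folder matches the path expected by the evaluation command further below (the folder name here is assumed from that command):

cd ..
ls models/polylm-lemmatized-large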

Next, download the SemEval 2010 WSI datasets:

cd data
./download-wsi.sh
cd ..

Download NLTK's WordNet data:

python -c "import nltk; nltk.download('wordnet')"

Download Stanford CoreNLP's part-of-speech tagger v3.9.2 and place the extracted folder in the repository root; it is required for lemmatization when evaluating on WSI.
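
For example (the download URL below is an assumption; check the Stanford NLP tagger release page for the v3.9.2 archive):

wget https://nlp.stanford.edu/software/stanford-postagger-2018-10-16.zip
unzip stanford-postagger-2018-10-16.zip   # extracts to ./stanford-postagger-2018-10-16/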

PolyLM evaluation can be performed as follows:

./wsi.sh data/wsi/SemEval-2010 SemEval-2010 ./models/polylm-lemmatized-large --gpus 0 --pos_tagger_root ./stanford-postagger-2018-10-16

Note that inference is only supported on a single GPU currently, but is generally very fast.
