Setting up repository

  1. Set up virtual environment.

    python3.10 -m venv .venv

  2. Activate your environment.

    .\.venv\scripts\activate # Windows
    source .venv/bin/activate # macOS or Linux

  3. Install PyTorch using the instructions on this site. Choose the stable build, the pip package, the Python language, and the default compute platform.

  4. Install other packages.

    pip install -r requirements.txt
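
After these steps, a quick sanity check can confirm that the interpreter is running inside the virtual environment and that PyTorch can be found. This is a minimal sketch; the helper names `in_virtualenv` and `has_package` are illustrative and not part of the repository.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def has_package(name: str) -> bool:
    """Return True when a top-level package can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("torch installed:", has_package("torch"))
```

If either line reports `False`, re-run the activation or installation step before continuing.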


For training, download the corpus dataset and place it in the data/ folder. Then run the preprocessing.py script.

The following command can be used to train a model with the same parameters as PolyLM_BASE:

python train.py --model_dir=models/ --corpus_path=data/bookcorpus/books_large_p1.txt --vocab_path=data/bookcorpus/vocab.txt --embedding_size=256 --bert_intermediate_size=1024 --n_disambiguation_layers=4 --n_prediction_layers=12 --max_senses_per_word=8 --min_occurrences_for_vocab=500 --min_occurrences_for_polysemy=20000 --max_seq_len=128 --gpus=0 --batch_size=32 --n_batches=6000000 --dl_warmup_steps=2000000 --ml_warmup_steps=1000000 --dl_r=1.5 --ml_coeff=0.1 --learning_rate=0.00003 --print_every=100 --save_every=10000
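
Because the command is long, it may help to keep the hyperparameters in one place and assemble the flags programmatically. The sketch below builds the same argument list from a dictionary. `build_train_command` and `POLYLM_BASE_PARAMS` are hypothetical names, not part of the repository; the flag names and values are copied from the command above.

```python
import shlex
import sys

# Hyperparameters copied from the training command above.
# Values are kept as strings so they appear exactly as written
# (for example, 0.00003 rather than 3e-05).
POLYLM_BASE_PARAMS = {
    "model_dir": "models/",
    "corpus_path": "data/bookcorpus/books_large_p1.txt",
    "vocab_path": "data/bookcorpus/vocab.txt",
    "embedding_size": "256",
    "bert_intermediate_size": "1024",
    "n_disambiguation_layers": "4",
    "n_prediction_layers": "12",
    "max_senses_per_word": "8",
    "min_occurrences_for_vocab": "500",
    "min_occurrences_for_polysemy": "20000",
    "max_seq_len": "128",
    "gpus": "0",
    "batch_size": "32",
    "n_batches": "6000000",
    "dl_warmup_steps": "2000000",
    "ml_warmup_steps": "1000000",
    "dl_r": "1.5",
    "ml_coeff": "0.1",
    "learning_rate": "0.00003",
    "print_every": "100",
    "save_every": "10000",
}


def build_train_command(params, script="train.py"):
    """Return the argv list for a training run (hypothetical helper)."""
    flags = [f"--{name}={value}" for name, value in params.items()]
    return [sys.executable, script, *flags]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to launch it.
    print(shlex.join(build_train_command(POLYLM_BASE_PARAMS)))
```

To try a variation, copy the dictionary and override individual entries, for example `{**POLYLM_BASE_PARAMS, "batch_size": "16"}`.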


Training on OSCAR

  1. Connect to OSCAR through SSH.

    ssh <username>@ssh.ccv.brown.edu

    Note that Windows users need an SSH client like PuTTY. More details here.

  2. Drag the polylm folder into the OSCAR filesystem using SMB.

  3. cd to the polylm folder and activate the virtual environment.

    source .venv/bin/activate

  4. Request resources from OSCAR using either interact or batch mode.


    This requests an interactive session with 20 cores at 10GB per core for 1 hour. Note that you must stay connected to the login node.

    interact -n 20 -t 01:00:00 -m 10g
    python train.py --model_dir=models/ --corpus_path=data/bookcorpus/books_large_p1.txt --vocab_path=data/bookcorpus/vocab.txt --embedding_size=256 --bert_intermediate_size=1024 --n_disambiguation_layers=4 --n_prediction_layers=12 --max_senses_per_word=8 --min_occurrences_for_vocab=500 --min_occurrences_for_polysemy=20000 --max_seq_len=128 --gpus=0 --batch_size=32 --n_batches=6000000 --dl_warmup_steps=2000000 --ml_warmup_steps=1000000 --dl_r=1.5 --ml_coeff=0.1 --learning_rate=0.00003 --print_every=100 --save_every=100


    This requests a batch job with 1 core and 4 GB of memory per core for 1 hour.

    sbatch oscar_batch.sh

View your active jobs by running myq. In batch mode, you can view the output of the job in the file slurm-<jobid>.out in the directory where you invoked the sbatch command.
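
To follow a batch job without opening the file by hand, a small script can locate the newest `slurm-<jobid>.out` file and print its last lines. This is a minimal sketch, assuming it is run from the directory where `sbatch` was invoked; `latest_slurm_log` and `tail` are illustrative helpers, not part of the repository.

```python
from pathlib import Path


def latest_slurm_log(directory="."):
    """Return the most recently modified slurm-*.out file, or None if there is none."""
    logs = list(Path(directory).glob("slurm-*.out"))
    return max(logs, key=lambda p: p.stat().st_mtime, default=None)


def tail(path, n_lines=20):
    """Return the last n_lines lines of a text file as a list of strings."""
    if n_lines <= 0:
        return []
    lines = Path(path).read_text(errors="replace").splitlines()
    return lines[-n_lines:]


if __name__ == "__main__":
    log = latest_slurm_log()
    if log is None:
        print("No slurm-<jobid>.out file found in this directory.")
    else:
        print(f"== {log} ==")
        print("\n".join(tail(log)))
```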


To obtain pretrained models, use the download scripts provided in the models folder.

cd models

First download the SemEval 2010 WSI datasets:

cd data
cd ..

Download NLTK's WordNet data:

python -c "import nltk; nltk.download('wordnet')"

Download Stanford CoreNLP's part-of-speech tagger v3.9.2 and place the extracted folder in the repository root. The tagger is required for lemmatization when evaluating on WSI.

PolyLM evaluation can be performed as follows:

./wsi.sh data/wsi/SemEval-2010 SemEval-2010 ./models/polylm-lemmatized-large --gpus 0 --pos_tagger_root ./stanford-postagger-2018-10-16

Note that inference currently supports only a single GPU, but it is generally very fast.
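
Before launching the evaluation, it can be useful to confirm that everything the command refers to is in place. The sketch below checks for the paths used in the command above, relative to the repository root. `missing_paths` and `REQUIRED_PATHS` are hypothetical names, and the check only tests that the paths exist, not that their contents are complete.

```python
from pathlib import Path

# Paths taken from the evaluation command above, relative to the repository root.
REQUIRED_PATHS = [
    "wsi.sh",
    "data/wsi/SemEval-2010",
    "models/polylm-lemmatized-large",
    "stanford-postagger-2018-10-16",
]


def missing_paths(root=".", required=REQUIRED_PATHS):
    """Return the entries of `required` that do not exist under `root`."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing before evaluation can run:")
        for p in missing:
            print("  -", p)
    else:
        print("All evaluation prerequisites found.")
```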



