span-selection-pretraining

Code to create pre-training data for a span selection pre-training task, inspired by reading comprehension and by an effort to avoid encoding general knowledge in the transformer network itself.

Pre-trained Models

Available through Hugging Face as:

  • michaelrglass/bert-base-uncased-sspt
  • michaelrglass/bert-large-uncased-sspt

Load with AutoConfig.from_pretrained, AutoTokenizer.from_pretrained, and AutoModelForQuestionAnswering.from_pretrained. See run_qa.py for example code.
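
For instance, a minimal sketch of loading the base model and extracting an answer span with the transformers library; the question/context pair here is purely illustrative, and run_qa.py remains the reference usage:

import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForQuestionAnswering

# Load the SSPT model published on the Hugging Face hub
model_name = "michaelrglass/bert-base-uncased-sspt"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name, config=config)

# Encode an illustrative question/context pair and score span boundaries
inputs = tokenizer("Who wrote Hamlet?",
                   "Hamlet is a tragedy written by William Shakespeare.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring start and end positions and decode the span
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)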

Installation

  • python setup.py install
  • build irsimple.jar (or use the pre-built com.ibm.research.ai.irsimple/irsimple.jar)

Data Generation

  • Download a Wikipedia dump and WikiExtractor
    • IBM is not granting a license to any third-party data set. You are responsible for complying with all third-party licenses, if any exist.
python WikiExtractor.py --json --filter_disambig_pages --processes 32 --output wikiextracteddir enwiki-20190801-pages-articles-multistream.xml.bz2
  • Run create_passages.py (this simply splits each extracted article into passages at double newlines; see the sketch after this list)
python create_passages.py --wikiextracted wikiextracteddir --output wikipassagesdir
  • Run Lucene indexing
java -cp irsimple.jar com.ibm.research.ai.irsimple.MakeIndex wikipassagesdir wikipassagesindex
  • Run sspt_gen.sh
nohup bash sspt_gen.sh ssptGen wikipassagesdir > querygen.log 2>&1 &
  • Run AsyncWriter
nohup java -cp irsimple.jar com.ibm.research.ai.irsimple.AsyncWriter \
  ssptGen \
  wikipassagesindex > instgen.log 2>&1 &
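
For reference, the passage-splitting step above amounts to roughly the following. This is a minimal sketch, not the repo's create_passages.py: it assumes WikiExtractor's --json output layout (nested directories of files, one JSON article per line, each with a "text" field).

import argparse, json, os

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--wikiextracted", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    os.makedirs(args.output, exist_ok=True)
    # WikiExtractor writes files like AA/wiki_00; walk all of them
    for root, _, files in os.walk(args.wikiextracted):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), args.wikiextracted)
            out_path = os.path.join(args.output, rel.replace(os.sep, "_") + ".txt")
            with open(os.path.join(root, name), encoding="utf-8") as f, \
                 open(out_path, "w", encoding="utf-8") as out:
                for line in f:
                    article = json.loads(line)
                    # split each article into passages at blank lines
                    for passage in article["text"].split("\n\n"):
                        passage = " ".join(passage.split())
                        if passage:
                            out.write(passage + "\n")

if __name__ == "__main__":
    main()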

Training

FIXME: rc_data and span_selection_pretraining require a modified version of pytorch-transformers. The needed adaptations are being worked into this repo and into a pull request for pytorch-transformers. Hopefully it is relatively clear how this should work.

python span_selection_pretraining.py \
  --bert_model bert-base-uncased \
  --train_dir ssptGen \
  --num_instances 1000000 \
  --save_model rc_1M_base.bin
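
Once pre-training finishes, the saved weights can initialize a model for fine-tuning. A minimal sketch, assuming (not confirmed by this README) that --save_model writes a plain PyTorch state dict compatible with a BERT QA model:

import torch
from transformers import BertForQuestionAnswering

# Assumption: rc_1M_base.bin holds a PyTorch state dict for a BERT QA model
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
state_dict = torch.load("rc_1M_base.bin", map_location="cpu")
# strict=False tolerates any key-naming differences from the modified
# pytorch-transformers fork mentioned above
model.load_state_dict(state_dict, strict=False)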

About

License: Apache License 2.0

