span-selection-pretraining

Code to create pre-training data for a span selection pre-training task, inspired by reading comprehension and by an effort to avoid encoding general knowledge in the transformer network itself.

Pre-trained Models

Available through Hugging Face as:

  • michaelrglass/bert-base-uncased-sspt
  • michaelrglass/bert-large-uncased-sspt

Load with AutoConfig.from_pretrained, AutoTokenizer.from_pretrained, and AutoModelForQuestionAnswering.from_pretrained. See run_qa.py for example code.
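
For instance, a minimal sketch of loading the base model and extracting an answer span with the transformers library; the question/context pair here is purely illustrative, and run_qa.py remains the reference usage:

import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForQuestionAnswering

# Load the SSPT model published on the Hugging Face hub
model_name = "michaelrglass/bert-base-uncased-sspt"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name, config=config)

# Encode an illustrative question/context pair and score span boundaries
inputs = tokenizer("Who wrote Hamlet?",
                   "Hamlet is a tragedy written by William Shakespeare.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring start and end positions and decode the span
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)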

Installation

  • python setup.py install
  • build irsimple.jar (or use the pre-built com.ibm.research.ai.irsimple/irsimple.jar)

Data Generation

  • Download a Wikipedia dump and WikiExtractor
    • IBM is not granting a license to any third-party data set. You are responsible for complying with all third-party licenses, if any exist.
python WikiExtractor.py --json --filter_disambig_pages --processes 32 --output wikiextracteddir enwiki-20190801-pages-articles-multistream.xml.bz2
  • Run create_passages.py (this simply splits each extracted article into passages at double newlines; see the sketch after this list)
python create_passages.py --wikiextracted wikiextracteddir --output wikipassagesdir
  • Run Lucene indexing
java -cp irsimple.jar com.ibm.research.ai.irsimple.MakeIndex wikipassagesdir wikipassagesindex
  • Run sspt_gen.sh
nohup bash sspt_gen.sh ssptGen wikipassagesdir > querygen.log 2>&1 &
  • Run AsyncWriter
nohup java -cp irsimple.jar com.ibm.research.ai.irsimple.AsyncWriter \
  ssptGen \
  wikipassagesindex > instgen.log 2>&1 &
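
For reference, the passage-splitting step above amounts to roughly the following. This is a minimal sketch, not the repo's create_passages.py: it assumes WikiExtractor's --json output layout (nested directories of files, one JSON article per line, each with a "text" field).

import argparse, json, os

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--wikiextracted", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    os.makedirs(args.output, exist_ok=True)
    # WikiExtractor writes files like AA/wiki_00; walk all of them
    for root, _, files in os.walk(args.wikiextracted):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), args.wikiextracted)
            out_path = os.path.join(args.output, rel.replace(os.sep, "_") + ".txt")
            with open(os.path.join(root, name), encoding="utf-8") as f, \
                 open(out_path, "w", encoding="utf-8") as out:
                for line in f:
                    article = json.loads(line)
                    # split each article into passages at blank lines
                    for passage in article["text"].split("\n\n"):
                        passage = " ".join(passage.split())
                        if passage:
                            out.write(passage + "\n")

if __name__ == "__main__":
    main()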

Training

FIXME: rc_data and span_selection_pretraining require a modified version of pytorch-transformers. The needed adaptations are being worked into this repo and into a pull request for pytorch-transformers. Hopefully it is relatively clear how this should work.

python span_selection_pretraining.py \
  --bert_model bert-base-uncased \
  --train_dir ssptGen \
  --num_instances 1000000 \
  --save_model rc_1M_base.bin
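
Once pre-training finishes, the saved weights can initialize a model for fine-tuning. A minimal sketch, assuming (not confirmed by this README) that --save_model writes a plain PyTorch state dict compatible with a BERT QA model:

import torch
from transformers import BertForQuestionAnswering

# Assumption: rc_1M_base.bin holds a PyTorch state dict for a BERT QA model
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
state_dict = torch.load("rc_1M_base.bin", map_location="cpu")
# strict=False tolerates any key-naming differences from the modified
# pytorch-transformers fork mentioned above
model.load_state_dict(state_dict, strict=False)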

About

License: Apache License 2.0

