maciek-pioro / poleval-2021

PolEval 2021 submissions

This repository contains the winning submissions for Task 3: Post-correction of OCR results and Task 4: Question answering. Both submissions rely on fine-tuning the mT5 model on the respective task.

Solution details are described in the workshop proceedings.


Task 3 results

| model    | dev-0  | test-A | test-B |
|----------|--------|--------|--------|
| original | 16.550 | 16.527 | 16.543 |
| base     | 4.678  | 4.792  | 4.796  |
| large    | 4.418  | 4.515  | 4.559  |
| XXL      | 3.604  | 3.725  | 3.744  |

Task 4 results

| model | test-B |
|-------|--------|
| base  | 52.12  |
| large | 59.20  |
| XXL   | 71.68  |

Data preparation

Setup

Common steps for both tasks

  1. Install pip requirements
pip install -r requirements.txt
  2. Download the mT5 vocabulary to the repository root
gsutil cp gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model .
  3. Prepare a GCS bucket for storing training datasets: https://cloud.google.com/storage/docs/creating-buckets
  4. Update gs_base_path in config/config.yaml
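As a quick sanity check for step 2, the downloaded vocabulary can be loaded with the sentencepiece Python package. This is only a hedged sketch, not part of the original instructions; it assumes the sentencepiece package is available in your environment.

```python
import sentencepiece as spm

# Load the mT5 vocabulary downloaded in step 2 and report its size.
# For mc4.250000.100extra this should come out to 250,000 pieces plus 100 extra ids.
sp = spm.SentencePieceProcessor()
sp.Load("sentencepiece.model")
print(sp.GetPieceSize())
```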

OCR correction

The provided data contains pages of text that are in many instances longer than the maximum sequence length allowed by the model architecture. To address this, training examples are created by aligning and splitting longer input/output pairs.
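The actual alignment and splitting is done by data_preparation/ocr_correction/split_text.py; the snippet below is only a simplified illustration of the idea, assuming both sides are split on whitespace (the real code also has to keep differing token counts aligned).

```python
def split_aligned_pair(ocr_text: str, corrected_text: str, length_limit: int = 384):
    """Cut an aligned (OCR, corrected) page pair into chunks that fit the model.

    Illustrative only: tokens are approximated by whitespace splitting and the
    two sides are assumed to stay aligned chunk by chunk.
    """
    src, tgt = ocr_text.split(), corrected_text.split()
    chunks = []
    for start in range(0, max(len(src), len(tgt)), length_limit):
        chunks.append((
            " ".join(src[start:start + length_limit]),
            " ".join(tgt[start:start + length_limit]),
        ))
    return chunks
```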

  1. Pull the task repository
git clone -b secret https://github.com/poleval/2021-ocr-correction.git
  2. Split examples into chunks to match the maximum sequence length
python3 -m data_preparation.ocr_correction.split_text \
  2021-ocr-correction \
  --length-limit 384
  3. Upload the files to the created bucket and update or match the paths in config/task/ocr_correction.yaml. Keep the .index files to restore the full text from predictions

Question answering

For question answering, the model input prompt consists of the question and context passages retrieved from Wikipedia. This section shows how to reproduce the data used in the submission.

The prepared data is available here. Skip to step 5 if using this dataset.
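The exact prompt template is defined by the data preparation scripts; as a rough illustration, an input prompt might be assembled along the following lines. The separator and length budget here are hypothetical, not the submission's actual format.

```python
from typing import List


def build_prompt(question: str, passages: List[str], max_chars: int = 4000) -> str:
    """Concatenate a question with retrieved Wikipedia passages into one prompt."""
    prompt = question
    for passage in passages:
        if len(prompt) + len(passage) + 1 > max_chars:
            break  # keep the prompt within a rough length budget
        prompt += " " + passage
    return prompt
```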

  1. Pull the task repository
git clone -b secret https://github.com/poleval/2021-question-answering.git
  2. Start a local Elasticsearch instance using Docker (skip if using an existing cluster)
docker volume create poleval-es # recommended for persistence
docker run \
  -p 9200:9200 \
  -p 9300:9300 \
  -v poleval-es:/usr/share/elasticsearch/data \
  -e "discovery.type=single-node" \
  docker.elastic.co/elasticsearch/elasticsearch:7.13.4
  3. Download the spaCy model
python -m spacy download pl_core_news_md
  4. Index and retrieve context passages for the Polish QA dataset (see the retrieval sketch after this list)
python3 -m data_preparation.question_answering.quiz_pl \
  2021-question-answering \
  wiki_passages_pl
  5. Index and retrieve context passages for the TriviaQA dataset
python3 -m data_preparation.question_answering.trivia_qa wiki_passages_en
  6. Select only the questions for prediction
cat test-B-input-510.tsv | cut -f1 > test-B-questions-510.tsv
  7. Upload the files to the created bucket and update or match the paths in config/task/question_answering.yaml
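The indexing and retrieval in steps 4 and 5 are handled by the data_preparation.question_answering modules. For orientation only, querying the local Elasticsearch instance boils down to something like the sketch below, using the official elasticsearch Python client; the index name and field name are assumptions, not necessarily what the scripts use.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])


def retrieve_passages(question: str, index: str = "wiki_passages_pl", k: int = 5):
    """Full-text search over an index of Wikipedia passages (illustrative)."""
    response = es.search(
        index=index,
        body={"query": {"match": {"text": question}}, "size": k},
    )
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```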

Training and evaluation

The models were trained on a TPUv3 device. Model configurations are defined in the config/ folder. After training completes, inference is run using prompts from the files specified under config/task/<task>.yaml -> predict_files
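The main.py overrides below use Hydra-style syntax, so the task configuration can be inspected with OmegaConf. The snippet is a hedged example of listing the prediction prompt files; it assumes the task file loads standalone, and values interpolated from config/config.yaml (e.g. gs_base_path) may need to be merged in first.

```python
from omegaconf import OmegaConf

# Hedged example: list the prediction prompt files for the question answering task.
task_cfg = OmegaConf.load("config/task/question_answering.yaml")
for path in task_cfg.predict_files:
    print(path)
```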

  1. Start a TPUv3 device and a cloud instance, e.g. using the ctpu tool
ctpu up --name poleval --tpu-size=v3-8 -tf-version 2.5.0
  2. SSH into the TPU instance, download this repository and install the requirements
  3. Start the training (or resume from the latest checkpoint), specifying the task and model configuration
python3 main.py model=xxl task=question_answering +tpu_name=poleval
  4. (OCR only) Concatenate the corrected fragments to restore the full text
python3 -m data_preparation.ocr_correction.restore \
  gs://my-bucket/data/ocr/dev-0-input-384.txt-1100000 \
  dev-0-384.index \
  dev-0-restored.txt
  5. Evaluate the results using the geval tool
cd 2021-question-answering # or 2021-ocr-correction
gsutil cp gs://my-bucket/data/polish_qa/test-B-questions-510.tsv-1010000 test-B/out.tsv
./geval --test-name test-B

Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
