maciek-pioro / poleval-2021

PolEval 2021 submissions

This repository contains the winning submissions for Task 3: Post-correction of OCR results and Task 4: Question answering. Both submissions rely on fine-tuning the mT5 model on the respective task.

Solution details are described in the workshop proceedings.


Task 3 results

| model    | dev-0  | test-A | test-B |
|----------|--------|--------|--------|
| original | 16.550 | 16.527 | 16.543 |
| base     | 4.678  | 4.792  | 4.796  |
| large    | 4.418  | 4.515  | 4.559  |
| XXL      | 3.604  | 3.725  | 3.744  |

Task 4 results

| model | test-B |
|-------|--------|
| base  | 52.12  |
| large | 59.20  |
| XXL   | 71.68  |

Data preparation

Setup

Common steps for both tasks

  1. Install pip requirements
pip install -r requirements.txt
  2. Download the mT5 vocabulary to the repository root
gsutil cp gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model .
  3. Prepare a GCS bucket for storing training datasets: https://cloud.google.com/storage/docs/creating-buckets
  4. Update gs_base_path in config/config.yaml
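As a quick sanity check for step 2, the downloaded vocabulary can be loaded with the sentencepiece Python package. This is only a hedged sketch, not part of the original instructions; it assumes the sentencepiece package is available in your environment.

```python
import sentencepiece as spm

# Load the mT5 vocabulary downloaded in step 2 and report its size.
# For mc4.250000.100extra this should come out to 250,000 pieces plus 100 extra ids.
sp = spm.SentencePieceProcessor()
sp.Load("sentencepiece.model")
print(sp.GetPieceSize())
```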

OCR correction

The provided data contains pages of text that are in many instances longer than the maximum sequence length allowed by the model architecture. To address this, training examples are created by aligning and splitting longer input/output pairs.
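The actual alignment and splitting is done by data_preparation/ocr_correction/split_text.py; the snippet below is only a simplified illustration of the idea, assuming both sides are split on whitespace (the real code also has to keep differing token counts aligned).

```python
def split_aligned_pair(ocr_text: str, corrected_text: str, length_limit: int = 384):
    """Cut an aligned (OCR, corrected) page pair into chunks that fit the model.

    Illustrative only: tokens are approximated by whitespace splitting and the
    two sides are assumed to stay aligned chunk by chunk.
    """
    src, tgt = ocr_text.split(), corrected_text.split()
    chunks = []
    for start in range(0, max(len(src), len(tgt)), length_limit):
        chunks.append((
            " ".join(src[start:start + length_limit]),
            " ".join(tgt[start:start + length_limit]),
        ))
    return chunks
```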

  1. Pull the task repository
git clone -b secret https://github.com/poleval/2021-ocr-correction.git
  2. Split examples into chunks to match the maximum sequence length
python3 -m data_preparation.ocr_correction.split_text \
  2021-ocr-correction \
  --length-limit 384
  3. Upload the files to the created bucket and update or match the paths in config/task/ocr_correction.yaml. Keep the .index files to restore the full text from predictions

Question answering

For question answering, the model input prompt consists of the question and context passages retrieved from Wikipedia. This section shows how to reproduce the data used in the submission.

The prepared data is available here. Skip to step 5 if using this dataset.
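The exact prompt template is defined by the data preparation scripts; as a rough illustration, an input prompt might be assembled along the following lines. The separator and length budget here are hypothetical, not the submission's actual format.

```python
from typing import List


def build_prompt(question: str, passages: List[str], max_chars: int = 4000) -> str:
    """Concatenate a question with retrieved Wikipedia passages into one prompt."""
    prompt = question
    for passage in passages:
        if len(prompt) + len(passage) + 1 > max_chars:
            break  # keep the prompt within a rough length budget
        prompt += " " + passage
    return prompt
```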

  1. Pull the task repository
git clone -b secret https://github.com/poleval/2021-question-answering.git
  2. Start a local Elasticsearch instance using Docker (skip if using an existing cluster)
docker volume create poleval-es # recommended for persistence
docker run \
  -p 9200:9200 \
  -p 9300:9300 \
  -v poleval-es:/usr/share/elasticsearch/data \
  -e "discovery.type=single-node" \
  docker.elastic.co/elasticsearch/elasticsearch:7.13.4
  3. Download the spaCy model
python -m spacy download pl_core_news_md
  4. Index and retrieve context passages for the Polish QA dataset (see the retrieval sketch after this list)
python3 -m data_preparation.question_answering.quiz_pl \
  2021-question-answering \
  wiki_passages_pl
  5. Index and retrieve context passages for the TriviaQA dataset
python3 -m data_preparation.question_answering.trivia_qa wiki_passages_en
  6. Select only the questions for prediction
cat test-B-input-510.tsv | cut -f1 > test-B-questions-510.tsv
  7. Upload the files to the created bucket and update or match the paths in config/task/question_answering.yaml
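The indexing and retrieval in steps 4 and 5 are handled by the data_preparation.question_answering modules. For orientation only, querying the local Elasticsearch instance boils down to something like the sketch below, using the official elasticsearch Python client; the index name and field name are assumptions, not necessarily what the scripts use.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])


def retrieve_passages(question: str, index: str = "wiki_passages_pl", k: int = 5):
    """Full-text search over an index of Wikipedia passages (illustrative)."""
    response = es.search(
        index=index,
        body={"query": {"match": {"text": question}}, "size": k},
    )
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```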

Training and evaluation

The models were trained on a TPUv3 device. Model configurations are defined in the config/ folder. After training completes, inference is run using prompts from the files specified under config/task/<task>.yaml -> predict_files
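The main.py overrides below use Hydra-style syntax, so the task configuration can be inspected with OmegaConf. The snippet is a hedged example of listing the prediction prompt files; it assumes the task file loads standalone, and values interpolated from config/config.yaml (e.g. gs_base_path) may need to be merged in first.

```python
from omegaconf import OmegaConf

# Hedged example: list the prediction prompt files for the question answering task.
task_cfg = OmegaConf.load("config/task/question_answering.yaml")
for path in task_cfg.predict_files:
    print(path)
```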

  1. Start a TPUv3 device and a cloud instance, e.g. using the ctpu tool
ctpu up --name poleval --tpu-size=v3-8 -tf-version 2.5.0
  2. SSH into the TPU instance, download this repository and install the requirements
  3. Start the training (or resume from the latest checkpoint), specifying the task and model configuration
python3 main.py model=xxl task=question_answering +tpu_name=poleval
  4. (OCR only) Concatenate the corrected fragments to restore the full text
python3 -m data_preparation.ocr_correction.restore \
  gs://my-bucket/data/ocr/dev-0-input-384.txt-1100000 \
  dev-0-384.index \
  dev-0-restored.txt
  5. Evaluate the results using the geval tool
cd 2021-question-answering # or 2021-ocr-correction
gsutil cp gs://my-bucket/data/polish_qa/test-B-questions-510.tsv-1010000 test-B/out.tsv
./geval --test-name test-B

Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
