Solution of RuNormAS

Solution for the RuNormAS competition at Dialogue 2021. Paper.

The solution is based on RuGPT3XL, fine-tuned on the competition training data. The best model was trained on the training data plus Lenta news.

The solution achieved 0.96452 (generic) and 0.95750 (named) accuracy as measured by the organizers. When evaluation errors are excluded, our best model achieves 0.976700 (generic) and 0.980977 (named) accuracy.
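To make the metric concrete, accuracy can be read as the share of predicted normal forms that exactly match the reference forms. The following is a minimal sketch under that assumption; the organizers' evaluation script may normalize or compare strings differently, and the function name `accuracy` is hypothetical.

```python
def accuracy(predictions, references):
    """Share of predictions that exactly match the reference (hypothetical helper).

    Assumption: a prediction counts as correct only if it equals the
    reference string after trimming surrounding whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)
```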

Usage

A usage example is available here.

The pretrained model is available here.

Prepare solution

Before running any code, create the following directories:

mkdir -p ../models/xl
mkdir -p ../data
mkdir -p ../test_pred

Download data

Download the training data.

/bin/bash ./scripts/get_data_from_repo.sh ../data/

Note: this repository contains no code for downloading the Lenta news data.

Process data for LM

Read the data and build the files for the language model (for the best solution).

/bin/bash ./scripts/process_data_from_repo.sh

Split the data into training and validation sets.

python modules/data/split_data.py
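The internals of modules/data/split_data.py are not shown here. As a rough illustration of the idea only, a reproducible random split could look like the sketch below; the function name, the 10% validation fraction, and the seed are assumptions, not values taken from the repository.

```python
import random


def split_train_valid(items, valid_fraction=0.1, seed=42):
    """Shuffle items reproducibly and split them into (train, valid).

    Hypothetical helper: valid_fraction and seed are illustrative defaults.
    """
    items = list(items)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rng.shuffle(items)
    n_valid = int(len(items) * valid_fraction)
    return items[n_valid:], items[:n_valid]
```

Fixing the seed keeps the validation set stable between runs, so validation scores of different checkpoints stay comparable.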

Run training

Install env

sh scripts/install_env_for_gpt.sh

Run finetuning

Finetune RuGPT3XL.

cd scripts
sh deepspeed_xl_runormas_v14_130k_finetune.sh

Predict for competition

Prepare baseline

First, make predictions with the baseline.

python baseline.py

Predict with model

Run the following commands:

cd scripts
sh deepspeed_xl_runormas_v14_130k_finetune.sh
sh xl_runormas_pred_distributed_v14_130k_finetune_10f.sh

The script xl_runormas_pred_distributed_v14_130k_finetune_10f.sh is needed to predict the 10 files that were not predicted by deepspeed_xl_runormas_v14_130k_finetune.sh (these files were truncated during distributed generation).
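One way to find which files still need prediction is to compare line counts between the inputs and the predictions. The sketch below assumes that each prediction file should contain at least as many lines as its annotation file and that both share a base name; the helper names and the `.ann` / `.norm` extensions are assumptions and may need adjusting to the actual layout.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def find_incomplete(ann_dir, pred_dir, ann_ext=".ann", pred_ext=".norm"):
    """Return base names whose prediction is missing or shorter than the input.

    Hypothetical helper: the extensions are illustrative defaults.
    """
    incomplete = []
    for ann_path in sorted(Path(ann_dir).glob(f"*{ann_ext}")):
        pred_path = Path(pred_dir) / (ann_path.stem + pred_ext)
        if not pred_path.exists() or count_lines(pred_path) < count_lines(ann_path):
            incomplete.append(ann_path.stem)
    return incomplete
```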

Post-processing for submission

For post-processing, run the notebook make_prediction.ipynb. This step improves accuracy by around 1-2%.

Create the archive for submission:

cd ../test_pred/v14_130k_finetune_fixed
7z a submission.zip *
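If 7z is not installed, a similar archive can be built with Python's standard zipfile module. This is a sketch, assuming the submission only needs the regular files of the prediction directory stored at the archive root; the function name `make_submission` is hypothetical, and unlike `7z a submission.zip *` it does not descend into subdirectories.

```python
import zipfile
from pathlib import Path


def make_submission(pred_dir, archive_name="submission.zip"):
    """Zip every regular file in pred_dir into a flat archive (hypothetical helper)."""
    pred_dir = Path(pred_dir)
    archive_path = pred_dir / archive_name
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(pred_dir.iterdir()):
            # Skip subdirectories and the archive that is being written.
            if path.is_file() and path != archive_path:
                zf.write(path, arcname=path.name)
    return archive_path
```

For example, `make_submission("../test_pred/v14_130k_finetune_fixed")` would write submission.zip into that directory.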

Error analysis

We also include an error_analysis notebook for our best model.
