jiiiisoo / AIC-kpmg2023

AIC-kpmg2023

This project fine-tunes pretrained KoELECTRA models on the KorQuAD 2.1 dataset. Specifically,

  • We added a data preprocessing step
  • We modified the transformer to fit the KorQuAD 2.1 dataset
  • We implemented a sliding window over long contexts to improve accuracy
  • We created our own Q&A datasets from business reports and used them for training

For the backend and frontend of AIC, see AIC-BE / AIC-FE.

Preparation

Data Preprocessing

To remove unnecessary HTML tags from the data files, run:

python tag_remover.py --task korquad --config_file koelectra-base-v3.json
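The internals of tag_remover.py are not shown here, so the following is only a sketch of what tag removal can look like, using Python's standard `html.parser`. The names `_TextExtractor` and `strip_tags` are hypothetical, and this version drops every tag, whereas the real script may keep some.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the contents of <script>/<style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes with spaces and collapse runs of whitespace.
        # Note: this also puts a space where a tag splits a single word.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `strip_tags("<p>Hello <b>world</b></p>")` yields the plain text of the paragraph with the markup removed.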

Training/Validation

Clone the KoELECTRA repo to your own machine, then copy our files into the KoELECTRA/finetune directory, overwriting the originals.

To train the model, run:

python run_squad.py --task korquad --config_file koelectra-base-v3.json

To validate the model, run:

python run_squad.py --task korquad --config_file koelectra-base-v3_test.json

Making Custom QA Dataset

To build a custom dataset in KorQuAD 2.1 format from the target files, run:

python make_custom_dataset.py --data_dir {directory containing html files} --name 정빈

Use --name to tell contributors apart when more than one person is building the dataset. The name is used to generate unique IDs.
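To show why a per-contributor name keeps IDs unique, here is a hedged sketch, not the logic of make_custom_dataset.py. The helpers `make_qa_id` and `build_document` are hypothetical, and the record is simplified: the field names follow a SQuAD-style layout as an assumption and should be checked against an official KorQuAD 2.1 file, which also carries HTML-specific fields.

```python
def make_qa_id(name, doc_idx, qa_idx):
    """Build an ID that stays unique across contributors (assumed scheme)."""
    return f"{name}_{doc_idx:05d}_{qa_idx:03d}"


def build_document(name, doc_idx, title, context, qa_pairs):
    """Build one simplified document record from (question, answer) pairs.

    Each answer must occur verbatim in `context`; its first occurrence is
    used as the answer start offset.
    """
    qas = []
    for qa_idx, (question, answer_text) in enumerate(qa_pairs):
        start = context.find(answer_text)
        if start < 0:
            raise ValueError(f"answer not found in context: {answer_text!r}")
        qas.append({
            "id": make_qa_id(name, doc_idx, qa_idx),
            "question": question,
            "answer": {"text": answer_text, "answer_start": start},
        })
    return {"title": title, "context": context, "qas": qas}
```

Because the contributor name is part of every ID, two people who both start numbering their documents at zero still produce distinct IDs when their files are merged.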
