Turkish-GEC

This repository contains the artifacts (models, datasets, and code) for the paper "Organic Data-Driven Approach for Turkish Grammatical Error Correction and LLMs".

Datasets

The following is an overview of the datasets used in this work. The datasets in the top half are synthetic, while those in the bottom half, the evaluation sets, are human-annotated. An error type of ERRANT means that error types were assigned by ERRANT, a tool that automatically annotates parallel sentences with error-type information. Token counts are based on OpenAI's tiktoken tokenizer with the gpt2 encoding.

Dataset Name	Split	Sentences	Tokens	Error Types	Domain
OSCAR GEC (ours)	Train	2.3m	213.2m	ERRANT	Web
GPT GEC (ours)	Train	100k	3.6m	ERRANT	Web
GECTurk (Kara et al., 2023)	Train	138k	5.8m	25	Newspapers
OSCAR GEC (ours)	Test	2.4k	142k	ERRANT	Web
Movie Reviews (Kara et al., 2023)	Test	300	2.7k	25	Movie Reviews
Turkish Tweets (Koksal et al., 2020a)	Test	2k	116.2k	13	Tweets
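
The token counts in the table above are described as being based on tiktoken with the gpt2 encoding. The snippet below is a minimal sketch of how such a count could be reproduced; it is not the authors' script. The helper name count_tokens and its encode parameter are illustrative assumptions, and the default path requires the third-party tiktoken package.

```python
from typing import Callable, Iterable, List, Optional


def count_tokens(
    sentences: Iterable[str],
    encode: Optional[Callable[[str], List]] = None,
) -> int:
    """Return the total number of tokens over all sentences.

    `encode` maps a string to a list of tokens. When omitted, the gpt2
    encoding from tiktoken is loaded lazily (assumes tiktoken is installed).
    """
    if encode is None:
        import tiktoken  # third-party; imported only when actually needed

        encode = tiktoken.get_encoding("gpt2").encode
    return sum(len(encode(sentence)) for sentence in sentences)


# Example call (needs tiktoken; the sentences are made-up samples):
# count_tokens(["Bugün hava çok güzel.", "Yarın okula gideceğim."])
```

Passing the encoder as an argument keeps the counting logic independent of the tokenizer, so the same function can be checked with a simple whitespace splitter.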

In addition to the above datasets, we also open-source the Turkish Spelling Dictionary developed in this work. You may access it from here

Models

The following are our mT5 models fine-tuned for the Turkish Grammatical Error Correction task on our two training datasets, OSCAR GEC and GPT GEC. The models are available on HuggingFace:

Model 1: Turkish-OSCAR-GEC

Model 2: Turkish-GPT-GEC
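
Since the models listed above are fine-tuned mT5 checkpoints hosted on HuggingFace, they can presumably be loaded with the transformers library. The following is a minimal sketch under several assumptions: MODEL_ID is a hypothetical placeholder (substitute the actual repository id), the helper names load and correct are illustrative, and no task prefix is assumed to be required on the input.

```python
from typing import List

# Hypothetical placeholder, NOT a real repository id: replace it with the
# actual HuggingFace id of Turkish-OSCAR-GEC or Turkish-GPT-GEC.
MODEL_ID = "your-namespace/Turkish-OSCAR-GEC"


def load(model_id: str = MODEL_ID):
    """Load the tokenizer and seq2seq model (needs transformers and torch)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def correct(sentences: List[str], tokenizer, model, max_new_tokens: int = 128) -> List[str]:
    """Run the model on a batch of sentences and decode the generated text."""
    # Assumption: the raw sentence is the model input, with no task prefix.
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Example call (downloads the checkpoint; the sentence is a made-up sample):
# tokenizer, model = load()
# correct(["bugün hava cok güzel"], tokenizer, model)
```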

Results

The following are the results of Turkish GEC models on three evaluation sets:

Eval set 1: OSCAR GEC (ours)

Model	P	R	F0.5
GPT GEC (mT5)	69.8	44.9	62.8
OSCAR GEC (mT5)	68.7	31.2	55.4
GECTurk (mT5)	42.5	5.7	18.2
GECTurk (Seq Tagger) (Kara et al., 2023)	49.0	3.9	14.7

Eval set 2: Turkish Tweets (Koksal et al., 2020a)

Model	P	R	F0.5
OSCAR GEC (mT5)	85.1	61.3	79.0
GPT GEC (mT5)	77.7	68.9	75.8
GECTurk (Seq Tagger) (Kara et al., 2023)	64.7	19.8	44.5
GECTurk (mT5)	57.2	20.7	42.3

Eval set 3: Movie Reviews (Kara et al., 2023)

Model	P	R	F0.5
GECTurk (Seq Tagger) (Kara et al., 2023)	86.5	76.2	84.2
GECTurk (mT5)	73.1	71.8	72.8
GPT GEC (mT5)	36.0	46.3	37.6
OSCAR GEC (mT5)	30.0	22.5	28.1
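
The F0.5 column in the tables above combines precision (P) and recall (R), weighting precision more heavily. As a minimal sketch, assuming the standard F-beta definition with beta = 0.5 (the function name f_beta is illustrative), the score can be recomputed from P and R as follows. Because the tables report rounded P and R, recomputed values may not match every row exactly.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# GPT GEC (mT5) on the OSCAR GEC eval set: P = 69.8, R = 44.9
print(round(f_beta(69.8, 44.9), 1))  # 62.8, as reported in the table
```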
