IS2AI / KazParC

An open-source parallel corpus for machine translation across Kazakh, English, Russian, and Turkish


🌐 KazParC 📝


This repository provides the dataset and the neural machine translation model, nicknamed Tilmash, for the paper
KazParC: Kazakh Parallel Corpus for Machine Translation

Domains ℹ️

We collected data for our Kazakh Parallel Corpus (referred to as KazParC) from a diverse range of textual sources in Kazakh, English, Russian, and Turkish.

We categorised the data acquired from these sources into five broad domains:

| Domain | # lines | % | # tokens (EN) | % | # tokens (KK) | % | # tokens (RU) | % | # tokens (TR) | % |
|---|---|---|---|---|---|---|---|---|---|---|
| Mass media | 120,547 | 32.4 | 1,817,276 | 28.3 | 1,340,346 | 28.6 | 1,454,430 | 29.0 | 1,311,985 | 28.5 |
| General | 94,988 | 25.5 | 844,541 | 13.1 | 578,236 | 12.3 | 618,960 | 12.3 | 608,020 | 13.2 |
| Legal documents | 77,183 | 20.8 | 2,650,626 | 41.3 | 1,925,561 | 41.0 | 1,991,222 | 39.7 | 1,880,081 | 40.8 |
| Education and science | 46,252 | 12.4 | 522,830 | 8.1 | 392,348 | 8.4 | 444,786 | 8.9 | 376,484 | 8.2 |
| Fiction | 32,932 | 8.9 | 589,001 | 9.2 | 456,385 | 9.7 | 510,168 | 10.2 | 433,968 | 9.4 |
| Total | 371,902 | 100 | 6,424,274 | 100 | 4,692,876 | 100 | 5,019,566 | 100 | 4,610,538 | 100 |

Data Collection 📅

We started the data collection process in July 2021, and it continued until September 2023. During this period, we collected a vast amount of text materials and their translations.

Our team of linguists played a crucial role in ensuring the quality of the data. They carefully reviewed the collected data, screening it for inappropriate content. The next step involved segmenting the data into individual sentences, with each sentence labelled with a domain identifier. We also paid close attention to grammar and spelling accuracy and removed any duplicate sentences.

Kazakh-Russian code-switching is a common practice in Kazakhstan, so we took steps to maintain uniformity. For sentences containing both Kazakh and Russian words, we initiated a modification process. This process involved translating the Russian elements into Kazakh while preserving the intended meaning of the sentences.

Data Pre-Processing 🧹

We organised the data into language pairs. We then carefully removed any unwanted characters and effectively replaced homoglyphs. We also took care of formatting issues by eliminating line breaks (\n) and carriage returns (\r). We identified and removed duplicate entries, making sure to filter out rows with identical text in both language columns. However, to make our corpus more diverse and include a broader range of synonyms for different words and expressions, we decided to keep lines with duplicate text within a single language column.
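The cleaning and deduplication steps above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the column names are assumptions, and homoglyph replacement is omitted because it is language-specific and the project's replacement table is not shown here.

```python
import re
import pandas as pd

def clean(text: str) -> str:
    # Strip line breaks (\n) and carriage returns (\r), then collapse whitespace.
    text = text.replace("\n", " ").replace("\r", " ")
    return re.sub(r"\s+", " ", text).strip()

def preprocess(df: pd.DataFrame, src: str, tgt: str) -> pd.DataFrame:
    df = df.copy()
    for col in (src, tgt):
        df[col] = df[col].map(clean)
    # Filter out rows with identical text in both language columns, then drop
    # duplicate pairs. Duplicates *within* a single column are deliberately
    # kept, to preserve synonym variety as described above.
    df = df[df[src] != df[tgt]].drop_duplicates(subset=[src, tgt])
    return df.reset_index(drop=True)
```

A run over a toy two-column frame removes the identical-text row and the duplicated pair while leaving the cleaned sentence intact.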

The table below gives statistics for the language pairs in our corpus. The column labelled '# lines' shows the total number of rows for each language pair. The columns labelled '# sents', '# tokens', and '# types' give the counts of unique sentences, tokens, and word types: in each of these, the first number corresponds to the first language in the pair and the second number to the second language. Token and type counts were obtained after processing the data with Moses Tokenizer 1.2.1.

| Pair | # lines | # sents | # tokens | # types |
|---|---|---|---|---|
| KK↔EN | 363,594 | 362,230 / 361,087 | 4,670,789 / 6,393,381 | 184,258 / 59,062 |
| KK↔RU | 363,482 | 362,230 / 362,748 | 4,670,593 / 4,996,031 | 184,258 / 183,204 |
| KK↔TR | 362,150 | 362,230 / 361,660 | 4,668,852 / 4,586,421 | 184,258 / 175,145 |
| EN↔RU | 363,456 | 361,087 / 362,748 | 6,392,301 / 4,994,310 | 59,062 / 183,204 |
| EN↔TR | 362,392 | 361,087 / 361,660 | 6,380,703 / 4,579,375 | 59,062 / 175,145 |
| RU↔TR | 363,324 | 362,748 / 361,660 | 4,999,850 / 4,591,847 | 183,204 / 175,145 |

Data Splitting ✂️

We began by creating a test set. To do this, we randomly selected 250 unique, non-repeating rows from each of the sources outlined in Domains. The remaining data for each language pair were then divided into training and validation sets with an 80/20 split, while preserving the distribution of domains in both sets.
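The domain-stratified 80/20 split can be sketched with pandas. This is a sketch under assumptions: the "domain" column name follows the corpus files, and the seed is arbitrary.

```python
import pandas as pd

def stratified_split(df: pd.DataFrame, valid_frac: float = 0.2, seed: int = 42):
    # Sample the validation fraction *within* each domain, so that the domain
    # distribution is preserved in both the training and validation sets.
    valid = df.groupby("domain", group_keys=False).sample(
        frac=valid_frac, random_state=seed
    )
    train = df.drop(valid.index)
    return train.reset_index(drop=True), valid.reset_index(drop=True)
```

Sampling per group (rather than over the whole frame) is what keeps each domain's share identical across the two splits.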

| Pair | Train lines | Train sents | Train tokens | Train types | Valid lines | Valid sents | Valid tokens | Valid types | Test lines | Test sents | Test tokens | Test types |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KK↔EN | 290,877 | 286,958 / 286,197 | 3,693,263 / 5,057,687 | 164,766 / 54,311 | 72,719 | 72,426 / 72,403 | 920,482 / 1,259,827 | 83,057 / 32,063 | 4,750 | 4,750 / 4,750 | 57,044 / 75,867 | 17,475 / 9,729 |
| KK↔RU | 290,785 | 286,943 / 287,215 | 3,689,799 / 3,945,741 | 164,995 / 165,882 | 72,697 | 72,413 / 72,439 | 923,750 / 988,374 | 82,958 / 87,519 | 4,750 | 4,750 / 4,750 | 57,044 / 61,916 | 17,475 / 18,804 |
| KK↔TR | 289,720 | 286,694 / 286,279 | 3,691,751 / 3,626,361 | 164,961 / 157,460 | 72,430 | 72,211 / 72,190 | 920,057 / 904,199 | 82,698 / 80,885 | 4,750 | 4,750 / 4,750 | 57,044 / 55,861 | 17,475 / 17,284 |
| EN↔RU | 290,764 | 286,185 / 287,261 | 5,058,530 / 3,950,362 | 54,322 / 165,701 | 72,692 | 72,377 / 72,427 | 1,257,904 / 982,032 | 32,208 / 87,541 | 4,750 | 4,750 / 4,750 | 75,867 / 61,916 | 9,729 / 18,804 |
| EN↔TR | 289,913 | 285,967 / 286,288 | 5,048,274 / 3,621,531 | 54,224 / 157,369 | 72,479 | 72,220 / 72,219 | 1,256,562 / 901,983 | 32,269 / 80,838 | 4,750 | 4,750 / 4,750 | 75,867 / 55,861 | 9,729 / 17,284 |
| RU↔TR | 290,899 | 287,241 / 286,475 | 3,947,809 / 3,626,436 | 165,482 / 157,470 | 72,725 | 72,455 / 72,362 | 990,125 / 909,550 | 87,831 / 80,962 | 4,750 | 4,750 / 4,750 | 61,916 / 55,861 | 18,804 / 17,284 |

Synthetic Corpus 🧪

To make our parallel corpus more extensive and diverse and to explore how well our translation models perform when dealing with a combination of human-translated and machine-translated content, we carried out web crawling to gather a total of 1,797,066 sentences from English-language websites. These sentences were then automatically translated into Kazakh, Russian, and Turkish using the Google Translate service. In the context of our research, we refer to this collection of data as 'SynC' (Synthetic Corpus).

| Pair | # lines | # sents | # tokens | # types |
|---|---|---|---|---|
| KK↔EN | 1,787,050 | 1,782,192 / 1,781,019 | 26,630,960 / 35,291,705 | 685,135 / 300,556 |
| KK↔RU | 1,787,448 | 1,782,192 / 1,777,500 | 26,654,195 / 30,241,895 | 685,135 / 672,146 |
| KK↔TR | 1,791,425 | 1,782,192 / 1,782,257 | 26,726,439 / 27,865,860 | 685,135 / 656,294 |
| EN↔RU | 1,784,513 | 1,781,019 / 1,777,500 | 35,244,800 / 30,175,611 | 300,556 / 672,146 |
| EN↔TR | 1,788,564 | 1,781,019 / 1,782,257 | 35,344,188 / 27,806,708 | 300,556 / 656,294 |
| RU↔TR | 1,788,027 | 1,777,500 / 1,782,257 | 30,269,083 / 27,816,210 | 672,146 / 656,294 |

We further divided the synthetic corpus into training and validation sets with a 90/10 ratio.

| Pair | Train lines | Train sents | Train tokens | Train types | Valid lines | Valid sents | Valid tokens | Valid types |
|---|---|---|---|---|---|---|---|---|
| KK↔EN | 1,608,345 | 1,604,414 / 1,603,426 | 23,970,260 / 31,767,617 | 650,144 / 286,372 | 178,705 | 178,654 / 178,639 | 2,660,700 / 3,524,088 | 208,838 / 105,517 |
| KK↔RU | 1,608,703 | 1,604,468 / 1,600,643 | 23,992,148 / 27,221,583 | 650,170 / 642,604 | 178,745 | 178,691 / 178,642 | 2,662,047 / 3,020,312 | 209,188 / 235,642 |
| KK↔TR | 1,612,282 | 1,604,793 / 1,604,822 | 24,053,671 / 25,078,688 | 650,384 / 626,724 | 179,143 | 179,057 / 179,057 | 2,672,768 / 2,787,172 | 209,549 / 221,773 |
| EN↔RU | 1,606,061 | 1,603,199 / 1,600,372 | 31,719,781 / 27,158,101 | 286,645 / 642,686 | 178,452 | 178,419 / 178,379 | 3,525,019 / 3,017,510 | 104,834 / 235,069 |
| EN↔TR | 1,609,707 | 1,603,636 / 1,604,545 | 31,805,393 / 25,022,782 | 286,387 / 626,740 | 178,857 | 178,775 / 178,796 | 3,538,795 / 2,783,926 | 105,641 / 221,372 |
| RU↔TR | 1,609,224 | 1,600,605 / 1,604,521 | 27,243,278 / 25,035,274 | 642,797 / 626,587 | 178,803 | 178,695 / 178,750 | 3,025,805 / 2,780,936 | 235,970 / 221,792 |

Data Vectorisation 🧮

The data underwent vectorisation using HuggingFace's transformers and datasets libraries. Each language pair was vectorised individually based on the source and target languages within the pair. The vectorised data sets were then combined into unified training and validation sets, each covering all six language pairs in both translation directions. For more details, see data_tokenization.ipynb.
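The idea of combining per-pair data into one bidirectional set can be illustrated in plain Python. This is a conceptual sketch only; the field names are illustrative, and the project's actual implementation (which uses HuggingFace datasets) is in data_tokenization.ipynb.

```python
# The six language pairs of the corpus.
PAIRS = [("kk", "en"), ("kk", "ru"), ("kk", "tr"),
         ("en", "ru"), ("en", "tr"), ("ru", "tr")]

def bidirectional_examples(rows_by_pair):
    """rows_by_pair maps (src, tgt) -> list of (src_text, tgt_text) tuples."""
    examples = []
    for src, tgt in PAIRS:
        for s, t in rows_by_pair.get((src, tgt), []):
            # Emit each sentence pair in both translation directions, so a
            # single model can be trained to translate either way.
            examples.append({"src_lang": src, "tgt_lang": tgt, "src": s, "tgt": t})
            examples.append({"src_lang": tgt, "tgt_lang": src, "src": t, "tgt": s})
    return examples
```

Each stored pair thus contributes two training examples, one per direction.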

Corpus Structure 🗂️

The corpus is organised into two distinct groups based on their file prefixes. Files "01" through "19" have the "kazparc" prefix, while files "20" through "32" have the "sync" prefix.

├── kazparc
   ├── 01_kazparc_all_entries.csv
   ├── 02_kazparc_train_kk_en.csv
   ├── 03_kazparc_train_kk_ru.csv
   ├── 04_kazparc_train_kk_tr.csv
   ├── 05_kazparc_train_en_ru.csv
   ├── 06_kazparc_train_en_tr.csv
   ├── 07_kazparc_train_ru_tr.csv
   ├── 08_kazparc_valid_kk_en.csv
   ├── 09_kazparc_valid_kk_ru.csv
   ├── 10_kazparc_valid_kk_tr.csv
   ├── 11_kazparc_valid_en_ru.csv
   ├── 12_kazparc_valid_en_tr.csv
   ├── 13_kazparc_valid_ru_tr.csv
   ├── 14_kazparc_test_kk_en.csv
   ├── 15_kazparc_test_kk_ru.csv
   ├── 16_kazparc_test_kk_tr.csv
   ├── 17_kazparc_test_en_ru.csv
   ├── 18_kazparc_test_en_tr.csv
   ├── 19_kazparc_test_ru_tr.csv
├── sync
   ├── 20_sync_all_entries.csv
   ├── 21_sync_train_kk_en.csv
   ├── 22_sync_train_kk_ru.csv
   ├── 23_sync_train_kk_tr.csv
   ├── 24_sync_train_en_ru.csv
   ├── 25_sync_train_en_tr.csv
   ├── 26_sync_train_ru_tr.csv
   ├── 27_sync_valid_kk_en.csv
   ├── 28_sync_valid_kk_ru.csv
   ├── 29_sync_valid_kk_tr.csv
   ├── 30_sync_valid_en_ru.csv
   ├── 31_sync_valid_en_tr.csv
   ├── 32_sync_valid_ru_tr.csv

KazParC files:

  • File "01" contains the original, unprocessed text data for the four languages considered within KazParC.
  • Files "02" through "19" represent pre-processed texts divided into language pairs for training (Files "02" to "07"), validation (Files "08" to "13"), and testing (Files "14" to "19"). Language pairs are indicated within the filenames using two-letter language codes (e.g., kk_en).

SynC files:

  • File "20" contains raw, unprocessed text data for the four languages.
  • Files "21" to "32" contain pre-processed text divided into language pairs for training (Files "21" to "26") and validation (Files "27" to "32") purposes.

In both "01" and "20", each line consists of a unique line identifier (id), texts in Kazakh (kk), English (en), Russian (ru), and Turkish (tr), along with accompanying domain information (domain). For the other files, the data fields are id, source_lang, target_lang, domain, and the language pair (e.g., kk_en).

Experimental Setup 🔬

In our study, we used Facebook's NLLB model, which supports translation for a wide range of languages, including Kazakh, English, Russian, and Turkish. We initially tested two versions, the baseline and the distilled model, fine-tuning both on KazParC data. The distilled model consistently outperformed the baseline, though by a small margin of just 0.01 BLEU. Consequently, we focused our subsequent experiments exclusively on fine-tuning the distilled model.

We trained a total of four models:

  1. 'base', the off-the-shelf model.
  2. 'parc', fine-tuned on KazParC data.
  3. 'sync', fine-tuned on SynC data.
  4. 'parsync', fine-tuned on both KazParC and SynC data.

We fine-tuned these models using hyperparameters tuned with validation sets. We included synthetic data in the validation sets only when assessing the performance of the 'sync' and 'parsync' models. The best-performing models were then evaluated on the test sets.

In addition to the KazParC test set, we used the FLoRes dataset. We merged the dev and devtest sets from FLoRes into one set for our evaluation. We also explored language pairs, such as German-French, German-Ukrainian, and French-Uzbek, to assess how fine-tuning the model affected translation quality for different language pairs.

All the models were fine-tuned using eight GPUs on an NVIDIA DGX A100 machine. We set an initial learning rate of 2 × 10⁻⁵ and used the AdaFactor optimisation algorithm. Training spanned three epochs, with both the training and evaluation batch sizes set to 8. To start training the model, create a virtual environment and install the necessary requirements from the environment.yaml file:

conda create --name kazparc python=3.8.17
conda env update --name kazparc --file environment.yaml

Once you have completed the above steps, you are ready to run the train.py script using the command:

python3 -m torch.distributed.launch --nproc_per_node 8 --nnodes 1 train.py
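For reference, the hyperparameters above can be collected in one place. The key names mirror those of transformers' Seq2SeqTrainingArguments; whether train.py uses exactly these names is an assumption.

```python
# Training configuration described above (key names follow the
# transformers convention; this is a reference sketch, not train.py itself).
hparams = {
    "learning_rate": 2e-5,            # initial learning rate
    "optim": "adafactor",             # AdaFactor optimiser
    "num_train_epochs": 3,            # three epochs
    "per_device_train_batch_size": 8, # training batch size
    "per_device_eval_batch_size": 8,  # evaluation batch size
}
```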

Evaluation Metrics 📏

In our evaluation of machine translation models, we used two widely recognised metrics:

  1. BLEU measures how closely machine-produced translations match human references, based on modified n-gram precision over n-grams of up to four words.
  2. chrF evaluates translation quality using character n-grams, making it well suited to morphologically rich languages such as Kazakh and Turkish. It computes the harmonic mean of character-based precision and recall, offering a robust evaluation of translation performance.

Both BLEU and chrF scores range from 0 to 1, where higher scores indicate better translation quality.
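To illustrate the idea behind chrF, here is a deliberately simplified, single-order version. The real metric averages over several n-gram orders and treats whitespace and word n-grams differently; use an established library such as sacrebleu for actual evaluation.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # Multiset of overlapping character n-grams.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis: str, reference: str, n: int = 3, beta: float = 2.0) -> float:
    # F_beta over character n-grams of a single order n; beta = 2 weights
    # recall twice as heavily as precision, as in chrF.
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

An exact match scores 1.0, no shared character trigrams scores 0.0, and partial overlaps fall in between.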

Experiment Results 📈

We translated the test sets using the translate_test_set.py script and computed the BLEU and chrF metrics with evaluation.ipynb. Below are the results obtained for our models, alongside the Yandex and Google translation services, on the FLoRes and KazParC test sets.

| Pair | base | parc | sync | parsync | Yandex | Google |
|---|---|---|---|---|---|---|
| EN→KK | 0.11 / 0.49 | 0.14 / 0.56 | 0.20 / 0.60 | 0.20 / 0.60 | 0.18 / 0.58 | 0.20 / 0.60 |
| EN→RU | 0.25 / 0.56 | 0.26 / 0.58 | 0.28 / 0.60 | 0.28 / 0.60 | 0.32 / 0.63 | 0.31 / 0.62 |
| EN→TR | 0.19 / 0.58 | 0.22 / 0.61 | 0.27 / 0.65 | 0.27 / 0.65 | 0.29 / 0.66 | 0.30 / 0.66 |
| KK→EN | 0.28 / 0.59 | 0.32 / 0.62 | 0.31 / 0.62 | 0.32 / 0.63 | 0.30 / 0.62 | 0.36 / 0.65 |
| KK→RU | 0.15 / 0.49 | 0.17 / 0.51 | 0.18 / 0.52 | 0.18 / 0.52 | 0.18 / 0.52 | 0.20 / 0.53 |
| KK→TR | 0.09 / 0.48 | 0.13 / 0.52 | 0.14 / 0.54 | 0.14 / 0.54 | 0.12 / 0.52 | 0.17 / 0.56 |
| RU→EN | 0.31 / 0.62 | 0.32 / 0.63 | 0.32 / 0.63 | 0.32 / 0.63 | 0.33 / 0.64 | 0.35 / 0.65 |
| RU→KK | 0.08 / 0.49 | 0.10 / 0.52 | 0.13 / 0.53 | 0.13 / 0.54 | 0.12 / 0.54 | 0.13 / 0.54 |
| RU→TR | 0.10 / 0.49 | 0.12 / 0.52 | 0.14 / 0.54 | 0.14 / 0.54 | 0.13 / 0.54 | 0.17 / 0.56 |
| TR→EN | 0.34 / 0.64 | 0.35 / 0.65 | 0.36 / 0.66 | 0.36 / 0.66 | 0.38 / 0.67 | 0.39 / 0.67 |
| TR→KK | 0.07 / 0.45 | 0.10 / 0.51 | 0.13 / 0.54 | 0.13 / 0.54 | 0.12 / 0.53 | 0.13 / 0.54 |
| TR→RU | 0.15 / 0.48 | 0.17 / 0.51 | 0.18 / 0.52 | 0.19 / 0.53 | 0.20 / 0.54 | 0.21 / 0.54 |
| Average | 0.18 / 0.53 | 0.20 / 0.56 | 0.22 / 0.58 | 0.22 / 0.58 | 0.23 / 0.58 | 0.25 / 0.59 |

BLEU / chrF scores for the models on the FLoRes test set

| Pair | base | parc | sync | parsync | Yandex | Google |
|---|---|---|---|---|---|---|
| EN→KK | 0.12 / 0.51 | 0.18 / 0.58 | 0.18 / 0.58 | 0.21 / 0.60 | 0.18 / 0.58 | 0.30 / 0.65 |
| EN→RU | 0.31 / 0.64 | 0.38 / 0.68 | 0.35 / 0.66 | 0.38 / 0.68 | 0.39 / 0.70 | 0.41 / 0.71 |
| EN→TR | 0.19 / 0.59 | 0.22 / 0.62 | 0.25 / 0.63 | 0.25 / 0.64 | 0.27 / 0.64 | 0.34 / 0.68 |
| KK→EN | 0.24 / 0.55 | 0.33 / 0.62 | 0.24 / 0.57 | 0.32 / 0.62 | 0.28 / 0.60 | 0.31 / 0.62 |
| KK→RU | 0.22 / 0.56 | 0.29 / 0.63 | 0.24 / 0.59 | 0.29 / 0.63 | 0.29 / 0.63 | 0.29 / 0.61 |
| KK→TR | 0.10 / 0.47 | 0.15 / 0.54 | 0.14 / 0.52 | 0.16 / 0.55 | 0.13 / 0.52 | 0.23 / 0.59 |
| RU→EN | 0.34 / 0.63 | 0.43 / 0.71 | 0.34 / 0.65 | 0.42 / 0.70 | 0.43 / 0.71 | 0.42 / 0.71 |
| RU→KK | 0.15 / 0.55 | 0.21 / 0.61 | 0.18 / 0.58 | 0.22 / 0.62 | 0.23 / 0.62 | 0.24 / 0.62 |
| RU→TR | 0.11 / 0.49 | 0.16 / 0.56 | 0.16 / 0.55 | 0.18 / 0.57 | 0.16 / 0.55 | 0.22 / 0.60 |
| TR→EN | 0.31 / 0.61 | 0.38 / 0.67 | 0.32 / 0.63 | 0.38 / 0.66 | 0.36 / 0.66 | 0.37 / 0.66 |
| TR→KK | 0.08 / 0.46 | 0.14 / 0.53 | 0.14 / 0.52 | 0.16 / 0.55 | 0.14 / 0.53 | 0.19 / 0.57 |
| TR→RU | 0.17 / 0.50 | 0.23 / 0.56 | 0.20 / 0.54 | 0.24 / 0.57 | 0.23 / 0.57 | 0.26 / 0.58 |
| Average | 0.20 / 0.55 | 0.27 / 0.61 | 0.23 / 0.59 | 0.27 / 0.62 | 0.26 / 0.61 | 0.30 / 0.63 |

BLEU / chrF scores for the models on the KazParC test set

After a comprehensive analysis of both qualitative and quantitative outcomes, we have found that the 'parsync' model, which was fine-tuned on a mix of the KazParC corpus and synthetic data, emerged as the top-performing model. Let us simply call this model Tilmash, a Kazakh term that means 'interpreter' or 'translator'.

| Pair | BLEU (base) | BLEU (Tilmash) | chrF (base) | chrF (Tilmash) |
|---|---|---|---|---|
| DE→FR | 0.33 | 0.28 | 0.61 | 0.58 |
| FR→DE | 0.22 | 0.19 | 0.55 | 0.53 |
| DE→UK | 0.15 | 0.04 | 0.49 | 0.36 |
| UK→DE | 0.19 | 0.16 | 0.53 | 0.50 |
| FR→UZ | 0.06 | 0.02 | 0.48 | 0.31 |
| UZ→FR | 0.25 | 0.22 | 0.56 | 0.53 |

Results of the base and Tilmash models on the control language pairs on the FLoRes test set

| Pair | Type | Text | BLEU | chrF |
|---|---|---|---|---|
| KK→EN | source | Ыстық және желді. (Ystyq jane jeldi.) | | |
| | reference | It is hot and windy. | 1.00 | 1.00 |
| | Tilmash | It's hot and windy. | 0.55 | 0.81 |
| | Yandex | Hot and windy. | 0.00 | 0.66 |
| | Google | Hot and windy. | 0.00 | 0.66 |
| KK→EN | source | 1 қыркүйекте бесінші ана өлімі тіркелді. (1 qyrkuiekte besinshi ana olimi tirkeldi.) | | |
| | reference | On September 1, the fifth maternal death was registered. | 1.00 | 1.00 |
| | Tilmash | A fifth maternal death was recorded on 1 September. | 0.27 | 0.63 |
| | Yandex | On September 1, the fifth maternal death was registered. | 1.00 | 1.00 |
| | Google | On September 1, the fifth maternal death was recorded. | 0.81 | 0.86 |

A selection of translation outputs from Tilmash, Yandex, and Google

Below are the detailed tables of Tilmash, Yandex, and Google results per domain.

EDUCATION AND SCIENCE (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.23 / 0.63 | 0.19 / 0.61 | 0.44 / 0.73 |
| EN→RU | 0.39 / 0.74 | 0.39 / 0.76 | 0.43 / 0.78 |
| EN→TR | 0.33 / 0.71 | 0.37 / 0.74 | 0.47 / 0.79 |
| KK→EN | 0.28 / 0.64 | 0.27 / 0.63 | 0.32 / 0.66 |
| KK→RU | 0.26 / 0.66 | 0.26 / 0.66 | 0.32 / 0.66 |
| KK→TR | 0.20 / 0.60 | 0.15 / 0.57 | 0.29 / 0.66 |
| RU→EN | 0.38 / 0.73 | 0.40 / 0.75 | 0.40 / 0.76 |
| RU→KK | 0.21 / 0.64 | 0.22 / 0.65 | 0.30 / 0.67 |
| RU→TR | 0.24 / 0.65 | 0.22 / 0.65 | 0.33 / 0.70 |
| TR→EN | 0.38 / 0.70 | 0.38 / 0.70 | 0.40 / 0.71 |
| TR→KK | 0.19 / 0.58 | 0.17 / 0.56 | 0.29 / 0.64 |
| TR→RU | 0.27 / 0.63 | 0.29 / 0.65 | 0.33 / 0.68 |

FICTION (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.13 / 0.51 | 0.15 / 0.52 | 0.19 / 0.53 |
| EN→RU | 0.35 / 0.64 | 0.34 / 0.66 | 0.37 / 0.66 |
| EN→TR | 0.28 / 0.62 | 0.29 / 0.63 | 0.53 / 0.74 |
| KK→EN | 0.29 / 0.57 | 0.24 / 0.54 | 0.29 / 0.58 |
| KK→RU | 0.25 / 0.58 | 0.23 / 0.55 | 0.25 / 0.57 |
| KK→TR | 0.26 / 0.62 | 0.18 / 0.56 | 0.50 / 0.77 |
| RU→EN | 0.40 / 0.66 | 0.41 / 0.67 | 0.42 / 0.68 |
| RU→KK | 0.17 / 0.55 | 0.19 / 0.56 | 0.16 / 0.55 |
| RU→TR | 0.22 / 0.59 | 0.17 / 0.55 | 0.36 / 0.67 |
| TR→EN | 0.36 / 0.63 | 0.35 / 0.62 | 0.37 / 0.64 |
| TR→KK | 0.15 / 0.55 | 0.16 / 0.55 | 0.19 / 0.58 |
| TR→RU | 0.24 / 0.56 | 0.24 / 0.56 | 0.26 / 0.58 |

GENERAL (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.26 / 0.68 | 0.17 / 0.62 | 0.45 / 0.77 |
| EN→RU | 0.46 / 0.76 | 0.44 / 0.77 | 0.48 / 0.79 |
| EN→TR | 0.12 / 0.54 | 0.12 / 0.54 | 0.12 / 0.55 |
| KK→EN | 0.39 / 0.68 | 0.29 / 0.64 | 0.33 / 0.65 |
| KK→RU | 0.32 / 0.68 | 0.29 / 0.66 | 0.30 / 0.66 |
| KK→TR | 0.10 / 0.52 | 0.08 / 0.47 | 0.11 / 0.51 |
| RU→EN | 0.45 / 0.74 | 0.39 / 0.71 | 0.38 / 0.70 |
| RU→KK | 0.22 / 0.66 | 0.18 / 0.63 | 0.22 / 0.65 |
| RU→TR | 0.11 / 0.52 | 0.09 / 0.49 | 0.09 / 0.51 |
| TR→EN | 0.32 / 0.62 | 0.27 / 0.59 | 0.28 / 0.60 |
| TR→KK | 0.14 / 0.55 | 0.10 / 0.50 | 0.16 / 0.56 |
| TR→RU | 0.22 / 0.57 | 0.18 / 0.57 | 0.21 / 0.58 |

LEGAL DOCUMENTS (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.27 / 0.67 | 0.28 / 0.67 | 0.29 / 0.68 |
| EN→RU | 0.48 / 0.75 | 0.46 / 0.76 | 0.47 / 0.76 |
| EN→TR | 0.22 / 0.64 | 0.23 / 0.64 | 0.25 / 0.55 |
| KK→EN | 0.41 / 0.69 | 0.34 / 0.65 | 0.36 / 0.66 |
| KK→RU | 0.47 / 0.77 | 0.45 / 0.76 | 0.38 / 0.71 |
| KK→TR | 0.11 / 0.54 | 0.11 / 0.53 | 0.13 / 0.54 |
| RU→EN | 0.52 / 0.76 | 0.52 / 0.76 | 0.51 / 0.76 |
| RU→KK | 0.37 / 0.74 | 0.38 / 0.75 | 0.33 / 0.71 |
| RU→TR | 0.14 / 0.57 | 0.13 / 0.56 | 0.15 / 0.58 |
| TR→EN | 0.46 / 0.72 | 0.39 / 0.69 | 0.43 / 0.70 |
| TR→KK | 0.18 / 0.58 | 0.15 / 0.56 | 0.18 / 0.58 |
| TR→RU | 0.29 / 0.63 | 0.22 / 0.59 | 0.27 / 0.61 |

MASS MEDIA (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.18 / 0.58 | 0.17 / 0.58 | 0.19 / 0.59 |
| EN→RU | 0.35 / 0.67 | 0.38 / 0.70 | 0.40 / 0.70 |
| EN→TR | 0.30 / 0.66 | 0.31 / 0.67 | 0.41 / 0.72 |
| KK→EN | 0.32 / 0.62 | 0.32 / 0.62 | 0.33 / 0.62 |
| KK→RU | 0.27 / 0.61 | 0.29 / 0.62 | 0.26 / 0.59 |
| KK→TR | 0.18 / 0.57 | 0.16 / 0.55 | 0.26 / 0.62 |
| RU→EN | 0.48 / 0.73 | 0.53 / 0.76 | 0.50 / 0.74 |
| RU→KK | 0.21 / 0.60 | 0.22 / 0.62 | 0.20 / 0.59 |
| RU→TR | 0.22 / 0.60 | 0.18 / 0.58 | 0.26 / 0.63 |
| TR→EN | 0.40 / 0.68 | 0.40 / 0.68 | 0.41 / 0.69 |
| TR→KK | 0.15 / 0.55 | 0.14 / 0.54 | 0.17 / 0.57 |
| TR→RU | 0.22 / 0.57 | 0.24 / 0.58 | 0.25 / 0.59 |

Using Tilmash 🚀

To translate text, you can utilise the predict.py script. To get started, make sure to download Tilmash from our Hugging Face repository. In the script, you will need to specify the source and target languages using the src and trg variables. You can choose from the following language values:

  • Kazakh: kaz_Cyrl
  • Russian: rus_Cyrl
  • English: eng_Latn
  • Turkish: tur_Latn

Once you have set the languages, simply input the text you want to translate into the text variable.
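The settings described above would look like this inside the script. The variable names src, trg, and text follow the description; the sample sentence is illustrative.

```python
# Language settings for predict.py, per the list of codes above.
src = "kaz_Cyrl"       # source language: Kazakh
trg = "eng_Latn"       # target language: English
text = "Сәлем, әлем!"  # the text to translate ("Hello, world!")
```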

Acknowledgements 🙏

We wish to convey our deep appreciation to the diligent group of translators whose exceptional contributions have been crucial to the successful realisation of this study. Their tireless efforts to ensure the accuracy and faithful rendition of the source materials have indeed proved invaluable. Our sincerest thanks go to the following esteemed individuals: Aigerim Baidauletova, Aigerim Boranbayeva, Ainagul Akmuldina, Aizhan Seipanova, Askhat Kenzhegulov, Assel Kospabayeva, Assel Mukhanova, Elmira Nikiforova, Gaukhar Rayanova, Gulim Kabidolda, Gulzhanat Abduldinova, Indira Yerkimbekova, Moldir Orazalinova, Saltanat Kemaliyeva, and Venera Spanbayeva.

Citation 🎓

We kindly urge you, if you incorporate our dataset and/or model into your work, to cite our paper as a gesture of recognition for its valuable contribution. The act of referencing the relevant sources not only upholds academic honesty but also ensures proper acknowledgement of the authors' efforts. Your citation in your research significantly contributes to the continuous progress and evolution of the scholarly realm. Your endorsement and acknowledgement of our endeavours are genuinely appreciated.

@misc{yeshpanov2024kazparc,
      title={KazParC: Kazakh Parallel Corpus for Machine Translation}, 
      author={Rustem Yeshpanov and Alina Polonskaya and Huseyin Atakan Varol},
      year={2024},
      eprint={2403.19399},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
