IS2AI / KazParC

An open-source parallel corpus for machine translation across Kazakh, English, Russian, and Turkish


🌐 KazParC 📝


This repository provides the dataset and the neural machine translation model, nicknamed Tilmash, for the paper
KazParC: Kazakh Parallel Corpus for Machine Translation

Domains ℹ️

We collected data for our Kazakh Parallel Corpus (referred to as KazParC) from a diverse range of textual sources in Kazakh, English, Russian, and Turkish.

We categorised the data acquired from these sources into five broad domains:

| Domain | # lines | % | # tokens (EN) | % | # tokens (KK) | % | # tokens (RU) | % | # tokens (TR) | % |
|---|---|---|---|---|---|---|---|---|---|---|
| Mass media | 120,547 | 32.4 | 1,817,276 | 28.3 | 1,340,346 | 28.6 | 1,454,430 | 29.0 | 1,311,985 | 28.5 |
| General | 94,988 | 25.5 | 844,541 | 13.1 | 578,236 | 12.3 | 618,960 | 12.3 | 608,020 | 13.2 |
| Legal documents | 77,183 | 20.8 | 2,650,626 | 41.3 | 1,925,561 | 41.0 | 1,991,222 | 39.7 | 1,880,081 | 40.8 |
| Education and science | 46,252 | 12.4 | 522,830 | 8.1 | 392,348 | 8.4 | 444,786 | 8.9 | 376,484 | 8.2 |
| Fiction | 32,932 | 8.9 | 589,001 | 9.2 | 456,385 | 9.7 | 510,168 | 10.2 | 433,968 | 9.4 |
| Total | 371,902 | 100 | 6,424,274 | 100 | 4,692,876 | 100 | 5,019,566 | 100 | 4,610,538 | 100 |

Data Collection 📅

We started the data collection process in July 2021, and it continued until September 2023. During this period, we collected a vast amount of text materials and their translations.

Our team of linguists played a crucial role in ensuring the quality of the data. They carefully reviewed the collected data, screening it for inappropriate content. The next step involved segmenting the data into individual sentences, with each sentence labelled with a domain identifier. We also paid close attention to grammar and spelling accuracy and removed any duplicate sentences.

Kazakh-Russian code-switching is a common practice in Kazakhstan, so we took steps to maintain uniformity. For sentences containing both Kazakh and Russian words, we initiated a modification process. This process involved translating the Russian elements into Kazakh while preserving the intended meaning of the sentences.

Data Pre-Processing 🧹

We organised the data into language pairs. We then carefully removed any unwanted characters and effectively replaced homoglyphs. We also took care of formatting issues by eliminating line breaks (\n) and carriage returns (\r). We identified and removed duplicate entries, making sure to filter out rows with identical text in both language columns. However, to make our corpus more diverse and include a broader range of synonyms for different words and expressions, we decided to keep lines with duplicate text within a single language column.
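The cleaning and deduplication steps above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the column names are assumptions, and homoglyph replacement is omitted because it is language-specific and the project's replacement table is not shown here.

```python
import re
import pandas as pd

def clean(text: str) -> str:
    # Strip line breaks (\n) and carriage returns (\r), then collapse whitespace.
    text = text.replace("\n", " ").replace("\r", " ")
    return re.sub(r"\s+", " ", text).strip()

def preprocess(df: pd.DataFrame, src: str, tgt: str) -> pd.DataFrame:
    df = df.copy()
    for col in (src, tgt):
        df[col] = df[col].map(clean)
    # Filter out rows with identical text in both language columns, then drop
    # duplicate pairs. Duplicates *within* a single column are deliberately
    # kept, to preserve synonym variety as described above.
    df = df[df[src] != df[tgt]].drop_duplicates(subset=[src, tgt])
    return df.reset_index(drop=True)
```

A run over a toy two-column frame removes the identical-text row and the duplicated pair while leaving the cleaned sentence intact.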

The table below gives statistics for the language pairs in our corpus. The column labelled '# lines' shows the total number of rows for each language pair. The columns labelled '# sents', '# tokens', and '# types' give the counts of unique sentences, tokens, and word types: in each of these, the first number corresponds to the first language in the pair and the second number to the second language. Token and type counts were obtained after processing the data with Moses Tokenizer 1.2.1.

| Pair | # lines | # sents | # tokens | # types |
|---|---|---|---|---|
| KK↔EN | 363,594 | 362,230 / 361,087 | 4,670,789 / 6,393,381 | 184,258 / 59,062 |
| KK↔RU | 363,482 | 362,230 / 362,748 | 4,670,593 / 4,996,031 | 184,258 / 183,204 |
| KK↔TR | 362,150 | 362,230 / 361,660 | 4,668,852 / 4,586,421 | 184,258 / 175,145 |
| EN↔RU | 363,456 | 361,087 / 362,748 | 6,392,301 / 4,994,310 | 59,062 / 183,204 |
| EN↔TR | 362,392 | 361,087 / 361,660 | 6,380,703 / 4,579,375 | 59,062 / 175,145 |
| RU↔TR | 363,324 | 362,748 / 361,660 | 4,999,850 / 4,591,847 | 183,204 / 175,145 |

Data Splitting ✂️

We began by creating a test set. To do this, we randomly selected 250 unique, non-repeating rows from each of the sources outlined in Domains. The remaining data for each language pair were then divided into training and validation sets with an 80/20 split, while preserving the distribution of domains in both sets.
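The domain-stratified 80/20 split can be sketched with pandas. This is a sketch under assumptions: the "domain" column name follows the corpus files, and the seed is arbitrary.

```python
import pandas as pd

def stratified_split(df: pd.DataFrame, valid_frac: float = 0.2, seed: int = 42):
    # Sample the validation fraction *within* each domain, so that the domain
    # distribution is preserved in both the training and validation sets.
    valid = df.groupby("domain", group_keys=False).sample(
        frac=valid_frac, random_state=seed
    )
    train = df.drop(valid.index)
    return train.reset_index(drop=True), valid.reset_index(drop=True)
```

Sampling per group (rather than over the whole frame) is what keeps each domain's share identical across the two splits.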

| Pair | Train lines | Train sents | Train tokens | Train types | Valid lines | Valid sents | Valid tokens | Valid types | Test lines | Test sents | Test tokens | Test types |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KK↔EN | 290,877 | 286,958 / 286,197 | 3,693,263 / 5,057,687 | 164,766 / 54,311 | 72,719 | 72,426 / 72,403 | 920,482 / 1,259,827 | 83,057 / 32,063 | 4,750 | 4,750 / 4,750 | 57,044 / 75,867 | 17,475 / 9,729 |
| KK↔RU | 290,785 | 286,943 / 287,215 | 3,689,799 / 3,945,741 | 164,995 / 165,882 | 72,697 | 72,413 / 72,439 | 923,750 / 988,374 | 82,958 / 87,519 | 4,750 | 4,750 / 4,750 | 57,044 / 61,916 | 17,475 / 18,804 |
| KK↔TR | 289,720 | 286,694 / 286,279 | 3,691,751 / 3,626,361 | 164,961 / 157,460 | 72,430 | 72,211 / 72,190 | 920,057 / 904,199 | 82,698 / 80,885 | 4,750 | 4,750 / 4,750 | 57,044 / 55,861 | 17,475 / 17,284 |
| EN↔RU | 290,764 | 286,185 / 287,261 | 5,058,530 / 3,950,362 | 54,322 / 165,701 | 72,692 | 72,377 / 72,427 | 1,257,904 / 982,032 | 32,208 / 87,541 | 4,750 | 4,750 / 4,750 | 75,867 / 61,916 | 9,729 / 18,804 |
| EN↔TR | 289,913 | 285,967 / 286,288 | 5,048,274 / 3,621,531 | 54,224 / 157,369 | 72,479 | 72,220 / 72,219 | 1,256,562 / 901,983 | 32,269 / 80,838 | 4,750 | 4,750 / 4,750 | 75,867 / 55,861 | 9,729 / 17,284 |
| RU↔TR | 290,899 | 287,241 / 286,475 | 3,947,809 / 3,626,436 | 165,482 / 157,470 | 72,725 | 72,455 / 72,362 | 990,125 / 909,550 | 87,831 / 80,962 | 4,750 | 4,750 / 4,750 | 61,916 / 55,861 | 18,804 / 17,284 |

Synthetic Corpus 🧪

To make our parallel corpus more extensive and diverse and to explore how well our translation models perform when dealing with a combination of human-translated and machine-translated content, we carried out web crawling to gather a total of 1,797,066 sentences from English-language websites. These sentences were then automatically translated into Kazakh, Russian, and Turkish using the Google Translate service. In the context of our research, we refer to this collection of data as 'SynC' (Synthetic Corpus).

| Pair | # lines | # sents | # tokens | # types |
|---|---|---|---|---|
| KK↔EN | 1,787,050 | 1,782,192 / 1,781,019 | 26,630,960 / 35,291,705 | 685,135 / 300,556 |
| KK↔RU | 1,787,448 | 1,782,192 / 1,777,500 | 26,654,195 / 30,241,895 | 685,135 / 672,146 |
| KK↔TR | 1,791,425 | 1,782,192 / 1,782,257 | 26,726,439 / 27,865,860 | 685,135 / 656,294 |
| EN↔RU | 1,784,513 | 1,781,019 / 1,777,500 | 35,244,800 / 30,175,611 | 300,556 / 672,146 |
| EN↔TR | 1,788,564 | 1,781,019 / 1,782,257 | 35,344,188 / 27,806,708 | 300,556 / 656,294 |
| RU↔TR | 1,788,027 | 1,777,500 / 1,782,257 | 30,269,083 / 27,816,210 | 672,146 / 656,294 |

We further divided the synthetic corpus into training and validation sets with a 90/10 ratio.

| Pair | Train lines | Train sents | Train tokens | Train types | Valid lines | Valid sents | Valid tokens | Valid types |
|---|---|---|---|---|---|---|---|---|
| KK↔EN | 1,608,345 | 1,604,414 / 1,603,426 | 23,970,260 / 31,767,617 | 650,144 / 286,372 | 178,705 | 178,654 / 178,639 | 2,660,700 / 3,524,088 | 208,838 / 105,517 |
| KK↔RU | 1,608,703 | 1,604,468 / 1,600,643 | 23,992,148 / 27,221,583 | 650,170 / 642,604 | 178,745 | 178,691 / 178,642 | 2,662,047 / 3,020,312 | 209,188 / 235,642 |
| KK↔TR | 1,612,282 | 1,604,793 / 1,604,822 | 24,053,671 / 25,078,688 | 650,384 / 626,724 | 179,143 | 179,057 / 179,057 | 2,672,768 / 2,787,172 | 209,549 / 221,773 |
| EN↔RU | 1,606,061 | 1,603,199 / 1,600,372 | 31,719,781 / 27,158,101 | 286,645 / 642,686 | 178,452 | 178,419 / 178,379 | 3,525,019 / 3,017,510 | 104,834 / 235,069 |
| EN↔TR | 1,609,707 | 1,603,636 / 1,604,545 | 31,805,393 / 25,022,782 | 286,387 / 626,740 | 178,857 | 178,775 / 178,796 | 3,538,795 / 2,783,926 | 105,641 / 221,372 |
| RU↔TR | 1,609,224 | 1,600,605 / 1,604,521 | 27,243,278 / 25,035,274 | 642,797 / 626,587 | 178,803 | 178,695 / 178,750 | 3,025,805 / 2,780,936 | 235,970 / 221,792 |

Data Vectorisation 🧮

The data underwent vectorisation using HuggingFace's transformers and datasets libraries. Each language pair was vectorised individually based on the source and target languages within the pair. The vectorised data sets were then combined into unified training and validation sets, each covering all six language pairs in both translation directions. For more details, see data_tokenization.ipynb.
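The idea of combining per-pair data into one bidirectional set can be illustrated in plain Python. This is a conceptual sketch only; the field names are illustrative, and the project's actual implementation (which uses HuggingFace datasets) is in data_tokenization.ipynb.

```python
# The six language pairs of the corpus.
PAIRS = [("kk", "en"), ("kk", "ru"), ("kk", "tr"),
         ("en", "ru"), ("en", "tr"), ("ru", "tr")]

def bidirectional_examples(rows_by_pair):
    """rows_by_pair maps (src, tgt) -> list of (src_text, tgt_text) tuples."""
    examples = []
    for src, tgt in PAIRS:
        for s, t in rows_by_pair.get((src, tgt), []):
            # Emit each sentence pair in both translation directions, so a
            # single model can be trained to translate either way.
            examples.append({"src_lang": src, "tgt_lang": tgt, "src": s, "tgt": t})
            examples.append({"src_lang": tgt, "tgt_lang": src, "src": t, "tgt": s})
    return examples
```

Each stored pair thus contributes two training examples, one per direction.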

Corpus Structure 🗂️

The corpus is organised into two distinct groups based on their file prefixes. Files "01" through "19" have the "kazparc" prefix, while files "20" through "32" have the "sync" prefix.

├── kazparc
   ├── 01_kazparc_all_entries.csv
   ├── 02_kazparc_train_kk_en.csv
   ├── 03_kazparc_train_kk_ru.csv
   ├── 04_kazparc_train_kk_tr.csv
   ├── 05_kazparc_train_en_ru.csv
   ├── 06_kazparc_train_en_tr.csv
   ├── 07_kazparc_train_ru_tr.csv
   ├── 08_kazparc_valid_kk_en.csv
   ├── 09_kazparc_valid_kk_ru.csv
   ├── 10_kazparc_valid_kk_tr.csv
   ├── 11_kazparc_valid_en_ru.csv
   ├── 12_kazparc_valid_en_tr.csv
   ├── 13_kazparc_valid_ru_tr.csv
   ├── 14_kazparc_test_kk_en.csv
   ├── 15_kazparc_test_kk_ru.csv
   ├── 16_kazparc_test_kk_tr.csv
   ├── 17_kazparc_test_en_ru.csv
   ├── 18_kazparc_test_en_tr.csv
   ├── 19_kazparc_test_ru_tr.csv
├── sync
   ├── 20_sync_all_entries.csv
   ├── 21_sync_train_kk_en.csv
   ├── 22_sync_train_kk_ru.csv
   ├── 23_sync_train_kk_tr.csv
   ├── 24_sync_train_en_ru.csv
   ├── 25_sync_train_en_tr.csv
   ├── 26_sync_train_ru_tr.csv
   ├── 27_sync_valid_kk_en.csv
   ├── 28_sync_valid_kk_ru.csv
   ├── 29_sync_valid_kk_tr.csv
   ├── 30_sync_valid_en_ru.csv
   ├── 31_sync_valid_en_tr.csv
   ├── 32_sync_valid_ru_tr.csv

KazParC files:

  • File "01" contains the original, unprocessed text data for the four languages considered within KazParC.
  • Files "02" through "19" represent pre-processed texts divided into language pairs for training (Files "02" to "07"), validation (Files "08" to "13"), and testing (Files "14" to "19"). Language pairs are indicated within the filenames using two-letter language codes (e.g., kk_en).

SynC files:

  • File "20" contains raw, unprocessed text data for the four languages.
  • Files "21" to "32" contain pre-processed text divided into language pairs for training (Files "21" to "26") and validation (Files "27" to "32") purposes.

In both "01" and "20", each line consists of a unique line identifier (id), texts in Kazakh (kk), English (en), Russian (ru), and Turkish (tr), along with accompanying domain information (domain). For the other files, the data fields are id, source_lang, target_lang, domain, and the language pair (e.g., kk_en).

Experimental Setup 🔬

In our study, we used Facebook's NLLB model, which supports translation for a wide range of languages, including Kazakh, English, Russian, and Turkish. We initially tested two versions, the baseline and the distilled model, fine-tuning both on KazParC data. The distilled model consistently outperformed the baseline, though by a small margin of just 0.01 BLEU. Consequently, we focused our subsequent experiments exclusively on fine-tuning the distilled model.

We trained a total of four models:

  1. 'base', the off-the-shelf model.
  2. 'parc', fine-tuned on KazParC data.
  3. 'sync', fine-tuned on SynC data.
  4. 'parsync', fine-tuned on both KazParC and SynC data.

We fine-tuned these models using hyperparameters tuned with validation sets. We included synthetic data in the validation sets only when assessing the performance of the 'sync' and 'parsync' models. The best-performing models were then evaluated on the test sets.

In addition to the KazParC test set, we used the FLoRes dataset. We merged the dev and devtest sets from FLoRes into one set for our evaluation. We also explored language pairs, such as German-French, German-Ukrainian, and French-Uzbek, to assess how fine-tuning the model affected translation quality for different language pairs.

All the models were fine-tuned using eight GPUs on an NVIDIA DGX A100 machine. We set an initial learning rate of 2 × 10⁻⁵ and used the AdaFactor optimisation algorithm. Training spanned three epochs, with both the training and evaluation batch sizes set to 8. To start training the model, create a virtual environment and install the necessary requirements from the environment.yaml file:

conda create --name kazparc python=3.8.17
conda env update --name kazparc --file environment.yaml

Once you have completed the above steps, you are ready to run the train.py script using the command:

python3 -m torch.distributed.launch --nproc_per_node 8 --nnodes 1 train.py
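For reference, the hyperparameters above can be collected in one place. The key names mirror those of transformers' Seq2SeqTrainingArguments; whether train.py uses exactly these names is an assumption.

```python
# Training configuration described above (key names follow the
# transformers convention; this is a reference sketch, not train.py itself).
hparams = {
    "learning_rate": 2e-5,            # initial learning rate
    "optim": "adafactor",             # AdaFactor optimiser
    "num_train_epochs": 3,            # three epochs
    "per_device_train_batch_size": 8, # training batch size
    "per_device_eval_batch_size": 8,  # evaluation batch size
}
```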

Evaluation Metrics 📏

In our evaluation of machine translation models, we used two widely recognised metrics:

  1. BLEU measures how closely machine-produced translations match human references, based on modified n-gram precision over n-grams of up to four words.
  2. chrF evaluates translation quality using character n-grams, making it well suited to morphologically rich languages such as Kazakh and Turkish. It computes the harmonic mean of character-based precision and recall, offering a robust evaluation of translation performance.

Both BLEU and chrF scores range from 0 to 1, where higher scores indicate better translation quality.
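To illustrate the idea behind chrF, here is a deliberately simplified, single-order version. The real metric averages over several n-gram orders and treats whitespace and word n-grams differently; use an established library such as sacrebleu for actual evaluation.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # Multiset of overlapping character n-grams.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis: str, reference: str, n: int = 3, beta: float = 2.0) -> float:
    # F_beta over character n-grams of a single order n; beta = 2 weights
    # recall twice as heavily as precision, as in chrF.
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

An exact match scores 1.0, no shared character trigrams scores 0.0, and partial overlaps fall in between.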

Experiment Results 📈

We translated the test sets using the translate_test_set.py script and computed the BLEU and chrF metrics with evaluation.ipynb. Below are the results obtained for our models, alongside the Yandex and Google translation services, on the FLoRes and KazParC test sets.

| Pair | base | parc | sync | parsync | Yandex | Google |
|---|---|---|---|---|---|---|
| EN→KK | 0.11 / 0.49 | 0.14 / 0.56 | 0.20 / 0.60 | 0.20 / 0.60 | 0.18 / 0.58 | 0.20 / 0.60 |
| EN→RU | 0.25 / 0.56 | 0.26 / 0.58 | 0.28 / 0.60 | 0.28 / 0.60 | 0.32 / 0.63 | 0.31 / 0.62 |
| EN→TR | 0.19 / 0.58 | 0.22 / 0.61 | 0.27 / 0.65 | 0.27 / 0.65 | 0.29 / 0.66 | 0.30 / 0.66 |
| KK→EN | 0.28 / 0.59 | 0.32 / 0.62 | 0.31 / 0.62 | 0.32 / 0.63 | 0.30 / 0.62 | 0.36 / 0.65 |
| KK→RU | 0.15 / 0.49 | 0.17 / 0.51 | 0.18 / 0.52 | 0.18 / 0.52 | 0.18 / 0.52 | 0.20 / 0.53 |
| KK→TR | 0.09 / 0.48 | 0.13 / 0.52 | 0.14 / 0.54 | 0.14 / 0.54 | 0.12 / 0.52 | 0.17 / 0.56 |
| RU→EN | 0.31 / 0.62 | 0.32 / 0.63 | 0.32 / 0.63 | 0.32 / 0.63 | 0.33 / 0.64 | 0.35 / 0.65 |
| RU→KK | 0.08 / 0.49 | 0.10 / 0.52 | 0.13 / 0.53 | 0.13 / 0.54 | 0.12 / 0.54 | 0.13 / 0.54 |
| RU→TR | 0.10 / 0.49 | 0.12 / 0.52 | 0.14 / 0.54 | 0.14 / 0.54 | 0.13 / 0.54 | 0.17 / 0.56 |
| TR→EN | 0.34 / 0.64 | 0.35 / 0.65 | 0.36 / 0.66 | 0.36 / 0.66 | 0.38 / 0.67 | 0.39 / 0.67 |
| TR→KK | 0.07 / 0.45 | 0.10 / 0.51 | 0.13 / 0.54 | 0.13 / 0.54 | 0.12 / 0.53 | 0.13 / 0.54 |
| TR→RU | 0.15 / 0.48 | 0.17 / 0.51 | 0.18 / 0.52 | 0.19 / 0.53 | 0.20 / 0.54 | 0.21 / 0.54 |
| Average | 0.18 / 0.53 | 0.20 / 0.56 | 0.22 / 0.58 | 0.22 / 0.58 | 0.23 / 0.58 | 0.25 / 0.59 |

BLEU / chrF scores for the models on the FLoRes test set

| Pair | base | parc | sync | parsync | Yandex | Google |
|---|---|---|---|---|---|---|
| EN→KK | 0.12 / 0.51 | 0.18 / 0.58 | 0.18 / 0.58 | 0.21 / 0.60 | 0.18 / 0.58 | 0.30 / 0.65 |
| EN→RU | 0.31 / 0.64 | 0.38 / 0.68 | 0.35 / 0.66 | 0.38 / 0.68 | 0.39 / 0.70 | 0.41 / 0.71 |
| EN→TR | 0.19 / 0.59 | 0.22 / 0.62 | 0.25 / 0.63 | 0.25 / 0.64 | 0.27 / 0.64 | 0.34 / 0.68 |
| KK→EN | 0.24 / 0.55 | 0.33 / 0.62 | 0.24 / 0.57 | 0.32 / 0.62 | 0.28 / 0.60 | 0.31 / 0.62 |
| KK→RU | 0.22 / 0.56 | 0.29 / 0.63 | 0.24 / 0.59 | 0.29 / 0.63 | 0.29 / 0.63 | 0.29 / 0.61 |
| KK→TR | 0.10 / 0.47 | 0.15 / 0.54 | 0.14 / 0.52 | 0.16 / 0.55 | 0.13 / 0.52 | 0.23 / 0.59 |
| RU→EN | 0.34 / 0.63 | 0.43 / 0.71 | 0.34 / 0.65 | 0.42 / 0.70 | 0.43 / 0.71 | 0.42 / 0.71 |
| RU→KK | 0.15 / 0.55 | 0.21 / 0.61 | 0.18 / 0.58 | 0.22 / 0.62 | 0.23 / 0.62 | 0.24 / 0.62 |
| RU→TR | 0.11 / 0.49 | 0.16 / 0.56 | 0.16 / 0.55 | 0.18 / 0.57 | 0.16 / 0.55 | 0.22 / 0.60 |
| TR→EN | 0.31 / 0.61 | 0.38 / 0.67 | 0.32 / 0.63 | 0.38 / 0.66 | 0.36 / 0.66 | 0.37 / 0.66 |
| TR→KK | 0.08 / 0.46 | 0.14 / 0.53 | 0.14 / 0.52 | 0.16 / 0.55 | 0.14 / 0.53 | 0.19 / 0.57 |
| TR→RU | 0.17 / 0.50 | 0.23 / 0.56 | 0.20 / 0.54 | 0.24 / 0.57 | 0.23 / 0.57 | 0.26 / 0.58 |
| Average | 0.20 / 0.55 | 0.27 / 0.61 | 0.23 / 0.59 | 0.27 / 0.62 | 0.26 / 0.61 | 0.30 / 0.63 |

BLEU / chrF scores for the models on the KazParC test set

After a comprehensive analysis of both qualitative and quantitative outcomes, we have found that the 'parsync' model, which was fine-tuned on a mix of the KazParC corpus and synthetic data, emerged as the top-performing model. Let us simply call this model Tilmash, a Kazakh term that means 'interpreter' or 'translator'.

| Pair | BLEU (base) | BLEU (Tilmash) | chrF (base) | chrF (Tilmash) |
|---|---|---|---|---|
| DE→FR | 0.33 | 0.28 | 0.61 | 0.58 |
| FR→DE | 0.22 | 0.19 | 0.55 | 0.53 |
| DE→UK | 0.15 | 0.04 | 0.49 | 0.36 |
| UK→DE | 0.19 | 0.16 | 0.53 | 0.50 |
| FR→UZ | 0.06 | 0.02 | 0.48 | 0.31 |
| UZ→FR | 0.25 | 0.22 | 0.56 | 0.53 |

Results of the base and Tilmash models on the control language pairs on the FLoRes test set

| Pair | Type | Text | BLEU | chrF |
|---|---|---|---|---|
| KK→EN | source | Ыстық және желді. (Ystyq jane jeldi.) | | |
| | reference | It is hot and windy. | 1.00 | 1.00 |
| | Tilmash | It's hot and windy. | 0.55 | 0.81 |
| | Yandex | Hot and windy. | 0.00 | 0.66 |
| | Google | Hot and windy. | 0.00 | 0.66 |
| KK→EN | source | 1 қыркүйекте бесінші ана өлімі тіркелді. (1 qyrkuiekte besinshi ana olimi tirkeldi.) | | |
| | reference | On September 1, the fifth maternal death was registered. | 1.00 | 1.00 |
| | Tilmash | A fifth maternal death was recorded on 1 September. | 0.27 | 0.63 |
| | Yandex | On September 1, the fifth maternal death was registered. | 1.00 | 1.00 |
| | Google | On September 1, the fifth maternal death was recorded. | 0.81 | 0.86 |

A selection of translation outputs from Tilmash, Yandex, and Google

Below are the detailed tables of Tilmash, Yandex, and Google results per domain.

EDUCATION AND SCIENCE (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.23 / 0.63 | 0.19 / 0.61 | 0.44 / 0.73 |
| EN→RU | 0.39 / 0.74 | 0.39 / 0.76 | 0.43 / 0.78 |
| EN→TR | 0.33 / 0.71 | 0.37 / 0.74 | 0.47 / 0.79 |
| KK→EN | 0.28 / 0.64 | 0.27 / 0.63 | 0.32 / 0.66 |
| KK→RU | 0.26 / 0.66 | 0.26 / 0.66 | 0.32 / 0.66 |
| KK→TR | 0.20 / 0.60 | 0.15 / 0.57 | 0.29 / 0.66 |
| RU→EN | 0.38 / 0.73 | 0.40 / 0.75 | 0.40 / 0.76 |
| RU→KK | 0.21 / 0.64 | 0.22 / 0.65 | 0.30 / 0.67 |
| RU→TR | 0.24 / 0.65 | 0.22 / 0.65 | 0.33 / 0.70 |
| TR→EN | 0.38 / 0.70 | 0.38 / 0.70 | 0.40 / 0.71 |
| TR→KK | 0.19 / 0.58 | 0.17 / 0.56 | 0.29 / 0.64 |
| TR→RU | 0.27 / 0.63 | 0.29 / 0.65 | 0.33 / 0.68 |

FICTION (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.13 / 0.51 | 0.15 / 0.52 | 0.19 / 0.53 |
| EN→RU | 0.35 / 0.64 | 0.34 / 0.66 | 0.37 / 0.66 |
| EN→TR | 0.28 / 0.62 | 0.29 / 0.63 | 0.53 / 0.74 |
| KK→EN | 0.29 / 0.57 | 0.24 / 0.54 | 0.29 / 0.58 |
| KK→RU | 0.25 / 0.58 | 0.23 / 0.55 | 0.25 / 0.57 |
| KK→TR | 0.26 / 0.62 | 0.18 / 0.56 | 0.50 / 0.77 |
| RU→EN | 0.40 / 0.66 | 0.41 / 0.67 | 0.42 / 0.68 |
| RU→KK | 0.17 / 0.55 | 0.19 / 0.56 | 0.16 / 0.55 |
| RU→TR | 0.22 / 0.59 | 0.17 / 0.55 | 0.36 / 0.67 |
| TR→EN | 0.36 / 0.63 | 0.35 / 0.62 | 0.37 / 0.64 |
| TR→KK | 0.15 / 0.55 | 0.16 / 0.55 | 0.19 / 0.58 |
| TR→RU | 0.24 / 0.56 | 0.24 / 0.56 | 0.26 / 0.58 |

GENERAL (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.26 / 0.68 | 0.17 / 0.62 | 0.45 / 0.77 |
| EN→RU | 0.46 / 0.76 | 0.44 / 0.77 | 0.48 / 0.79 |
| EN→TR | 0.12 / 0.54 | 0.12 / 0.54 | 0.12 / 0.55 |
| KK→EN | 0.39 / 0.68 | 0.29 / 0.64 | 0.33 / 0.65 |
| KK→RU | 0.32 / 0.68 | 0.29 / 0.66 | 0.30 / 0.66 |
| KK→TR | 0.10 / 0.52 | 0.08 / 0.47 | 0.11 / 0.51 |
| RU→EN | 0.45 / 0.74 | 0.39 / 0.71 | 0.38 / 0.70 |
| RU→KK | 0.22 / 0.66 | 0.18 / 0.63 | 0.22 / 0.65 |
| RU→TR | 0.11 / 0.52 | 0.09 / 0.49 | 0.09 / 0.51 |
| TR→EN | 0.32 / 0.62 | 0.27 / 0.59 | 0.28 / 0.60 |
| TR→KK | 0.14 / 0.55 | 0.10 / 0.50 | 0.16 / 0.56 |
| TR→RU | 0.22 / 0.57 | 0.18 / 0.57 | 0.21 / 0.58 |

LEGAL DOCUMENTS (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.27 / 0.67 | 0.28 / 0.67 | 0.29 / 0.68 |
| EN→RU | 0.48 / 0.75 | 0.46 / 0.76 | 0.47 / 0.76 |
| EN→TR | 0.22 / 0.64 | 0.23 / 0.64 | 0.25 / 0.55 |
| KK→EN | 0.41 / 0.69 | 0.34 / 0.65 | 0.36 / 0.66 |
| KK→RU | 0.47 / 0.77 | 0.45 / 0.76 | 0.38 / 0.71 |
| KK→TR | 0.11 / 0.54 | 0.11 / 0.53 | 0.13 / 0.54 |
| RU→EN | 0.52 / 0.76 | 0.52 / 0.76 | 0.51 / 0.76 |
| RU→KK | 0.37 / 0.74 | 0.38 / 0.75 | 0.33 / 0.71 |
| RU→TR | 0.14 / 0.57 | 0.13 / 0.56 | 0.15 / 0.58 |
| TR→EN | 0.46 / 0.72 | 0.39 / 0.69 | 0.43 / 0.70 |
| TR→KK | 0.18 / 0.58 | 0.15 / 0.56 | 0.18 / 0.58 |
| TR→RU | 0.29 / 0.63 | 0.22 / 0.59 | 0.27 / 0.61 |

MASS MEDIA (BLEU / chrF)

| Pair | Tilmash | Yandex | Google |
|---|---|---|---|
| EN→KK | 0.18 / 0.58 | 0.17 / 0.58 | 0.19 / 0.59 |
| EN→RU | 0.35 / 0.67 | 0.38 / 0.70 | 0.40 / 0.70 |
| EN→TR | 0.30 / 0.66 | 0.31 / 0.67 | 0.41 / 0.72 |
| KK→EN | 0.32 / 0.62 | 0.32 / 0.62 | 0.33 / 0.62 |
| KK→RU | 0.27 / 0.61 | 0.29 / 0.62 | 0.26 / 0.59 |
| KK→TR | 0.18 / 0.57 | 0.16 / 0.55 | 0.26 / 0.62 |
| RU→EN | 0.48 / 0.73 | 0.53 / 0.76 | 0.50 / 0.74 |
| RU→KK | 0.21 / 0.60 | 0.22 / 0.62 | 0.20 / 0.59 |
| RU→TR | 0.22 / 0.60 | 0.18 / 0.58 | 0.26 / 0.63 |
| TR→EN | 0.40 / 0.68 | 0.40 / 0.68 | 0.41 / 0.69 |
| TR→KK | 0.15 / 0.55 | 0.14 / 0.54 | 0.17 / 0.57 |
| TR→RU | 0.22 / 0.57 | 0.24 / 0.58 | 0.25 / 0.59 |

Using Tilmash 🚀

To translate text, you can utilise the predict.py script. To get started, make sure to download Tilmash from our Hugging Face repository. In the script, you will need to specify the source and target languages using the src and trg variables. You can choose from the following language values:

  • Kazakh: kaz_Cyrl
  • Russian: rus_Cyrl
  • English: eng_Latn
  • Turkish: tur_Latn

Once you have set the languages, simply input the text you want to translate into the text variable.
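The settings described above would look like this inside the script. The variable names src, trg, and text follow the description; the sample sentence is illustrative.

```python
# Language settings for predict.py, per the list of codes above.
src = "kaz_Cyrl"       # source language: Kazakh
trg = "eng_Latn"       # target language: English
text = "Сәлем, әлем!"  # the text to translate ("Hello, world!")
```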

Acknowledgements 🙏

We wish to convey our deep appreciation to the diligent group of translators whose exceptional contributions have been crucial to the successful realisation of this study. Their tireless efforts to ensure the accuracy and faithful rendition of the source materials have indeed proved invaluable. Our sincerest thanks go to the following esteemed individuals: Aigerim Baidauletova, Aigerim Boranbayeva, Ainagul Akmuldina, Aizhan Seipanova, Askhat Kenzhegulov, Assel Kospabayeva, Assel Mukhanova, Elmira Nikiforova, Gaukhar Rayanova, Gulim Kabidolda, Gulzhanat Abduldinova, Indira Yerkimbekova, Moldir Orazalinova, Saltanat Kemaliyeva, and Venera Spanbayeva.

Citation 🎓

We kindly urge you, if you incorporate our dataset and/or model into your work, to cite our paper as a gesture of recognition for its valuable contribution. The act of referencing the relevant sources not only upholds academic honesty but also ensures proper acknowledgement of the authors' efforts. Your citation in your research significantly contributes to the continuous progress and evolution of the scholarly realm. Your endorsement and acknowledgement of our endeavours are genuinely appreciated.

@misc{yeshpanov2024kazparc,
      title={KazParC: Kazakh Parallel Corpus for Machine Translation}, 
      author={Rustem Yeshpanov and Alina Polonskaya and Huseyin Atakan Varol},
      year={2024},
      eprint={2403.19399},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
