This project uses a trie to automatically record and identify affixation (primarily suffixation) patterns in a given piece of text. We use these patterns to attempt to improve tokenization. We then apply the resulting tokenizer to Hindi and Telugu data to train a Machine Translation (MT) pipeline, and compare its performance against well-known tokenization methods.
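To illustrate the general idea, here is a minimal sketch of how a trie can surface recurring suffixes. It is not the TRIEceps implementation: the names `SuffixTrieNode`, `build_suffix_trie`, and `frequent_suffixes`, the `min_count` threshold, and the English sample words are all assumptions made for this example. Words are inserted in reverse, so each node corresponds to a suffix, and each node counts how many words share that suffix.

```python
from collections import defaultdict


class SuffixTrieNode:
    """Hypothetical trie node: one node per distinct suffix."""

    def __init__(self):
        self.children = defaultdict(SuffixTrieNode)
        self.count = 0  # number of inserted words that end with this suffix


def build_suffix_trie(words):
    """Insert each word reversed, so a root-to-node path spells a suffix backwards."""
    root = SuffixTrieNode()
    for word in words:
        node = root
        for ch in reversed(word):
            node = node.children[ch]
            node.count += 1
    return root


def frequent_suffixes(root, min_count=2):
    """Return {suffix: count} for every suffix shared by at least min_count words."""
    found = {}
    stack = [(root, "")]
    while stack:
        node, suffix = stack.pop()
        for ch, child in node.children.items():
            # Counts never increase deeper in the trie, so rare branches can be skipped.
            if child.count >= min_count:
                candidate = ch + suffix
                found[candidate] = child.count
                stack.append((child, candidate))
    return found


# Illustrative sample only; the project itself targets Hindi and Telugu text.
words = ["walking", "talking", "jumped", "walked"]
suffixes = frequent_suffixes(build_suffix_trie(words))
```

In this sketch every shared suffix is reported, including very short ones such as single characters, so a real system would need some further criterion for deciding which candidates are genuine affixes.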
Python >= 3.8
$ pip install --editable .
Fetching the data
$ cd TRIEceps/
$ gdown "https://drive.google.com/uc?id=1WHE0IgHm_oFNW3X1C_Ax0AeLozGPLREt"
$ unzip data.zip
Training MT pipeline on SentencePiece Baseline
$ bash train_mt.sh sentencepiece_bpe
Training MT pipeline on TRIEceps
$ bash train_mt.sh trieceps_candidacy
The processed data, checkpoints, and test outputs are saved to ./data/models/{model_type}/
The pretrained checkpoints and processed data can be found here. Evaluation files can be found here.