AI4Bharat / setu-translate


Setu-Translate: A Large-Scale Translation Pipeline

Setu-Translate uses IndicTrans2 (IT2) to perform large-scale translation between English and 22 Indic languages.

Currently, we provide inference support for the PyTorch and Flax versions of IT2. TPUs can be used for large-scale translation by leveraging the Flax port of IT2.

[Figure: Setu-Translate stages overview]

Table of Contents

  1. Overview
  2. Quickstart
  3. Usage

Overview

The Setu-Translate pipeline consists of six main stages:

  • Templating: Each dataset is input to the pipeline in Parquet format. During this stage, each entry in the dataset is converted into a Document object. During conversion, additional steps such as text cleaning, chunking, duplicate removal, and delimiter splitting are performed (a minimal sketch of this representation follows this list).

  • Global Sentence Dataset: During this stage, the templated data files are processed and flattened into a sentence-level dataset keyed by doc_id.

  • Binarize: During this stage, the sentences are preprocessed using the IndicProcessor and tokenized with the IndicTransTokenizer based on the source and target languages. We then pad the sequences and save the output in either NumPy (np) or PyTorch (pt) format.

  • Translate: This stage uses the IndicTrans2 model to translate the English sentences into the corresponding target Indic language. Translation can be run locally or on a TPU cluster for larger datasets.

  • Decode: This stage processes the model output and maps the generated token IDs back to their corresponding Indic text, producing the translated sentences.

  • Replace: During this stage, the translated sentences are substituted back into their original positions in the document to preserve its structure. This stage depends on the output of the templating stage.
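
To make the templating stage concrete, here is a minimal, hypothetical sketch of what a templated document record could look like. The Document class, its field names, and the cleaning/splitting rules below are illustrative assumptions, not setu-translate's actual schema.

# Hypothetical sketch of a templated document; field names and the
# cleaning/splitting rules are illustrative, not the pipeline's real schema.
from dataclasses import dataclass, field
import re
import uuid

@dataclass
class Document:
    text: str                               # cleaned source text
    doc_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    sentences: list[str] = field(default_factory=list)

def template(raw_text: str) -> Document:
    cleaned = re.sub(r"\s+", " ", raw_text).strip()      # basic text cleaning
    sents = re.split(r"(?<=[.!?])\s+", cleaned)          # naive delimiter splitting
    sents = list(dict.fromkeys(s for s in sents if s))   # drop duplicates, keep order
    return Document(text=cleaned, sentences=sents)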

Quickstart

  1. Clone the repository
git clone https://github.com/AI4Bharat/setu-translate.git
  2. Prepare the environment
conda create -n translate-env python=3.10
conda activate translate-env
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c conda-forge pyspark
conda install pip
pip install datasets transformers
  3. Install IndicTransTokenizer
cd IndicTransTokenizer

pip install --editable ./
  4. Install JAX and set up for TPU

Based on your setup (local or TPU), install the appropriate JAX libraries by following the JAX installation instructions.

Also download the Flax weights for IndicTrans2 and store them at setu-translate/stages/tlt_pipelines/flax/flax_weights/200m.
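
After installing, a quick way to confirm that JAX can see your accelerators (TPU cores on a TPU VM, otherwise GPUs or the CPU) is:

# Sanity-check the JAX installation; on a TPU VM this should list TpuDevice entries.
import jax
print(jax.devices())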

Usage

For a full run-through using a sample subset of the Wikipedia dataset, refer to the notebook. You can also run the stages individually using the commands below.

Templating Stage

HF_DATASETS_CACHE=/home/$USER/tmp python perform_templating.py \
    --glob_path "/home/$USER/setu-translate/examples/sample_data/wiki_en_data.parquet" \
    --cache_dir_for_original_data "/home/$USER/setu-translate/examples/cache" \
    --base_save_path "/home/$USER/setu-translate/examples/output/wiki_en/doc_csvs" \
    --save_path "/home/$USER/setu-translate/examples/output/wiki_en/templated" \
    --text_col body \
    --url_col url \
    --timestamp_col timestamp \
    --source_type wiki_en \
    --translation_type sentence \
    --use_cache False \
    --split "train[:5%]"

Global Sentence Dataset Stage

HF_DATASETS_CACHE=/home/$USER/tmp python create_global_ds.py \
    --paths_data "/home/$USER/setu-translate/examples/output/wiki_en/templated/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --global_sent_ds_path "/home/$USER/setu-translate/examples/output/wiki_en/sentences"
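
Conceptually, this stage explodes each templated document into one row per sentence while keeping the doc_id, so sentences can be stitched back together later. A rough sketch using Hugging Face datasets follows; the doc_id and sentences column names are assumptions for illustration.

# Hypothetical sketch: flatten templated documents into a sentence-level
# dataset. Column names are illustrative, not setu-translate's actual schema.
from datasets import load_dataset

docs = load_dataset("arrow", data_files="templated/*.arrow", split="train")

def explode(batch):
    rows = {"doc_id": [], "sentence_idx": [], "sentence": []}
    for doc_id, sents in zip(batch["doc_id"], batch["sentences"]):
        for i, sent in enumerate(sents):
            rows["doc_id"].append(doc_id)
            rows["sentence_idx"].append(i)
            rows["sentence"].append(sent)
    return rows

sentences = docs.map(explode, batched=True, remove_columns=docs.column_names)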

Binarize Dataset Stage

HF_DATASETS_CACHE=/home/$USER/tmp python binarize.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/sentences/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --binarized_dir "/home/$USER/setu-translate/examples/output/wiki_en/binarized_sentences" \
    --batch_size 2048 \
    --total_procs 1 \
    --padding max_length \
    --src_lang eng_Latn \
    --tgt_lang hin_Deva \
    --return_format np
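
Under the hood, binarization roughly follows the usage pattern documented for IndicTransTokenizer: normalize with IndicProcessor, then tokenize and pad. This is a sketch; the exact arguments used by binarize.py (e.g. padding to max_length, NumPy output) may differ.

# Sketch of the binarize step following the IndicTransTokenizer usage pattern.
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

ip = IndicProcessor(inference=True)
tokenizer = IndicTransTokenizer(direction="en-indic")

sentences = ["Translation pipelines scale well.", "The weather is nice today."]
batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva")
# The pipeline pads to max_length and can also emit NumPy arrays (--return_format np).
batch = tokenizer(batch, src=True, truncation=True,
                  padding="longest", return_tensors="pt")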

Translate Stage

HF_DATASETS_CACHE=/home/$USER/tmp python tlt_pipelines/translate_joblib.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/binarized_sentences/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --base_save_dir "/home/$USER/setu-translate/examples/output/wiki_en/model_out" \
    --joblib_temp_folder "/home/$USER/setu-translate/tmp" \
    --batch_size 512 \
    --total_procs 1 \
    --devices "0"

Decode Stage

HF_DATASETS_CACHE=/home/$USER/tmp python decode.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/model_out/*/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --decode_dir "/home/$USER/setu-translate/examples/output/wiki_en/decode" \
    --batch_size 64 \
    --total_procs 1 \
    --src_lang eng_Latn \
    --tgt_lang hin_Deva

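Continuing the sketches above, decoding maps the generated token IDs back to Indic text and undoes the IndicProcessor normalization:

# Sketch of the decode step: token IDs -> Indic text, then de-normalize.
translations = tokenizer.batch_decode(generated, src=False)
translations = ip.postprocess_batch(translations, lang="hin_Deva")
print(translations[0])
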
Replace Stage

HF_DATASETS_CACHE=/home/$USER/tmp python replace.py \
    --paths_data "/home/$USER/setu-translate/examples/output/wiki_en/templated/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --batch_size 64 \
    --num_procs 1 \
    --decode_base_path "/home/$USER/setu-translate/examples/output/wiki_en/decode/*.arrow" \
    --translated_save_path "/home/$USER/setu-translate/examples/output/wiki_en/translated"

About

License: MIT

