HPLT - High Performance Language Technologies repositories
sacremoses
Python port of Moses tokenizer, truecaser and normalizer
OpusCleaner
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
OpusTrainer
Curriculum training
monolingual-multilingual-instruction-tuning
Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
data-analytics-tool
Data Analytics Tool
HPLT-MT-Models
Configuration and scripts for HPLT MT model releases.
warc2text-runner
Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
ia-download
Internet Archive downloader
monotextor-slurm
Set of scripts to run a monotextor-like pipeline on Slurm HPC clusters
document-aligner
TF-IDF-based document aligner from Bitextor
clianer
A lightweight command-line frontend to OpusCleaner
OpusFilter
OpusFilter - Parallel corpus processing toolkit
paracrawl-dashboard
Makeshift interface for managing Paracrawl processing and exploring its outputs