cartesinus / on-quality-estimation

Overview

This repository allows you to build models for machine translation (MT) quality estimation (QE). It is clearly a Quest++ rip-off that I made in order to experiment with 'before BERT' QE.

The data:

  • English-German WMT18 sentences in the IT domain, translated by an in-house encoder-decoder attention-based NMT system (13,442 training and 1,000 development sentences)
  • Running ./scripts/download-data.sh downloads the data to data/sentence-level/features/en_de.
  • The usual 17 features used in WMT12-17 are used for the baseline system
  • The WMT18 QE baseline model was an SVM regression with an RBF kernel, using grid search to optimise the relevant parameters. I tried to reproduce this in config/svc.cfg
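
As a rough illustration of the baseline described above, the following sketch fits an RBF-kernel SVR with a grid search in scikit-learn. It is not the code from this repository: the data is synthetic, and the grid values, the scaling step, and the scoring metric are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for 17 baseline features and a quality score
# (illustrative data only, not the WMT18 corpus).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 17))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Scale the features, then fit an SVR with an RBF kernel.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Hypothetical search grid: these values are assumptions,
# not the settings used by the WMT18 baseline.
param_grid = {
    "svr__C": [1, 10],
    "svr__gamma": [0.01, 0.1],
    "svr__epsilon": [0.1, 0.2],
}

search = GridSearchCV(
    pipeline, param_grid, cv=3, scoring="neg_mean_absolute_error"
)
search.fit(X, y)

best_params = search.best_params_   # best combination found on the grid
predictions = search.predict(X[:5])  # predictions from the refitted best model
```

The pipeline keeps scaling inside each cross-validation fold, so the grid search does not see statistics from the held-out part of the data.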

Train model

The program takes as input a method, a config file, and additional parameters.

For example, to train a model:

./quality_estimation.py --train --config config/svc.yaml

Preparing training corpora

To extract features from a TSV file (required columns: src and trg):

./quality_estimation.py --extract_features \
                       --src_lm_path data/lm.tok.en \
                       --trg_lm_path data/lm.tok.de \
                       --trg_ncount_path data/ngram-count.de \
                       -i input.tsv -i output.tsv

Also remember to provide the SRILM path, either by exporting SRILM_PATH or with --srilm_path.
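
Before running feature extraction it can help to check that the input file really has the two required columns. The sketch below is a hypothetical helper, not part of this repository; it assumes the TSV file starts with a header row naming the columns, and `read_sentence_pairs` is an illustrative name.

```python
import csv
import io

REQUIRED_COLUMNS = ("src", "trg")


def read_sentence_pairs(handle):
    """Return (src, trg) pairs from a tab-separated file with a header row."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    fieldnames = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in fieldnames]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    return [(row["src"], row["trg"]) for row in reader]


# In-memory example; a real call would pass open("input.tsv", encoding="utf-8").
sample = io.StringIO("src\ttrg\nOpen the file.\tÖffnen Sie die Datei.\n")
pairs = read_sentence_pairs(sample)
```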

Available learning methods

All available methods are taken from sklearn, so it is fairly easy to add others as well, but currently these are "supported":

Feature selection

To set up a feature selection algorithm add the "feature_selection" section to the configuration file. This section is independent of the "learning" section:

feature_selection:
    method: LinearSVC
    parameters:
        cv: 10

learning:
    ...

Currently, the following feature selection algorithms are available:

  • Linear Support Vector Classification. The exposed parameters are:
    • penalty (default=’l2’)
    • loss (default=’squared_hinge’)
    • dual (default=True)
    • tol (default=1e-4)
    • C (default=1.0)
    • fit_intercept (default=True)
    • intercept_scaling (default=1)
    • max_iter (default=1000)

These parameters and the method are documented at: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
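
For orientation, one common way to use LinearSVC for feature selection in scikit-learn is to wrap it in SelectFromModel with an L1 penalty, so that features with zeroed coefficients are dropped. Whether this repository wires it up the same way is an assumption, and the sketch ignores the cv setting from the configuration example. The data is synthetic and the value of C is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic classification data with 17 features (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=17, n_informative=5, random_state=0
)

# The L1 penalty pushes some coefficients to exactly zero;
# scikit-learn requires dual=False for this penalty/loss combination.
svc = LinearSVC(
    penalty="l1", loss="squared_hinge", dual=False,
    C=0.1, tol=1e-4, max_iter=1000,
)

# Keep only the features whose coefficients survive the L1 penalty.
selector = SelectFromModel(svc)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)  # indices of the retained features
```

A smaller C strengthens the penalty and keeps fewer features, so it is the main parameter to tune for this kind of selection.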

Inference

To run inference with a model on a given input:

./quality_estimation.py --inference --config config/svc.yaml --input test.tsv
