Overview
This repository lets you build models for machine translation (MT) quality estimation (QE). It is essentially a Quest++ rip-off that I made in order to experiment with 'pre-BERT' QE.
The data:
- English-German WMT18 sentences in the IT domain, translated by an in-house attention-based encoder-decoder NMT system (13,442 training and 1,000 development sentences)
- After running ./scripts/download-data.sh the data will be downloaded to data/sentence-level/features/en_de
- The usual 17 features used in WMT12-17 are used for the baseline system
- The WMT18 QE baseline model was SVM regression with an RBF kernel, with grid search used to optimise the relevant parameters. I tried to reproduce this in
config/svc.cfg
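As a rough illustration of the baseline idea (RBF-kernel SVM regression tuned by grid search), here is a minimal scikit-learn sketch. The synthetic data, the feature scaling step and the grid values are assumptions made for the example; they are not the settings of the WMT18 baseline or of this repository's config.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the 17 baseline features and sentence-level scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 17))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=60)

# RBF-kernel SVR; scaling is added here (an assumption) because SVMs
# are sensitive to feature scale.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Illustrative grid only, not the values used by the actual baseline.
param_grid = {
    "svr__C": [1, 10],
    "svr__gamma": [0.01, 0.1],
    "svr__epsilon": [0.1, 0.2],
}

search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)

best_params = search.best_params_      # best combination found on the grid
predictions = search.predict(X[:5])    # predictions from the refitted model
```

GridSearchCV refits the best combination on the full training set by default, so the fitted search object can be used directly for prediction.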
Train model
The program takes as input a method, a config file and additional parameters.
For example, to train a model:
./quality_estimation.py --train --config config/svc.yaml
Preparing training corpora
To extract features from a TSV file (required columns: src and trg):
./quality_estimation.py --extract_features \
--src_lm_path data/lm.tok.en \
--trg_lm_path data/lm.tok.de \
--trg_ncount_path data/ngram-count.de \
-i input.tsv -i output.tsv
Also remember to provide the SRILM path, either by exporting SRILM_PATH or with --srilm_path.
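Before running feature extraction, it can help to check that the input TSV actually has the src and trg columns. The sketch below uses only the standard library and assumes the column names appear in a header row; the helper name missing_columns is hypothetical and not part of this repository.

```python
import csv
import io

REQUIRED_COLUMNS = {"src", "trg"}


def missing_columns(tsv_file):
    """Return the required columns absent from the header of a TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # fieldnames is None for an empty file, so fall back to an empty header.
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)


# In-memory example; with a real file use open("input.tsv", newline="").
sample = io.StringIO("src\ttrg\nOpen the file .\tÖffnen Sie die Datei .\n")
print(missing_columns(sample))  # []
```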
Available learning methods
All available methods are taken from sklearn, so it is fairly easy to add others, but currently these are "supported":
- Support-Vector Machines (SVM). Documentation about classifier parameters is available at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC. The parameters exposed in the "Parameters" section of the configuration file are:
- C
- coef0
- kernel
- degree
- gamma
- tol
- verbose
- Decision Trees (DT). Documentation about classifier parameters is available at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html. The parameters exposed in the "Parameters" section of the configuration file are:
- ccp_alpha
- criterion
- Multilayer Perceptron (MLP). Documentation about classifier parameters is available at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html. The parameters exposed in the "Parameters" section of the configuration file are:
- activation
- alpha
- epsilon
- hidden_layer_sizes
- learning_rate_init
- max_iter
- momentum
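To show how a method name plus a "Parameters" section could be turned into an sklearn estimator, here is a minimal sketch. The ESTIMATORS mapping, the method keys ("SVC", "DT", "MLP") and the build_estimator helper are hypothetical names chosen for illustration; the repository's own loading code may look quite different.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical mapping from config method names to sklearn classes.
ESTIMATORS = {
    "SVC": SVC,
    "DT": DecisionTreeClassifier,
    "MLP": MLPClassifier,
}


def build_estimator(method, parameters):
    """Instantiate an estimator, rejecting parameters the class does not accept."""
    estimator_cls = ESTIMATORS[method]
    # get_params() on a default instance lists every constructor argument.
    allowed = estimator_cls().get_params()
    unknown = sorted(set(parameters) - set(allowed))
    if unknown:
        raise ValueError(f"Unknown parameters for {method}: {unknown}")
    return estimator_cls(**parameters)


model = build_estimator("SVC", {"C": 10, "kernel": "rbf", "gamma": 0.01})
```

Checking the parameter names up front gives a clearer error than a TypeError from the constructor, for example when momentum is passed to a decision tree by mistake.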
Feature selection
To set up a feature selection algorithm add the "feature_selection" section to the configuration file. This section is independent of the "learning" section:
feature_selection:
    method: LinearSVC
    parameters:
        cv: 10
learning:
    ...
Currently, the following feature selection algorithms are available:
- Linear Support Vector Classification. The exposed parameters are:
- penalty (default='l2')
- loss (default='squared_hinge')
- dual (default=True)
- tol (default=1e-4)
- C (default=1.0)
- fit_intercept (default=True)
- intercept_scaling (default=1)
- max_iter (default=1000)
These parameters and the method are documented at: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
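One common way to use LinearSVC for feature selection in scikit-learn is to fit it with an L1 penalty and keep the features with non-zero weights via SelectFromModel. The sketch below illustrates that pattern on synthetic data; that this repository wires the selector in the same way is an assumption, as are the binary labels and the parameter values.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic data: 17 features, binary labels driven by the first two only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 17))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The L1 penalty drives uninformative weights to zero; LinearSVC requires
# dual=False when penalty="l1" is combined with the squared hinge loss.
svc = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                C=0.05, max_iter=1000)

selector = SelectFromModel(svc)
X_selected = selector.fit_transform(X, y)

# Indices of the feature columns that survived selection.
kept = selector.get_support(indices=True)
```

A smaller C means stronger regularisation and therefore fewer surviving features, so C is the main knob for how aggressive the selection is.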
Inference
To run inference with a trained model on a given input:
./quality_estimation.py --inference --config config/svc.yaml --input test.tsv