beckdaniel / uncertainty_qe

Experiments to address uncertainty in Quality Estimation


Requirements

  • Python 2.7
  • Numpy
  • Matplotlib
  • scikit-learn (for SVMs)
  • GPy (warped_gp_fixes branch in beckdaniel's fork)

Data

3 datasets:

  • WMT14 English-Spanish
  • EAMT11 English-Spanish
  • EAMT11 French-English

Features are the 17 baseline QuEst features. Response variables are post-editing time per word (see Graham (2015), ACL).

Experiments sketch

For each dataset, we perform experiments using the following ML models (a sketch of instantiating them follows the list):

  • SVM + Bagging
  • GP RBF Kernel Isotropic
  • GP RBF Kernel ARD
  • GP Matern32 Kernel ARD
  • Warped GP Matern32 Kernel ARD
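
A minimal sketch of how these models could be built with scikit-learn and GPy (the warped GP is the reason for the `warped_gp_fixes` branch). The placeholder data and the use of the bagging ensemble's spread as an uncertainty estimate are illustrative assumptions, not necessarily the repository's exact setup:

```python
import numpy as np
import GPy
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor

# Placeholder data: 17 baseline QuEst features, one response column.
X = np.random.randn(100, 17)
y = np.random.randn(100, 1)

# SVM + Bagging: the spread of the ensemble's predictions can serve
# as a (rough) uncertainty estimate.
svm_bag = BaggingRegressor(SVR()).fit(X, y.ravel())

# GP, isotropic RBF kernel (one shared lengthscale for all features).
gp_rbf_iso = GPy.models.GPRegression(X, y, GPy.kern.RBF(17))

# GP, RBF kernel with ARD (one lengthscale per feature).
gp_rbf_ard = GPy.models.GPRegression(X, y, GPy.kern.RBF(17, ARD=True))

# GP, Matern32 kernel with ARD.
gp_mat_ard = GPy.models.GPRegression(X, y, GPy.kern.Matern32(17, ARD=True))

# Warped GP, Matern32 kernel with ARD.
gp_warp = GPy.models.WarpedGP(X, y, kernel=GPy.kern.Matern32(17, ARD=True))

for m in (gp_rbf_iso, gp_rbf_ard, gp_mat_ard, gp_warp):
    m.optimize()
```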

Intrinsic evaluation is done with the following metrics (a computation sketch follows the list):

  • NLPD (negative log predictive density; only available for GPs)
  • MAE
  • MSE
  • Pearson's correlation (see Graham (2015), ACL)
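
A numpy-only sketch of these metrics, assuming Gaussian predictive distributions with means `mu` and variances `var` against gold scores `y` (for the warped GP, the density should be evaluated in the original response space, e.g. via GPy's `log_predictive_density`):

```python
import numpy as np

def nlpd(y, mu, var):
    # Negative log predictive density of y under a Gaussian N(mu, var),
    # averaged over the test set (lower is better).
    return np.mean(0.5 * np.log(2 * np.pi * var) + (y - mu) ** 2 / (2 * var))

def mae(y, mu):
    return np.mean(np.abs(y - mu))

def mse(y, mu):
    return np.mean((y - mu) ** 2)

def pearson(y, mu):
    return np.corrcoef(y, mu)[0, 1]
```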

Extrinsic evaluation is done on the following tasks:

  • Reject option setting
  • Active learning setting
  • ?

Everything is done via cross-validation (5 folds by default, but this can be changed).
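
A sketch of the cross-validation wrapper, assuming the modern `sklearn.model_selection` API (older scikit-learn releases for Python 2.7 exposed `KFold` under `sklearn.cross_validation` with a slightly different signature); `build_model` is a hypothetical factory for any of the models above:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, build_model, n_folds=5):
    # Collect per-fold predictions; 5 folds mirrors the default above.
    preds = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(X):
        model = build_model(X[train_idx], y[train_idx])
        preds.append((test_idx, model.predict(X[test_idx])))
    return preds
```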

Reject options

Main idea: ignore predictions with high uncertainty (variance). We plot curves measuring intrinsic metrics (Pearson's?) on the top N% most confident predictions. Ideal curves should be monotonic: the metric should improve steadily as the least confident predictions are rejected.
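
A sketch of one such curve, assuming confidence is ranked by predictive variance; the resulting (coverage, Pearson) pairs can then be plotted with Matplotlib:

```python
import numpy as np

def reject_curve(y, mu, var, steps=20):
    # Rank predictions from most to least confident (lowest variance first).
    order = np.argsort(var)
    curve = []
    for i in range(1, steps + 1):
        # Keep the top n most confident predictions (at least 2, so that
        # Pearson's r is defined).
        n = max(2, int(len(y) * i / float(steps)))
        kept = order[:n]
        curve.append((float(n) / len(y), np.corrcoef(y[kept], mu[kept])[0, 1]))
    return curve
```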

Active learning

The setting is similar to Beck et al. (2013, ACL). Start with a small training set (default: 50 instances), measure error (Pearson's?) on the test set, and use active learning to incrementally grow the training set. Error should reach a plateau after labelling relatively few sentences. An oracle setting is also available.
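
A sketch of a variance-based query loop with GPy; the names and the highest-variance query strategy are illustrative assumptions, not necessarily the repository's exact implementation:

```python
import numpy as np
import GPy

def active_learning(X_pool, y_pool, X_test, y_test, seed_size=50, steps=100):
    idx = np.random.permutation(len(X_pool))
    train, pool = list(idx[:seed_size]), list(idx[seed_size:])
    scores = []
    for _ in range(steps):
        m = GPy.models.GPRegression(X_pool[train], y_pool[train],
                                    GPy.kern.Matern32(X_pool.shape[1], ARD=True))
        m.optimize()
        mu, _ = m.predict(X_test)
        scores.append(np.corrcoef(y_test.ravel(), mu.ravel())[0, 1])
        # Query the pool instance the model is least certain about.
        _, pool_var = m.predict(X_pool[pool])
        train.append(pool.pop(int(np.argmax(pool_var))))
    return scores  # Pearson's r at each training set size
```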
