TrueSkill for WMT

Source code used in the 2014 WMT paper, "Efficient Elicitation of Annotations for Human Evaluation of Machine Translation"

  • Keisuke Sakaguchi (keisuke[at]cs.jhu.edu)
  • Matt Post
  • Benjamin Van Durme

Last updated: June 12th, 2015


This document describes the method proposed in the following paper:

@InProceedings{sakaguchi-post-vandurme:2014:W14-33,
  author    = {Sakaguchi, Keisuke  and  Post, Matt  and  Van Durme, Benjamin},
  title     = {Efficient Elicitation of Annotations for Human Evaluation of Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {1--11},
  url       = {http://www.aclweb.org/anthology/W14-3301}
}

Required (and optional) Python modules:

Example Procedure:

    1. Preprocessing: convert an xml file (from Appraise) into a csv file (an illustrative sketch follows this step).
    • mkdir result if it does not already exist.
    • cd data
    • python xml2csv.py ABC.xml
    • The xml/csv file must contain a single language pair.
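
The conversion in step 1 is just XML parsing plus csv writing. The toy sketch below only shows the general shape of such a converter; the tag and attribute names (ranking-task, translation, system, rank) are hypothetical placeholders, not the real Appraise schema that xml2csv.py handles.

```python
# Purely illustrative xml-to-csv conversion; element/attribute names are
# hypothetical and differ from the actual Appraise format.
import csv, sys
import xml.etree.ElementTree as ET

def xml_to_csv(xml_path, csv_path):
    root = ET.parse(xml_path).getroot()
    with open(csv_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['segment', 'system', 'rank'])
        for task in root.iter('ranking-task'):       # hypothetical tag
            seg = task.get('segment-id')             # hypothetical attribute
            for t in task.iter('translation'):       # hypothetical tag
                writer.writerow([seg, t.get('system'), t.get('rank')])

if __name__ == '__main__':
    xml_to_csv(sys.argv[1], sys.argv[1].replace('.xml', '.csv'))
```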
    2. Training: run python infer_TS.py (TrueSkill) in the src directory (see the sketches after this step).
    • cd ../src
    • cat ../data/ABC.csv | python infer_TS.py ../result/ABC -n 2 -d 0 -s 2
    • For more details: python infer_TS.py --help
    • You can change other parameters in infer_TS.py if needed.
    • For clustering (i.e. grouped ranking), run infer_TS.py multiple times (100+ runs are recommended) for each language pair (e.g. for fr-en, from fr-en0 to fr-en99).
    • The result is written to OUT_ID_mu_sigma.json in the result directory.
    • To use Expected Win instead, run python infer_EW.py -s 2 ../result/ABC
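
At the core of infer_TS.py, each pairwise judgment nudges the skill estimate (mu) and uncertainty (sigma) of the two systems involved. The sketch below shows the standard two-player TrueSkill win/loss update; it is only an illustration of the technique, not the repository's implementation, and the BETA/TAU constants are assumed defaults.

```python
# Minimal sketch of a TrueSkill-style update for one pairwise comparison
# (win/loss only, no draw margin).  Not the code in infer_TS.py.
import math

BETA = 25.0 / 6.0   # performance noise (assumed default)
TAU = 25.0 / 300.0  # dynamics noise added before each update (assumed)

def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def update(winner, loser):
    """winner, loser are (mu, sigma) tuples; returns the updated tuples."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    var_w = sig_w ** 2 + TAU ** 2          # inflate variance slightly
    var_l = sig_l ** 2 + TAU ** 2
    c = math.sqrt(2.0 * BETA ** 2 + var_w + var_l)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / max(_cdf(t), 1e-12)      # additive correction
    w = v * (v + t)                        # multiplicative correction
    mu_w_new = mu_w + (var_w / c) * v
    mu_l_new = mu_l - (var_l / c) * v
    sig_w_new = math.sqrt(var_w * (1.0 - (var_w / c ** 2) * w))
    sig_l_new = math.sqrt(var_l * (1.0 - (var_l / c ** 2) * w))
    return (mu_w_new, sig_w_new), (mu_l_new, sig_l_new)

# Example: system A beats system B once.
a, b = (25.0, 25.0 / 3.0), (25.0, 25.0 / 3.0)
a, b = update(a, b)
print(a, b)   # A's mu goes up, B's goes down, both sigmas shrink
```

infer_EW.py implements the Expected Win baseline instead. A rough sketch of that idea, assuming each system's score is its average win rate against every other system with ties left out (the field names and outcome labels are hypothetical, not the repository's csv schema):

```python
from collections import defaultdict

def expected_wins(judgments):
    """judgments: iterable of (sys1, sys2, outcome), outcome in {'<', '>', '='}."""
    wins = defaultdict(lambda: defaultdict(int))
    systems = set()
    for s1, s2, outcome in judgments:
        systems.update([s1, s2])
        if outcome == '<':        # s1 judged better than s2
            wins[s1][s2] += 1
        elif outcome == '>':      # s2 judged better; ties ('=') are ignored
            wins[s2][s1] += 1
    scores = {}
    for a in systems:
        rates = [wins[a][b] / (wins[a][b] + wins[b][a])
                 for b in systems - {a} if wins[a][b] + wins[b][a] > 0]
        scores[a] = sum(rates) / len(rates) if rates else 0.0
    return scores
```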
    3. To see the grouped ranking, run cluster.py in the eval directory (a clustering sketch follows this step).
    • cd ../eval
    • python cluster.py fr-en ../result/fr-en -n 100 -by-rank -pdf
    • For more details: python cluster.py --help
    • The pdf option may raise a RuntimeError; if it does, check whether the pdf file was still generated successfully.
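
cluster.py derives a grouped ranking from the many per-run results. One common way to do this, sketched below, is to collect each system's rank across the runs, form a 95% rank interval per system, and put systems with overlapping intervals into the same group; this illustrates the idea but is not necessarily the exact algorithm in cluster.py.

```python
# Hedged sketch of grouped ranking from multiple bootstrap runs.
import numpy as np

def rank_intervals(mus_per_run):
    """mus_per_run: list of {system: mu} dicts, one per run."""
    ranks = {}
    for run in mus_per_run:
        ordered = sorted(run, key=run.get, reverse=True)
        for r, sys in enumerate(ordered, start=1):
            ranks.setdefault(sys, []).append(r)
    return {s: (np.percentile(r, 2.5), np.percentile(r, 97.5))
            for s, r in ranks.items()}

def cluster(intervals):
    """Group systems whose 95% rank intervals overlap into the same cluster."""
    ordered = sorted(intervals.items(), key=lambda kv: kv[1][0])
    clusters, current, current_hi = [], [], None
    for sys, (lo, hi) in ordered:
        if current and lo > current_hi:      # no overlap: start a new group
            clusters.append(current)
            current, current_hi = [], None
        current.append(sys)
        current_hi = hi if current_hi is None else max(current_hi, hi)
    if current:
        clusters.append(current)
    return clusters
```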
    4. (optional) To tune the decision radius used for accuracy, run tune_acc.py (see the sketch after this step).
    • e.g. cat data/sample-fr-en-{dev|test}.csv | python src/eval_acc.py -d 0.1 -i result/fr-en0_mu_sigma.json
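
The decision radius (the -d parameter) works roughly as follows: if two systems' inferred mus are within the radius of each other, a tie is predicted; otherwise the system with the higher mu is predicted to win, and accuracy is measured against held-out human judgments. A minimal sketch of that scoring, with hypothetical outcome labels:

```python
# Sketch of accuracy under a decision radius; not the code in eval_acc.py.
def predict(mu_a, mu_b, radius):
    if abs(mu_a - mu_b) <= radius:
        return '='                        # close enough: predict a tie
    return '<' if mu_a > mu_b else '>'    # '<' meaning A judged better

def accuracy(judgments, mus, radius):
    """judgments: iterable of (sys_a, sys_b, gold_outcome)."""
    correct = total = 0
    for a, b, gold in judgments:
        total += 1
        if predict(mus[a], mus[b], radius) == gold:
            correct += 1
    return correct / total if total else 0.0
```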
    5. (optional) To see the next systems to be compared, run python src/scripts/next_comparisons.py *_mu_sigma.json N (a selection sketch follows this step).
    • This outputs the next comparisons, given the current mu and sigma values (.json), for N free-for-all matches.
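
One plausible way to pick the next matches from a *_mu_sigma.json file is to prefer pairs whose outcome is most uncertain under the current ratings, e.g. by TrueSkill match quality. The sketch below assumes the json maps each system name to a [mu, sigma] pair and uses that heuristic; next_comparisons.py may select matches differently.

```python
# Hedged sketch of match selection by TrueSkill match quality.
import json, math
from itertools import combinations

BETA = 25.0 / 6.0   # assumed performance-noise constant

def match_quality(a, b):
    """a, b are (mu, sigma); higher means a closer, more informative match."""
    c2 = 2.0 * BETA ** 2 + a[1] ** 2 + b[1] ** 2
    return math.sqrt(2.0 * BETA ** 2 / c2) * math.exp(-((a[0] - b[0]) ** 2) / (2.0 * c2))

def next_matches(mu_sigma_path, n):
    with open(mu_sigma_path) as f:
        ratings = json.load(f)            # assumed format: {system: [mu, sigma]}
    pairs = combinations(ratings, 2)
    scored = sorted(pairs,
                    key=lambda p: match_quality(ratings[p[0]], ratings[p[1]]),
                    reverse=True)
    return scored[:n]
```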

Questions and comments:

  • Please e-mail Keisuke Sakaguchi (keisuke[at]cs.jhu.edu).
