vietai / mTet

MTet: Multi-domain Translation for English and Vietnamese


Data leakage issue in evaluation?

datquocnguyen opened this issue

Hi team @lmthang @thtrieu @heraclex12 @hqphat @KienHuynh

The results obtained by a Transformer-based model on the PhoMT test set surprised me. Since the VietAI and PhoMT datasets cover several overlapping domains (e.g. WikiHow, TED Talks, OpenSubtitles, news), my first thought was that there might be a data leakage issue in your evaluation (e.g. PhoMT English-Vietnamese test pairs appearing in the VietAI training set).

In particular, we find that 6294/19151 PhoMT English-Vietnamese test pairs appear in the VietAI training set (v2). When evaluating your model on the PhoMT test set, did you retrain it on a variant of the VietAI training set that excludes these PhoMT English-Vietnamese test pairs?

Cheers,
Dat.

thtrieu commented

Hi @datquocnguyen,

We only filtered for IWSLT'15 and did miss out on this!
We are now rolling back the results on the PhoMT test set and will update with new results later on.

Apologies for any inconvenience and thanks a lot for the good catch!
Trieu.

FYI, here is the source code we used to count the number of duplicate pairs between the PhoMT test set and the VietAI training set:

import re
import unicodedata

# Paths to the VietAI training data (reference) and the detokenized PhoMT test data (query).
en_path_ref = '/Absolute-path-to-VietAI-dataset/best_vi_translation_v2_train.en'
vi_path_ref = '/Absolute-path-to-VietAI-dataset/best_vi_translation_v2_train.vi'
en_path = '/Absolute-path-to-PhoMT-detokenization-dataset/test.en'
vi_path = '/Absolute-path-to-PhoMT-detokenization-dataset/test.vi'

# Map old-style Vietnamese tone-mark placement to the new style (e.g. "òa" -> "oà", "úy" -> "uý"),
# so the two spellings of the same word compare as equal after normalization.
vietnamese_tones_map = {
    "òa": "oà", "Òa": "Oà", "ÒA": "OÀ", "óa": "oá", "Óa": "Oá", "ÓA": "OÁ", "ỏa": "oả", "Ỏa": "Oả",
    "ỎA": "OẢ", "õa": "oã", "Õa": "Oã", "ÕA": "OÃ", "ọa": "oạ", "Ọa": "Oạ", "ỌA": "OẠ",
    "òe": "oè", "Òe": "Oè", "ÒE": "OÈ", "óe": "oé", "Óe": "Oé", "ÓE": "OÉ", "ỏe": "oẻ", "Ỏe": "Oẻ",
    "ỎE": "OẺ", "õe": "oẽ", "Õe": "Oẽ", "ÕE": "OẼ", "ọe": "oẹ", "Ọe": "Oẹ", "ỌE": "OẸ",
    "ùy": "uỳ", "Ùy": "Uỳ", "ÙY": "UỲ", "úy": "uý", "Úy": "Uý", "ÚY": "UÝ", "ủy": "uỷ", "Ủy": "Uỷ",
    "ỦY": "UỶ", "ũy": "uỹ", "Ũy": "Uỹ", "ŨY": "UỸ", "ụy": "uỵ", "Ụy": "Uỵ", "ỤY": "UỴ",
}
# Punctuation stripped from token boundaries.
punctuation = r"""!"#&'()*+,-./:;<=>?@[\]^_'`{|}~"""
def normalize(text, lang='en'):
    # Unicode-normalize, and for Vietnamese rewrite old-style tone placement.
    text = unicodedata.normalize('NFC', text)
    if lang == 'vi':
        for i, j in vietnamese_tones_map.items():
            text = text.replace(i, j)
    # Re-attach tokenized negative contractions (e.g. "do n't" -> "don't").
    text = text.strip().replace(" n't ", "n't ")
    # Lowercase and strip punctuation from token boundaries.
    new = [token.strip(punctuation) for token in text.lower().split()]
    return " ".join(" ".join(new).split())

def read_file(file_path, lang):
    # Read a parallel file and normalize each line.
    lines = []
    with open(file_path) as f:
        for line in f:
            lines.append(normalize(line, lang))
    return lines

def remove_dup(en_path_ref, vi_path_ref, en_path, vi_path):
    # Return the indices of (en, vi) pairs in en_path/vi_path whose normalized
    # concatenation also appears in the reference files en_path_ref/vi_path_ref.
    en_sents_ref = read_file(en_path_ref, 'en')
    vi_sents_ref = read_file(vi_path_ref, 'vi')
    en_sents = read_file(en_path, 'en')
    vi_sents = read_file(vi_path, 'vi')
    removed_indices = []
    # Build a set of normalized reference pairs for O(1) lookup.
    concatenated_sents_ref = set()
    for en_sent, vi_sent in zip(en_sents_ref, vi_sents_ref):
        concatenated_sents_ref.add(en_sent + ' ' + vi_sent)
    # Collect the indices of query pairs that duplicate a reference pair.
    for index, en_sent in enumerate(en_sents):
        key = en_sent + ' ' + vi_sents[index]
        if key in concatenated_sents_ref:
            removed_indices.append(index)
    return removed_indices

removed_indices = remove_dup(en_path_ref, vi_path_ref, en_path, vi_path)
print('Total duplicate pairs: ', len(removed_indices))
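
As a usage sketch only (this is an assumption about how the cleanup could be done, not something stated in the thread), the same function could be run the other way round, treating the PhoMT test set as the reference, to build a copy of the VietAI training data that excludes the overlapping pairs before retraining; the output file names below are hypothetical:

# Sketch: indices of VietAI training pairs that duplicate a PhoMT test pair
# (reference = PhoMT test set, query = VietAI training set).
train_dup_indices = set(remove_dup(en_path, vi_path, en_path_ref, vi_path_ref))

# Write out a filtered training set (hypothetical output file names).
with open(en_path_ref) as f_en, open(vi_path_ref) as f_vi, \
     open('train.filtered.en', 'w') as o_en, open('train.filtered.vi', 'w') as o_vi:
    for index, (en_line, vi_line) in enumerate(zip(f_en, f_vi)):
        if index not in train_dup_indices:
            o_en.write(en_line)
            o_vi.write(vi_line)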

Thanks for the feedback @datquocnguyen and @tranluongnguyen25. We'll revisit the evaluation to make it fair. Our main goal has always been to build on top of the good work of others so that we can make faster progress as a community.