vietai / mTet

MTet: Multi-domain Translation for English and Vietnamese


Data leakage issue in evaluation?

datquocnguyen opened this issue

Hi team @lmthang @thtrieu @heraclex12 @hqphat @KienHuynh

The results obtained by a Transformer-based model on the PhoMT test set surprised me. Since the VietAI and PhoMT datasets cover several overlapping domains (e.g. WikiHow, TED Talks, OpenSubtitles, news), my first thought was that there might be a data leakage issue in your evaluation (e.g. PhoMT English-Vietnamese test pairs appearing in the VietAI training set).

In particular, we find that 6294/19151 PhoMT English-Vietnamese test pairs appear in the VietAI training set (v2). When evaluating your model on the PhoMT test set, did you retrain it on a variant of the VietAI training set that excludes these PhoMT English-Vietnamese test pairs?

Cheers,
Dat.

thtrieu commented

Hi @datquocnguyen,

We only filtered for IWSLT'15 and did miss out on this!
We are now rolling back the results on the PhoMT test set and will update with new results later on.

Apologies for any inconvenience and thanks a lot for the good catch!
Trieu.

FYI, here is the source code we used to count the number of duplicate pairs between the PhoMT test set and the VietAI training set:

import re
import unicodedata

# Paths to the VietAI training data (reference) and the detokenized PhoMT test data (query).
en_path_ref = '/Absolute-path-to-VietAI-dataset/best_vi_translation_v2_train.en'
vi_path_ref = '/Absolute-path-to-VietAI-dataset/best_vi_translation_v2_train.vi'
en_path = '/Absolute-path-to-PhoMT-detokenization-dataset/test.en'
vi_path = '/Absolute-path-to-PhoMT-detokenization-dataset/test.vi'

# Map old-style Vietnamese tone-mark placement to the new style (e.g. "òa" -> "oà", "úy" -> "uý"),
# so the two spellings of the same word compare as equal after normalization.
vietnamese_tones_map = {
    "òa": "oà", "Òa": "Oà", "ÒA": "OÀ", "óa": "oá", "Óa": "Oá", "ÓA": "OÁ", "ỏa": "oả", "Ỏa": "Oả",
    "ỎA": "OẢ", "õa": "oã", "Õa": "Oã", "ÕA": "OÃ", "ọa": "oạ", "Ọa": "Oạ", "ỌA": "OẠ",
    "òe": "oè", "Òe": "Oè", "ÒE": "OÈ", "óe": "oé", "Óe": "Oé", "ÓE": "OÉ", "ỏe": "oẻ", "Ỏe": "Oẻ",
    "ỎE": "OẺ", "õe": "oẽ", "Õe": "Oẽ", "ÕE": "OẼ", "ọe": "oẹ", "Ọe": "Oẹ", "ỌE": "OẸ",
    "ùy": "uỳ", "Ùy": "Uỳ", "ÙY": "UỲ", "úy": "uý", "Úy": "Uý", "ÚY": "UÝ", "ủy": "uỷ", "Ủy": "Uỷ",
    "ỦY": "UỶ", "ũy": "uỹ", "Ũy": "Uỹ", "ŨY": "UỸ", "ụy": "uỵ", "Ụy": "Uỵ", "ỤY": "UỴ",
}
# Punctuation stripped from token boundaries.
punctuation = r"""!"#&'()*+,-./:;<=>?@[\]^_'`{|}~"""
def normalize(text, lang='en'):
    # Unicode-normalize, and for Vietnamese rewrite old-style tone placement.
    text = unicodedata.normalize('NFC', text)
    if lang == 'vi':
        for i, j in vietnamese_tones_map.items():
            text = text.replace(i, j)
    # Re-attach tokenized negative contractions (e.g. "do n't" -> "don't").
    text = text.strip().replace(" n't ", "n't ")
    # Lowercase and strip punctuation from token boundaries.
    new = [token.strip(punctuation) for token in text.lower().split()]
    return " ".join(" ".join(new).split())

def read_file(file_path, lang):
    # Read a parallel file and normalize each line.
    lines = []
    with open(file_path) as f:
        for line in f:
            lines.append(normalize(line, lang))
    return lines

def remove_dup(en_path_ref, vi_path_ref, en_path, vi_path):
    # Return the indices of (en, vi) pairs in en_path/vi_path whose normalized
    # concatenation also appears in the reference files en_path_ref/vi_path_ref.
    en_sents_ref = read_file(en_path_ref, 'en')
    vi_sents_ref = read_file(vi_path_ref, 'vi')
    en_sents = read_file(en_path, 'en')
    vi_sents = read_file(vi_path, 'vi')
    removed_indices = []
    # Build a set of normalized reference pairs for O(1) lookup.
    concatenated_sents_ref = set()
    for en_sent, vi_sent in zip(en_sents_ref, vi_sents_ref):
        concatenated_sents_ref.add(en_sent + ' ' + vi_sent)
    # Collect the indices of query pairs that duplicate a reference pair.
    for index, en_sent in enumerate(en_sents):
        key = en_sent + ' ' + vi_sents[index]
        if key in concatenated_sents_ref:
            removed_indices.append(index)
    return removed_indices

removed_indices = remove_dup(en_path_ref, vi_path_ref, en_path, vi_path)
print('Total duplicate pairs: ', len(removed_indices))
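
As a usage sketch only (this is an assumption about how the cleanup could be done, not something stated in the thread), the same function could be run the other way round, treating the PhoMT test set as the reference, to build a copy of the VietAI training data that excludes the overlapping pairs before retraining; the output file names below are hypothetical:

# Sketch: indices of VietAI training pairs that duplicate a PhoMT test pair
# (reference = PhoMT test set, query = VietAI training set).
train_dup_indices = set(remove_dup(en_path, vi_path, en_path_ref, vi_path_ref))

# Write out a filtered training set (hypothetical output file names).
with open(en_path_ref) as f_en, open(vi_path_ref) as f_vi, \
     open('train.filtered.en', 'w') as o_en, open('train.filtered.vi', 'w') as o_vi:
    for index, (en_line, vi_line) in enumerate(zip(f_en, f_vi)):
        if index not in train_dup_indices:
            o_en.write(en_line)
            o_vi.write(vi_line)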

Thanks for the feedback @datquocnguyen and @tranluongnguyen25. We'll revisit the evaluation to make it fair. Our main goal has always been to build on top of the good work of others so that we can make faster progress as a community.