huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Home Page: https://huggingface.co/docs/evaluate

The difference between your bleu and sacrebleu

cooper12121 opened this issue · comments

What is the difference between your package's bleu implementation and the sacrebleu implementation? I get different results from the two. My data is Chinese, for which I passed sacrebleu's zh tokenizer.

I believe there are some differences between this implementation and sacrebleu's. Actually, testing with English shows the same problem.

evaluate

import evaluate


predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"]
] 

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references, smooth=False, max_order=4)
print(results)

I got these results:

{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6}
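
If it helps to see where those numbers come from: evaluate's bleu appears to be based on Google's NMT compute_bleu script, which tokenizes with a 13a-style tokenizer by default and, for the brevity penalty, takes the shortest reference per segment. The following is only a hand check under that assumption; it reproduces the reported translation_length of 7 and reference_length of 6.

predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"]
]

# Hand check, assuming the shortest reference per segment is used for the brevity penalty.
# A plain whitespace split is enough here, since 13a tokenization leaves these strings unchanged.
translation_length = sum(len(p.split()) for p in predictions)                      # 4 + 3 = 7
reference_length = sum(min(len(r.split()) for r in refs) for refs in references)   # min(4, 3) + 3 = 6
print(translation_length, reference_length)  # 7 6, i.e. the reported length_ratio of 7/6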

sacrebleu

from sacrebleu.metrics import BLEU


predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"]
]

bleu = BLEU(smooth_method="none", max_ngram_order=4, tokenize='13a')
results = bleu.corpus_score(predictions, references)
print(results)

I got these results:

BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 4 ref_len = 4)
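
A possible source of the gap, beyond tokenization: as far as I can tell, sacrebleu's corpus_score expects the references transposed relative to evaluate, i.e. one list per reference stream with one entry per hypothesis (missing references padded with None), not one list of references per sentence. A minimal sketch of the layout I believe sacrebleu wants:

from sacrebleu.metrics import BLEU

predictions = ["hello there general kenobi", "foo bar foobar"]

# Transposed layout: references[j][i] is the j-th reference for the i-th hypothesis.
# The second sentence has only one reference, so the second stream is padded with None.
references = [
    ["hello there general kenobi", "foo bar foobar"],
    ["hello there !", None]
]

bleu = BLEU(smooth_method="none", max_ngram_order=4, tokenize="13a")
print(bleu.corpus_score(predictions, references))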