GEM-benchmark / NL-Augmenter

This issue concerns the following line in the main test script:

Line 26 in 27ab1d7

for pred_output, output in zip(perturbs, outputs):

The zip() builtin (used in the line above to pair generated sentences with expected sentences) stops as soon as its shortest input iterable is exhausted, silently discarding any remaining items from the longer one. E.g.:

>>> list(zip([1,2,3], [6,7,8,9,10]))
[(1, 6), (2, 7), (3, 8)]

This means that if a transformation generates fewer sentences (e.g. 0) than expected, the test still passes, because the surplus expected sentences are never compared against anything. The reverse also holds: surplus generated sentences are silently ignored. As a result, it is impossible to test affirmatively that a transformation does not generate any outputs for a given input.
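A minimal sketch of the failure mode follows. The function `naive_check` and the sample sentences are hypothetical stand-ins for the loop in the test script, not code from the repository:

```python
def naive_check(perturbs, outputs):
    """Hypothetical stand-in for the test loop: compares only the pairs zip() yields."""
    for pred_output, output in zip(perturbs, outputs):
        if pred_output != output:
            return False
    return True


expected = ["first expected sentence", "second expected sentence"]
generated = []  # the transformation produced no sentences at all

# zip() yields no pairs, so the loop body never runs and the check passes.
print(naive_check(generated, expected))  # True
```

Because the loop body is never entered, an empty result is indistinguishable from a fully correct one.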

I would recommend either asserting that the two iterables are of equal length before pairing them, or replacing zip() with itertools.zip_longest() so that a missing sentence on either side shows up as a mismatch.
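Both options could look roughly like the sketch below. The function names and the `_MISSING` sentinel are assumptions made for illustration; only `perturbs`, `outputs`, `pred_output` and `output` come from the line quoted above.

```python
from itertools import zip_longest

# Hypothetical sentinel marking a slot that one side did not fill.
_MISSING = object()


def check_equal_length(perturbs, outputs):
    """Option 1: fail fast if the two sequences differ in length."""
    perturbs, outputs = list(perturbs), list(outputs)
    assert len(perturbs) == len(outputs), (
        f"expected {len(outputs)} sentences, got {len(perturbs)}"
    )
    for pred_output, output in zip(perturbs, outputs):
        assert pred_output == output


def check_zip_longest(perturbs, outputs):
    """Option 2: pad the shorter side so every sentence gets compared."""
    for pred_output, output in zip_longest(perturbs, outputs, fillvalue=_MISSING):
        assert pred_output is not _MISSING, f"no generated sentence for {output!r}"
        assert output is not _MISSING, f"unexpected generated sentence {pred_output!r}"
        assert pred_output == output
```

With either variant, an expected output of `[]` becomes a real assertion that the transformation generates nothing. On Python 3.10 and later, `zip(perturbs, outputs, strict=True)` is a third possibility, since it raises `ValueError` when the lengths differ.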

Tests Do Not Check That Expected and Generated Outputs Have the Same Number of Sentences