bigscience-workshop / promptsource

Toolkit for creating, sharing and using natural language prompts.

Examples where >1 targets?

Muennighoff opened this issue · comments

Since you changed the signature of apply() to return a list of targets instead of a string, can you point me to some datasets that actually use multiple targets?
And is randomly picking one the best way to get back to a single string?
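
For concreteness, a minimal sketch of the "randomly pick one" option, assuming apply() on eval-hackathon returns `(input, targets)` with `targets` as a `List[str]`; the function name and the seeding are illustrative, not part of promptsource:

```python
import random
from typing import List

def pick_single_target(targets: List[str], seed: int = 0) -> str:
    """Collapse a multi-reference target list back to one string.

    Illustrative only: seeds the RNG so the sampled reference is
    reproducible across runs.
    """
    rng = random.Random(seed)
    return rng.choice(targets) if targets else ""

# Two equally valid references for the same example.
print(pick_single_target(["The cat sat.", "A cat was sitting."]))
```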

Hey @Muennighoff! Unless I'm misremembering, this is only changed on the eval-hackathon branch. It was a feature requested by the eval team, and maybe @cjlovering or @jordiclive can point to examples. My understanding is that it's motivated by generation datasets that have multiple valid targets. We're not 100% decided that the current API in eval-hackathon will be merged as-is into main. For example, we could keep the current API (but return just the first target when there are multiple) and add another method that returns all targets, as sketched below. Open to suggestions!
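
A rough sketch of what that split API could look like; the class, the method names (`apply` / `apply_all`), and the `_render` helper are all hypothetical, not a committed promptsource design:

```python
from typing import Dict, List, Tuple

class TemplateSketch:
    """Hypothetical shape of the API discussed above, not real promptsource code."""

    def _render(self, example: Dict) -> Tuple[str, List[str]]:
        # Stand-in for the Jinja rendering promptsource actually does.
        raise NotImplementedError

    def apply(self, example: Dict) -> Tuple[str, str]:
        # Backwards-compatible: a single target string, taking the first
        # one when an example carries several.
        prompt, targets = self._render(example)
        return prompt, (targets[0] if targets else "")

    def apply_all(self, example: Dict) -> Tuple[str, List[str]]:
        # New entry point returning every valid target.
        return self._render(example)
```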

@Muennighoff, GEM/web_nlg and GEM/wiki_auto_asset_turk are examples of datasets with multiple references. For instance, the test_asset split of GEM/wiki_auto_asset_turk has 10 references per example.
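
To see the multiple references concretely, here is a sketch using `datasets` and promptsource's `DatasetTemplates`; the `references` field comes from the GEM schema, and the template picked here is arbitrary, so treat this as illustrative:

```python
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# The ASSET test split ships several human references per example.
ds = load_dataset("GEM/wiki_auto_asset_turk", split="test_asset")
print(len(ds[0]["references"]))  # expected: 10

# On the eval-hackathon branch, applying a template for this dataset
# should return the full list of target strings.
templates = DatasetTemplates("GEM/wiki_auto_asset_turk")
template = templates[templates.all_template_names[0]]
prompt, targets = template.apply(ds[0])
print(len(targets))
```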

Yes, the reasoning is that in NLG a single reference is often unreliable, so many test sets are designed with multiple references, and multi-reference metrics should be used on them. Multi-reference metric support (for BLEU, ROUGE, and SARI) was also added to the BigScience EH because of this. Other NLG datasets I know of that are intended as multi-reference test sets are E2E and ToTTo, but I'm not sure whether they were implemented.
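
As one way to consume those references, the Hugging Face `evaluate` library's versions of these metrics already accept one list of references per prediction; a sketch under that assumption, not the BigScience EH code itself:

```python
import evaluate

predictions = ["The cat sat on the mat."]
# One list of references per prediction.
references = [[
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
]]

# SacreBLEU scores each prediction against all of its references at once.
sacrebleu = evaluate.load("sacrebleu")
print(sacrebleu.compute(predictions=predictions, references=references)["score"])

# SARI (used for simplification sets like ASSET) additionally needs the sources.
sari = evaluate.load("sari")
sources = ["The cat sat down on the mat."]
print(sari.compute(sources=sources, predictions=predictions, references=references))
```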

For promptsource main, I would suggest supporting multiple references for these datasets. Choosing a single reference, even at random, makes analysis results incomparable across evaluations.