princeton-nlp / LM-BFF

[ACL 2021] LM-BFF: Better Few-shot Fine-tuning of Language Models https://arxiv.org/abs/2012.15723

Question about prompt-based finetuning and automatic selection of label words

pzzhang opened this issue · comments

In the paper, it says: "Let M: Y → V be a mapping from the task label space to individual words in the vocabulary V of L." Here, is V a set of individual words or of individual sub-words?

I noticed that many auto-generated label words, such as "unforgettable/extraordinary/good/better/terrible" for SST-5 (Table E.1), are quite long and should not be single sub-words (from the point of view of a RoBERTa tokenizer). It therefore seems that each label may correspond to multiple sub-words. In that case, the following sentence is confusing:
"Then for each x_in, let the manipulation x_prompt = T(x_in) be a masked language modeling (MLM) input which contains one [MASK] token."
I'm not sure how one [MASK] token can reconstruct multiple tokens (sub-words), like "unforgettable".
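To make my confusion concrete, here is a minimal sketch of the single-[MASK] setup as I understand it, using Hugging Face's roberta-base; the template and candidate label words below are only illustrative, not taken from the paper:

```python
# Minimal sketch (not the LM-BFF code): one [MASK] yields one distribution
# over the vocabulary, so each label has to map to a single vocabulary token.
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

# Illustrative SST-style template: "<sentence> It was <mask>."
prompt = "A gripping, beautifully shot film. It was " + tokenizer.mask_token + "."
inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos].squeeze(0)  # shape: (|V|,)

# Score candidate single-token label words (the leading space matters for BPE)
for word in [" great", " terrible"]:
    ids = tokenizer(word, add_special_tokens=False).input_ids
    assert len(ids) == 1, f"{word!r} is not a single RoBERTa token"
    print(word, logits[ids[0]].item())
```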

This issue is also related to the automatic selection of label words, to determine whether we are searching over all the sub-words or all the words.

Could the authors clarify this detail?

All of those mentioned label words are indeed words in the Roberta vocabulary: https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json
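You can also verify this directly with the Hugging Face tokenizer; a quick sketch (not code from this repo):

```python
# Check that each label word maps to a single token in the RoBERTa BPE
# vocabulary when preceded by a space (as it would appear inside a template).
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
for word in ["unforgettable", "extraordinary", "good", "better", "terrible"]:
    ids = tokenizer(" " + word, add_special_tokens=False).input_ids
    status = "single token" if len(ids) == 1 else f"{len(ids)} tokens"
    print(f"{word}: {ids} -> {status}")
```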

Thanks @ajfisch for the quick reply! I'm surprised that such a long word is a single token.

Anyway, do you have a way to handle words that may be split into multiple tokens?

@pzzhang That could be an interesting idea, but we didn't investigate it in the paper. For the automatic label search, we explicitly enumerate only single tokens in the vocabulary for each label. I assume that representing a label with multiple tokens could lead to imprecise probability estimates.
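For reference, a rough sketch of what a single-token search can look like: score every vocabulary token by the MLM log-probability it receives at the [MASK] position, summed over a class's few-shot examples. This is only an illustration of the idea, not the exact pruned search used in the paper, and the template and example sentences are made up:

```python
# Rough illustration (not the exact LM-BFF search): rank vocabulary tokens by
# the total MLM log-probability they receive at the [MASK] position across a
# class's few-shot examples.
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

def top_label_words(sentences, template=" It was {mask}.", k=10):
    """Return the k single tokens scoring highest at [MASK] over `sentences`."""
    total = torch.zeros(model.config.vocab_size)
    for sent in sentences:
        prompt = sent + template.format(mask=tokenizer.mask_token)
        inputs = tokenizer(prompt, return_tensors="pt")
        mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0]
        with torch.no_grad():
            logits = model(**inputs).logits[0, mask_pos].squeeze(0)
        total += logits.log_softmax(dim=-1)
    return [tokenizer.decode([int(i)]) for i in total.topk(k).indices]

# Hypothetical positive-class examples from a few-shot split
print(top_label_words(["A gripping, beautifully shot film.",
                       "One of the year's most memorable performances."]))
```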