CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning

Should we be making the vocab index from all paths?

Adamits opened this issue · comments

commented

See here: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/data/datamodules.py#L73

Consider the case where we want to test how an OOV feature degrades the model for some reason. If that feature appears in the provided dev/test set, it will end up in the vocabulary anyway. This does not seem like a big deal, but it is a bit unintuitive to me. On the other hand, a reasonable assumption is that all vocab items/features in dev should already exist in train; if so, why do we need to add dev to the index at all?
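
For concreteness, here is a minimal sketch of the two strategies, assuming whitespace-separated symbols and reserved PAD/UNK indices; the function and file names are illustrative, not yoyodyne's actual API:

```python
def build_index(paths):
    """Builds a symbol-to-ID mapping from every file in `paths`."""
    symbols = set()
    for path in paths:
        with open(path, encoding="utf-8") as source:
            for line in source:
                symbols.update(line.split())
    # IDs 0 and 1 are assumed reserved for PAD and UNK, respectively.
    return {symbol: i + 2 for i, symbol in enumerate(sorted(symbols))}


# Current behavior: dev/test symbols each get their own IDs.
index_all = build_index(["train.tsv", "dev.tsv", "test.tsv"])
# Behavior I'd expect: dev/test OOVs fall through to UNK at lookup time.
index_train_only = build_index(["train.tsv"])
```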

Just wanted to open this to see if there is any justification for that.

This means that you can use the same index-building code whether you're training or predicting, without having to think about it. It also means you can do training and prediction in the same routine without rebuilding the dataclasses.

commented

I am not sure I follow. In what case could you not already use the same index-building code? (By default, I believe our code replaces OOVs with a special UNK token.) That is, wouldn't we always expect the train data to exclusively build the vocabulary used for prediction?

I am not following why you'd want to change the current behavior either.

commented

I think it is reasonable to expect the model to store only the vocabulary it has seen during training. Suppose I have two characters in the dev set that are not in the train set. I would typically expect my model to replace both with the same UNK embedding, but in our case, each of them always gets its own unique vocabulary item initialized.
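
A minimal sketch of the contrast, assuming a reserved UNK index (the names here are hypothetical, not yoyodyne's actual API):

```python
UNK_IDX = 1  # assumed reserved ID for the UNK symbol


def encode(symbols, index):
    """Maps symbols to IDs, collapsing OOVs onto the single UNK ID."""
    return [index.get(symbol, UNK_IDX) for symbol in symbols]


# With a train-only index, both OOV dev characters share the one UNK
# embedding; with an index built over all splits, each of them gets
# its own randomly initialized (and never-updated) embedding row.
```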

This is not really important since we do not train the UNK token by default. It is more that this does not seem like a typical practice.

I can imagine a setting where we auto-UNK symbols below a frequency threshold, but we don't have that yet. (Actually, I think Fairseq has that feature...)
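
A sketch of what frequency-based auto-UNKing might look like, under the same assumptions as above; the default threshold and the function name are made up:

```python
from collections import Counter


def build_index_with_threshold(path, min_freq=2):
    """Indexes only symbols occurring at least `min_freq` times."""
    counts = Counter()
    with open(path, encoding="utf-8") as source:
        for line in source:
            counts.update(line.split())
    # Rare symbols are left out of the index, so at lookup time they
    # fall through to UNK, which means UNK actually gets trained.
    kept = sorted(s for s, c in counts.items() if c >= min_freq)
    return {symbol: i + 2 for i, symbol in enumerate(kept)}
```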

commented

For example, I was trying to use yoyodyne to reproduce a result someone had obtained with a different codebase. I noticed that my vocab_size was different from theirs and was not initially sure why.

Of course, many things make reproduction hard (e.g., the GPU kernel, the order in which things are initialized, whether you query the dataloader before doing any other random operations, etc.).

commented

Gonna close this and open a low-priority feature request about auto-UNKing.

commented

This has been implemented in #163.