CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning

Should we be making the vocab index from all paths?

Adamits opened this issue · comments

commented

See here: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/data/datamodules.py#L73

Consider the case where we want to test how an OOV feature degrades the model for some reason. If that feature appears in the provided dev/test set, it will end up in the vocabulary anyway. This does not seem like a big deal, but it is a bit unintuitive to me. On the other hand, a reasonable assumption is that all vocab items/features in dev should already exist in train; if so, why do we need to add dev to the index at all?
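
For concreteness, here is a minimal sketch of the two strategies, assuming whitespace-separated symbols and reserved PAD/UNK indices; the function and file names are illustrative, not yoyodyne's actual API:

```python
def build_index(paths):
    """Builds a symbol-to-ID mapping from every file in `paths`."""
    symbols = set()
    for path in paths:
        with open(path, encoding="utf-8") as source:
            for line in source:
                symbols.update(line.split())
    # IDs 0 and 1 are assumed reserved for PAD and UNK, respectively.
    return {symbol: i + 2 for i, symbol in enumerate(sorted(symbols))}


# Current behavior: dev/test symbols each get their own IDs.
index_all = build_index(["train.tsv", "dev.tsv", "test.tsv"])
# Behavior I'd expect: dev/test OOVs fall through to UNK at lookup time.
index_train_only = build_index(["train.tsv"])
```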

Just wanted to open this to see if there is any justification for that.

This means that you can use the same index-building code whether you're training or predicting, without having to think about it. It also means you can do training and prediction in the same routine without rebuilding the dataclasses.

commented

I am not sure I follow. In what case could you not already use the same index-building code? (By default, I believe our code replaces OOVs with a special UNK token.) That is, wouldn't we always expect the train data to exclusively build the vocabulary used for prediction?

I am not following why you'd want to change the current behavior either.

commented

I think it is reasonable to expect the model to store only the vocabulary it has seen during training. Suppose I have two characters in the dev set that are not in the train set. I would typically expect my model to replace both with the same UNK embedding, but in our case, each of them always gets its own unique vocabulary item initialized.
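
A minimal sketch of the contrast, assuming a reserved UNK index (the names here are hypothetical, not yoyodyne's actual API):

```python
UNK_IDX = 1  # assumed reserved ID for the UNK symbol


def encode(symbols, index):
    """Maps symbols to IDs, collapsing OOVs onto the single UNK ID."""
    return [index.get(symbol, UNK_IDX) for symbol in symbols]


# With a train-only index, both OOV dev characters share the one UNK
# embedding; with an index built over all splits, each of them gets
# its own randomly initialized (and never-updated) embedding row.
```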

This is not really important since we do not train the UNK token by default. It is more that this does not seem like a typical practice.

I can imagine a setting where we auto-UNK symbols below a frequency threshold, but we don't have that yet. (Actually, I think Fairseq has that feature...)
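
A sketch of what frequency-based auto-UNKing might look like, under the same assumptions as above; the default threshold and the function name are made up:

```python
from collections import Counter


def build_index_with_threshold(path, min_freq=2):
    """Indexes only symbols occurring at least `min_freq` times."""
    counts = Counter()
    with open(path, encoding="utf-8") as source:
        for line in source:
            counts.update(line.split())
    # Rare symbols are left out of the index, so at lookup time they
    # fall through to UNK, which means UNK actually gets trained.
    kept = sorted(s for s, c in counts.items() if c >= min_freq)
    return {symbol: i + 2 for i, symbol in enumerate(kept)}
```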

commented

For example, I was trying to use yoyodyne to reproduce a result someone had obtained with a different codebase. I noticed that my vocab_size was different from theirs and was not initially sure why.

Of course, many things make reproduction hard (e.g., the GPU kernel, the order in which things are initialized, whether you query the dataloader before doing any other random operations, etc.).

commented

Gonna close this and open a low-priority feature request about auto-UNKing.

commented

This has been implemented in #163.