Resize embeddings so they are divisible by 64
acforvs opened this issue · comments
Vlad commented
Hi, thanks for open sourcing the project!
Currently, the embedding size for StarCoder is 49152, but after one token is added it grows to 49153, which is odd and therefore cannot be sharded evenly across any conventional number of GPUs (such as 4 or 8).
I wonder whether it would be correct to add 7, 15, or 63 filler tokens (e.g. <filler_token_i>, as done here: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/src/train_wizardcoder.py#L194) so that the vocabulary size becomes divisible again and the model can be sharded.
Do you have any suggestions about whether this seems reasonable? Thanks!
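The arithmetic behind the 7/15/63 choice can be sketched in plain Python (a minimal illustration; the helper names are hypothetical and not from the linked script):

```python
def filler_tokens_needed(vocab_size: int, multiple: int = 64) -> int:
    """Number of filler tokens needed to round vocab_size up to a multiple."""
    return (-vocab_size) % multiple

def make_filler_tokens(vocab_size: int, multiple: int = 64) -> list[str]:
    """Generate placeholder token strings, mirroring the <filler_token_i> idea."""
    n = filler_tokens_needed(vocab_size, multiple)
    return [f"<filler_token_{i}>" for i in range(n)]

# StarCoder's vocab after adding one special token is 49153,
# so 63 fillers bring it to 49216 = 64 * 769.
print(filler_tokens_needed(49153))  # 63
```

After extending the tokenizer with these tokens, the model's embedding matrix would be resized accordingly (e.g. via `model.resize_token_embeddings(len(tokenizer))` in Hugging Face Transformers); recent Transformers versions also accept a `pad_to_multiple_of` argument there, which achieves the same effect without explicit filler tokens.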
ChiYeung Law commented
I think this is reasonable.