Token out of vocabulary at train_gpt2.cu:675
aidando73 opened this issue
I'm trying to follow #481, but I'm getting this error:

```
evaluating HellaSwag: 30/79
evaluating HellaSwag: 40/79
evaluating HellaSwag: 50/79
evaluating HellaSwag: 60/79
evaluating HellaSwag: 70/79
Writing state to log124M/state_00019560_00002.bin
Error: Token out of vocabulary at train_gpt2.cu:675
Error details:
File: train_gpt2.cu
Line: 675
Token: -1149026846
Position: 0
Vocab: 50257
generating:
---
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[20376,1],0]
Exit code: 1
--------------------------------------------------------------------------
```
This happens at the end of training, so I don't end up getting the final model weights.
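For context, the check that fires at train_gpt2.cu:675 is presumably a vocabulary bounds guard of roughly this shape (a minimal sketch reconstructed from the error output above; `check_token` is my name for it, not necessarily the actual llm.c function):

```c
// Hypothetical sketch of the kind of bounds guard that produces this error:
// every token id must lie in [0, vocab_size) before it is embedded or decoded.
// A value like -1149026846 looks like uninitialized memory, not a token id.
#include <stdio.h>
#include <stdlib.h>

static void check_token(int token, int pos, int vocab_size,
                        const char *file, int line) {
    if (token < 0 || token >= vocab_size) {
        fprintf(stderr, "Error: Token out of vocabulary at %s:%d\n", file, line);
        fprintf(stderr, "Token: %d\nPosition: %d\nVocab: %d\n",
                token, pos, vocab_size);
        exit(EXIT_FAILURE);
    }
}
```

Note that the failing token is at Position 0, i.e. the very first token fed into generation, which suggests the problem is in how generation is seeded rather than in the training data itself.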
Running:

```bash
nice nohup bash -c 'echo "start $(date)" && mpirun -np 8 ./train_gpt2cu \
-i "dev/data/fineweb10B/fineweb_train_*.bin" \
-j "dev/data/fineweb10B/fineweb_val_*.bin" \
-o log124M \
-e "d12" \
-b 64 -t 1024 \
-d 524288 \
-r 1 \
-z 1 \
-c 0.1 \
-l 0.0006 \
-q 0.0 \
-u 700 \
-n 5000 \
-y 1 \
-v 250 -s 20000 \
-h 1 && echo "end $(date)"' &
```

You can find the 1500 model checkpoint + state here:
https://huggingface.co/aidando73/repro-gpt-2-124M/tree/086c8895ae49f2472bcde14c7866e792b0a330f1/8x_A100_40GB/log124M
Commit hash I checked out: 7ecd890
Note that I didn't run python train_gpt2.py beforehand.
Anyone else getting this error?
When I was using train_gpt2.cu for inference, I ran into the same issue, but if I ran python train_gpt2.py beforehand, the issue went away.
My hypothesis is that -1149026846 is the end-of-text token, which isn't being set correctly in the case where python train_gpt2.py hasn't been run.
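If that's right, the garbage value would come from seeding generation with an end-of-text id that was never loaded from the tokenizer file that python train_gpt2.py writes. A hedged sketch of a guard that would fail fast instead (the struct and field names here are assumptions for illustration, not necessarily the real llm.c layout):

```c
// Sketch of the hypothesis: generation seeds position 0 with an end-of-text
// token read from a tokenizer file. If that file was never produced, the
// field holds garbage and the bounds guard above fires at Position 0.
#include <assert.h>

typedef struct {
    int init_ok;    // nonzero once the tokenizer file loaded successfully
    int eot_token;  // end-of-text id (50256 for GPT-2)
} Tokenizer;        // assumed layout, for illustration only

static int safe_eot(const Tokenizer *t, int vocab_size) {
    // Fail fast with a clear message instead of sampling from garbage.
    assert(t->init_ok && "tokenizer not loaded; did you run python train_gpt2.py?");
    assert(t->eot_token >= 0 && t->eot_token < vocab_size);
    return t->eot_token;
}
```

If the hypothesis holds, failing at tokenizer load time, rather than 19560 steps in at the final generation, would at least save the run.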