Getting "Floating point exception (core dumped)" Error
alvins82 opened this issue
No idea what is going on, but maybe try compiling without cuDNN: `make train_gpt2cu USE_CUDNN=0`. Probably not the cause, but just to check. Also run the tests and see if they pass:

```shell
# fp32 test (cudnn not supported)
make test_gpt2cu PRECISION=FP32 && ./test_gpt2cu
# mixed precision cudnn test
make test_gpt2cu USE_CUDNN=1 && ./test_gpt2cu
```
Both tests pass. I also posted a screenshot of my torch versions above.

Eyeballing your command line, I'd say your batch size is too small and is causing an exception in the HellaSwag eval. This is a known issue, and a patch merged into master now forces a batch size >= 4.


