Getting "Floating point exception (core dumped)" Error
alvins82 opened this issue
No idea what is going on, but maybe try compiling without cuDNN: `make train_gpt2cu USE_CUDNN=0`. Probably not the cause, but just to check. Also run the tests and see if they pass:

```shell
# fp32 test (cudnn not supported)
make test_gpt2cu PRECISION=FP32 && ./test_gpt2cu
# mixed precision cudnn test
make test_gpt2cu USE_CUDNN=1 && ./test_gpt2cu
```
Both tests pass. I also posted a screenshot of my torch versions above.

Eyeballing your command line, I'd say your batch size is too small and is causing an exception in the HellaSwag eval. This is a known issue, and a patch merged into master now forces a batch size >= 4.


