karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

inf loss at big batch

karpathy opened this issue

Just creating a todo. Large batch sizes work now that the size_t bug is fixed:

./train_gpt2cu -b 36 -v 200 -s 200 -i data/TinyStories

works, but a batch size of 48, which should also fit in memory, does not:

./train_gpt2cu -b 48 -v 200 -s 200 -i data/TinyStories

The val loss comes out as -nan and the train loss stays at inf.
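A non-finite loss like this is cheap to trap at the step where it first appears, instead of letting -nan/inf propagate silently through later iterations. A minimal sketch of such a guard (the helper name `check_loss_finite` is hypothetical, not code from the repo):

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative guard, not code from the repo: abort at the first
// non-finite loss so the offending step is visible.
static void check_loss_finite(float loss, int step) {
    if (!isfinite(loss)) {  // catches inf, -inf, and nan
        fprintf(stderr, "step %d: non-finite loss %f, aborting\n", step, loss);
        exit(EXIT_FAILURE);
    }
}
```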

TODO: track down why this happens and how to prevent it.
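For reference, the class of failure the earlier size_t fix addressed: with 32-bit int arithmetic, a buffer size or index like B*T*V crosses INT_MAX somewhere between b=36 and b=48 at GPT-2 dimensions. A minimal sketch, assuming illustrative GPT-2 values (T=1024, V=50257); this is not the repo's actual indexing code:

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    // Illustrative dimensions: sequence length T, vocab size V.
    int T = 1024, V = 50257;
    for (int B = 36; B <= 48; B += 12) {
        long long elems = (long long)B * T * V;  // widen BEFORE multiplying
        // A 32-bit expression B*T*V would overflow once elems > INT_MAX,
        // e.g. when sizing a logits buffer of B*T*V floats.
        printf("B=%d: B*T*V = %lld (%s INT_MAX=%d)\n",
               B, elems, elems > INT_MAX ? "exceeds" : "fits under", INT_MAX);
    }
    return 0;
}
```

With these numbers, B=36 gives ~1.85e9 elements (fits in a signed 32-bit int) while B=48 gives ~2.47e9 (overflows), which matches the observed boundary between the working and failing batch sizes, though that is an assumption here, not a confirmed diagnosis.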

commented

@karpathy just wanted to check, we've fixed this, right?