karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

inf loss at big batch

karpathy opened this issue

Just creating a todo. Large batch sizes work now that the size_t bug is fixed:

./train_gpt2cu -b 36 -v 200 -s 200 -i data/TinyStories

works, but a batch size of 48, which should also fit in memory, does not:

./train_gpt2cu -b 48 -v 200 -s 200 -i data/TinyStories

The val loss comes out as -nan and the train loss stays at inf.
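A non-finite loss like this is cheap to trap at the step where it first appears, instead of letting -nan/inf propagate silently through later iterations. A minimal sketch of such a guard (the helper name `check_loss_finite` is hypothetical, not code from the repo):

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative guard, not code from the repo: abort at the first
// non-finite loss so the offending step is visible.
static void check_loss_finite(float loss, int step) {
    if (!isfinite(loss)) {  // catches inf, -inf, and nan
        fprintf(stderr, "step %d: non-finite loss %f, aborting\n", step, loss);
        exit(EXIT_FAILURE);
    }
}
```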

TODO: track down why this happens and how to prevent it.
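For reference, the class of failure the earlier size_t fix addressed: with 32-bit int arithmetic, a buffer size or index like B*T*V crosses INT_MAX somewhere between b=36 and b=48 at GPT-2 dimensions. A minimal sketch, assuming illustrative GPT-2 values (T=1024, V=50257); this is not the repo's actual indexing code:

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    // Illustrative dimensions: sequence length T, vocab size V.
    int T = 1024, V = 50257;
    for (int B = 36; B <= 48; B += 12) {
        long long elems = (long long)B * T * V;  // widen BEFORE multiplying
        // A 32-bit expression B*T*V would overflow once elems > INT_MAX,
        // e.g. when sizing a logits buffer of B*T*V floats.
        printf("B=%d: B*T*V = %lld (%s INT_MAX=%d)\n",
               B, elems, elems > INT_MAX ? "exceeds" : "fits under", INT_MAX);
    }
    return 0;
}
```

With these numbers, B=36 gives ~1.85e9 elements (fits in a signed 32-bit int) while B=48 gives ~2.47e9 (overflows), which matches the observed boundary between the working and failing batch sizes, though that is an assumption here, not a confirmed diagnosis.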

commented

@karpathy just wanted to check, we've fixed this, right?