karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from Github https://github.com/karpathy/llm.c

test_gpt2.cu correctness bounds tune per-parameter

karpathy opened this issue · comments

adding a todo

this is me being a bit paranoid, but in test_gpt2.cu we check that our code agrees with the pytorch reference using a single global threshold of 1e-2 for all comparisons. we could instead compare the gradients on the parameters parameter by parameter, and tune the threshold per-parameter to be as low as we can make it, maybe eyeballing a ~10% buffer on top of the observed max difference. otherwise my concern is that one global 1e-2 could be too large for some of these parameter gradients in absolute terms, and we could be making silent errors with new kernels. when a new kernel "trips the wire", we should manually and carefully inspect that things are actually ok despite tripping the check, and only then is it okay to increase the bound.

the code for checking all parameters is already there, but commented out.

would welcome a PR that digs into this on per-parameter basis and looks at what thresholds we can get away with in this comparison.

Still a beginner at writing kernels, but I'd love to work on this issue and dig deep into experimenting with the weights this summer.