karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from Github https://github.com/karpathy/llm.c

test_gpt2.cu correctness bounds tune per-parameter

karpathy opened this issue · comments

adding a todo

this is me being a bit paranoid, but in test_gpt2.cu we check that our code agrees with the pytorch reference using a single global threshold of 1e-2 for all comparisons. we could instead compare the gradients on the parameters parameter by parameter, and tune the threshold per-parameter to be as low as we can make it, maybe eyeballing a ~10% buffer on top of the observed max difference. otherwise my concern is that one global 1e-2 could be too large for some of these parameter gradients in absolute terms, and we could be making silent errors with new kernels. when a new kernel "trips the wire", we should manually and carefully inspect that things are actually ok despite tripping the check, and only then is it okay to increase the bound.

the code for checking all parameters is already there, but commented out.

would welcome a PR that digs into this on per-parameter basis and looks at what thresholds we can get away with in this comparison.

Still a beginner at writing kernels, but I'd love to work on this issue and dig deep into experimenting with the weights this summer.