cuDNN error in cudnn_att.cpp when running train_gpt2cu
maderix opened this issue
Tried to replicate the GPT-2 training described here on a single RTX 4090 running under WSL2:
#481
Training command:
./train_gpt2cu -i "dev/data/fineweb10B/fineweb_train_*.bin" -j "dev/data/fineweb10B/fineweb_val_*.bin" -o log124M -e "d12" -b 1 -t 1024 -d 32768 -r 1 -z 1 -c 0.1 -l 0.0006 -q 0.0 -u 700 -n 5000 -v 250 -s 20000 -h 0 -o logs
With cuDNN enabled, I get the error below:

HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run python dev/data/hellaswag.py to export and use it with -h 1.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=1 * seq_len T=1024 * num_processes=1 and total_batch_size=32768 => setting grad_accum_steps=32
allocating 359 MiB for activations
[CUDNN ERROR] at file cudnn_att.cpp:141: [cudnn_frontend] Error: No execution plans built successfully.
Without cuDNN enabled it runs fine, albeit with much lower MFU. Is there a way to get more verbose logs from cuDNN showing where the failure happened?
There is an environment variable that can be set:
https://docs.nvidia.com/deeplearning/cudnn/latest/reference/troubleshooting.html
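For example, a minimal sketch of how you could enable the logging described on that page before rerunning the failing command (the variable names differ between cuDNN 8.x and 9.x, and the cudnn_frontend switches come from the cudnn-frontend library, so double-check them against your installed versions):

# cuDNN 9.x style logging (assumed from the linked troubleshooting page)
export CUDNN_LOGLEVEL_DBG=3          # 0=off, 1=errors, 2=+warnings, 3=+info
export CUDNN_LOGDEST_DBG=cudnn.log   # or stdout / stderr

# cuDNN 8.x style equivalents:
# export CUDNN_LOGERR_DBG=1
# export CUDNN_LOGWARN_DBG=1
# export CUDNN_LOGINFO_DBG=1
# export CUDNN_LOGDEST_DBG=stdout

# The error is raised by cudnn_frontend, which has its own logging switches:
export CUDNN_FRONTEND_LOG_INFO=1
export CUDNN_FRONTEND_LOG_FILE=stdout

./train_gpt2cu ...   # rerun the same training command as above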
also, this might be the same problem as #366
Hey @maderix !
Couple of notes:
- "HellaSwag eval not found" is not really an error; it just means the script couldn't find the binary file that contains the HellaSwag eval data. To fix this, run the Python script dev/data/hellaswag.py (see the sketch after this list). It will download the dataset, tokenize it, and save it as the bin file. That'll make that "error"/warning go away.
- Hopefully you're on the latest head commit, since the MFU logic recently changed. We used to always display MFU relative to the A100's bf16 peak (312 TFLOPS), which meant it wasn't meaningful unless you were actually running on an A100. That should be fixed now; you can find the supported GPUs here: https://github.com/karpathy/llm.c/blob/master/llmc/mfu.h#L39
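A sketch of the HellaSwag fix from the first bullet: export the eval data, then rerun training with -h 1 (the "..." stands for the rest of your original flags):

python dev/data/hellaswag.py        # downloads, tokenizes, and writes hellaswag_val.bin
./train_gpt2cu ... -h 1             # same command as before, with -h 1 to enable the HellaSwag eval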
If this solves your issue, please close it; if not, I'm here to help! :)
hi @gordicaleksa @ngc92
Thanks for your help. On the latest tip I can get the training going at a much better MFU:

I think 70% is pretty good, considering that the GPU also has to do other work besides bf16 matrix multiplications.
@maderix Hi, how did you solve the problem "No execution plans built successfully"?