karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

cuDNN error in cudnn_att.cpp when running train_gpt2cu

maderix opened this issue · comments

Tried to replicate the GPT-2 training described in #481 on a single RTX 4090 running under WSL2.

Training command:
./train_gpt2cu -i "dev/data/fineweb10B/fineweb_train_*.bin" -j "dev/data/fineweb10B/fineweb_val_*.bin" -o log124M -e "d12" -b 1 -t 1024 -d 32768 -r 1 -z 1 -c 0.1 -l 0.0006 -q 0.0 -u 700 -n 5000 -v 250 -s 20000 -h 0 -o logs
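(For context: with micro-batch -b 1, sequence length -t 1024, and total batch size -d 32768 on a single process, gradient accumulation works out to 32768 / (1 × 1024) = 32 steps, which the startup log below confirms.)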

With cuDNN enabled, I'm getting the error below:

HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run `python dev/data/hellaswag.py` to export and use it with `-h 1`.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=1 * seq_len T=1024 * num_processes=1 and total_batch_size=32768 => setting grad_accum_steps=32
allocating 359 MiB for activations
[CUDNN ERROR] at file cudnn_att.cpp:141:
[cudnn_frontend] Error: No execution plans built successfully.

Without cuDNN enabled, it runs fine, albeit with much lower MFU. Is there a way to get more verbose logs from cuDNN to see where the failure happened?

commented

There is an environment variable that can be set:
https://docs.nvidia.com/deeplearning/cudnn/latest/reference/troubleshooting.html
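For example (an assumption based on cuDNN's documented logging variables, not something verified on this exact setup), prefixing the run with the logging variables should dump the failure details to stderr:

CUDNN_LOGLEVEL_DBG=3 CUDNN_LOGDEST_DBG=stderr ./train_gpt2cu ...

Since the message comes from the cudnn_frontend library, its own logging variables (CUDNN_FRONTEND_LOG_INFO=1 together with CUDNN_FRONTEND_LOG_FILE=stderr) may also surface which execution plans were attempted and why they failed.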

Also, this might be the same problem as #366.

Hey @maderix!

Couple of notes:

  1. "HellaSwag eval not found" is not really an error; it just means the script couldn't find the binary file that contains the HellaSwag eval data. To fix this, run `python dev/data/hellaswag.py`. It will download the dataset, tokenize it, and save it as the .bin file, which will make that "error"/warning go away.

  2. Hopefully you're on the latest head commit, since the MFU logic recently changed. We used to always display MFU relative to A100's bf16 peak (312 TFLOPS), which meant the number wasn't meaningful unless you were actually running on an A100. That should be fixed now; you can find the supported GPUs here: https://github.com/karpathy/llm.c/blob/master/llmc/mfu.h#L39 (see also the sketch below).
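For intuition, here is a minimal sketch of how an MFU number like this comes together, using the common ~6·N FLOPs-per-token training estimate (which ignores the attention term). The throughput and peak-FLOPS figures below are hypothetical placeholders, not measurements from this run:

```c
// Minimal sketch of an MFU (Model FLOPs Utilization) calculation,
// assuming the common flops_per_token ~ 6*N training approximation.
// tokens_per_sec and peak_flops are hypothetical placeholder values;
// the key point is that peak_flops must match the GPU actually in use.
#include <stdio.h>

int main(void) {
    double num_params      = 124475904.0; // GPT-2 124M, from the log above
    double tokens_per_sec  = 150000.0;    // hypothetical measured throughput
    double peak_flops      = 165e12;      // hypothetical bf16 peak for this GPU
    double flops_per_token = 6.0 * num_params; // forward + backward estimate
    double mfu = (flops_per_token * tokens_per_sec) / peak_flops;
    printf("MFU: %.1f%%\n", mfu * 100.0);
    return 0;
}
```

The point of the mfu.h change is exactly the peak_flops line: dividing by a hard-coded A100 figure instead of your own GPU's peak gives a misleading percentage.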

If this solves your issue, please close it; if not, I'm here to help! :)

Hi @gordicaleksa @ngc92,
Thanks for your help. On the latest tip I can get the training going at a much better MFU:
[screenshot: training log showing ~70% MFU]

commented

I think 70% is pretty good, considering that the GPU also has to do other stuff besides bf16 matrix multiplications.

@maderix Hi, how did you solve the problem "No execution plans built successfully"?