karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

cuDNN error in cudnn_att.cpp when running train_gpt2cu

maderix opened this issue · comments

Tried to replicate the GPT-2 training described in #481 on a single RTX 4090 running under WSL2.

Training command:
./train_gpt2cu -i "dev/data/fineweb10B/fineweb_train_*.bin" -j "dev/data/fineweb10B/fineweb_val_*.bin" -o log124M -e "d12" -b 1 -t 1024 -d 32768 -r 1 -z 1 -c 0.1 -l 0.0006 -q 0.0 -u 700 -n 5000 -v 250 -s 20000 -h 0 -o logs
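(For context: with micro-batch -b 1, sequence length -t 1024, and total batch size -d 32768 on a single process, gradient accumulation works out to 32768 / (1 × 1024) = 32 steps, which the startup log below confirms.)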

With cuDNN enabled, I'm getting the error below:

HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run `python dev/data/hellaswag.py` to export and use it with `-h 1`.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=1 * seq_len T=1024 * num_processes=1 and total_batch_size=32768 => setting grad_accum_steps=32
allocating 359 MiB for activations
[CUDNN ERROR] at file cudnn_att.cpp:141:
[cudnn_frontend] Error: No execution plans built successfully.

Without cuDNN enabled, it runs fine, albeit with much lower MFU. Is there a way to get more verbose logs from cuDNN to see where the failure happened?

commented

There is an environment variable that can be set:
https://docs.nvidia.com/deeplearning/cudnn/latest/reference/troubleshooting.html
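For example (an assumption based on cuDNN's documented logging variables, not something verified on this exact setup), prefixing the run with the logging variables should dump the failure details to stderr:

CUDNN_LOGLEVEL_DBG=3 CUDNN_LOGDEST_DBG=stderr ./train_gpt2cu ...

Since the message comes from the cudnn_frontend library, its own logging variables (CUDNN_FRONTEND_LOG_INFO=1 together with CUDNN_FRONTEND_LOG_FILE=stderr) may also surface which execution plans were attempted and why they failed.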

Also, this might be the same problem as #366.

Hey @maderix!

Couple of notes:

  1. "HellaSwag eval not found" is not really an error; it just means the script couldn't find the binary file that contains the HellaSwag eval data. To fix this, run `python dev/data/hellaswag.py`. It will download the dataset, tokenize it, and save it as the .bin file, which will make that "error"/warning go away.

  2. Hopefully you're on the latest head commit, since the MFU logic recently changed. We used to always display MFU relative to A100's bf16 peak (312 TFLOPS), which meant the number wasn't meaningful unless you were actually running on an A100. That should be fixed now; you can find the supported GPUs here: https://github.com/karpathy/llm.c/blob/master/llmc/mfu.h#L39 (see also the sketch below).
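For intuition, here is a minimal sketch of how an MFU number like this comes together, using the common ~6·N FLOPs-per-token training estimate (which ignores the attention term). The throughput and peak-FLOPS figures below are hypothetical placeholders, not measurements from this run:

```c
// Minimal sketch of an MFU (Model FLOPs Utilization) calculation,
// assuming the common flops_per_token ~ 6*N training approximation.
// tokens_per_sec and peak_flops are hypothetical placeholder values;
// the key point is that peak_flops must match the GPU actually in use.
#include <stdio.h>

int main(void) {
    double num_params      = 124475904.0; // GPT-2 124M, from the log above
    double tokens_per_sec  = 150000.0;    // hypothetical measured throughput
    double peak_flops      = 165e12;      // hypothetical bf16 peak for this GPU
    double flops_per_token = 6.0 * num_params; // forward + backward estimate
    double mfu = (flops_per_token * tokens_per_sec) / peak_flops;
    printf("MFU: %.1f%%\n", mfu * 100.0);
    return 0;
}
```

The point of the mfu.h change is exactly the peak_flops line: dividing by a hard-coded A100 figure instead of your own GPU's peak gives a misleading percentage.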

If this solves your issue, please close it; if not, I'm here to help! :)

Hi @gordicaleksa @ngc92,
Thanks for your help. On the latest tip I can get the training going at a much better MFU:
[screenshot: training log showing ~70% MFU]

commented

I think 70% is pretty good, considering that the GPU also has to do other stuff besides bf16 matrix multiplications.

@maderix Hi, how did you solve the problem "No execution plans built successfully"?