karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

[cudnn_frontend] Error: No execution plans support the graph.

necktwi opened this issue

necktwi@CheapFellow:~/workspace/llm.c$ make train_gpt2cu USE_CUDNN=1 CUDNN_FRONTEND_PATH="/home/necktwi/workspace/cudnn-frontend/include"

necktwi@CheapFellow:~/workspace/llm.c$ ./train_gpt2cu 
Multi-GPU support is disabled. Using a single GPU.
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 4096                                               |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA GeForce RTX 2060                            |
| peak TFlops           | -1.0                                               |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| weight init method    | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled                                              |
| num_processes         | 1                                                  |
| zero_stage            | 0                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=1 and total_batch_size=4096
=> setting grad_accum_steps=1
allocating 237 MiB for parameter gradients
allocating 1326 MiB for activations
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
device memory usage: 3652 MiB / 5740 MiB
memory per sequence: 331 MiB
 -> estimated maximum batch size: 10
[CUDNN ERROR] at file llmc/cudnn_att.cpp:120:
[cudnn_frontend] Error: No execution plans support the graph.
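
For context, the "[CUDNN ERROR] at file llmc/cudnn_att.cpp:120:" line comes from the error-check wrapper llm.c puts around its cudnn-frontend calls, and the second line is the message cudnn-frontend hands back when the support check finds no cuDNN engine that can execute the requested fused-attention graph on the current GPU. Below is a minimal sketch of that kind of wrapper, assuming the cudnn-frontend v1.x error_object API; names are approximations, not copied from the repository.

// Sketch only: an error-check wrapper in the style that produces the
// "[CUDNN ERROR] at file <file>:<line>:" output above. Assumes cudnn-frontend
// v1.x (error_object with is_good()/get_message()); names are approximate.
#include <cstdio>
#include <cstdlib>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

static void check_cudnn_fe(const fe::error_object& e, const char* file, int line) {
    if (!e.is_good()) {
        printf("[CUDNN ERROR] at file %s:%d:\n%s\n", file, line, e.get_message().c_str());
        exit(EXIT_FAILURE);
    }
}
#define CHECK_CUDNN_FE(err) check_cudnn_fe((err), __FILE__, __LINE__)

int main() {
    fe::graph::Graph graph;            // empty graph, just to exercise the checker
    CHECK_CUDNN_FE(graph.validate());  // in llm.c the checked calls around the SDPA
                                       // graph include validate / build_operation_graph /
                                       // check_support / build_plans
    printf("graph validated\n");
    return 0;
}

"No execution plans support the graph." is what the support/plan-building step returns when the backend has no engine for this graph on the installed GPU, which points at a runtime capability or configuration issue rather than a problem with how the binary was built.
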
tesla@tesla ~/e/llm.c (master) $ make train_gpt2cu USE_CUDNN=1
---------------------------------------------
✓ cuDNN found, will run with flash-attention
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -c --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DENABLE_CUDNN -DMULTI_GPU -DUSE_MPI -DENABLE_BF16 llmc/cudnn_att.cpp -I/home/tesla/cudnn-frontend/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -o build/cudnn_att.o
/usr/local/cuda/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DENABLE_CUDNN -DMULTI_GPU -DUSE_MPI -DENABLE_BF16 train_gpt2.cu build/cudnn_att.o -lcublas -lcublasLt -lnvidia-ml -lcudnn -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/home/tesla/cudnn-frontend/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -lnccl -lmpi -o train_gpt2cu

tesla@tesla ~/e/llm.c (master) $ ./train_gpt2cu
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 4096                                               |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA GeForce RTX 2060                            |
| peak TFlops           | -1.0                                               |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| weight init method    | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled                                              |
| num_processes         | 1                                                  |
| zero_stage            | 0                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=1 and total_batch_size=4096
=> setting grad_accum_steps=1
allocating 237 MiB for parameter gradients
allocating 1326 MiB for activations
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
device memory usage: 3566 MiB / 5919 MiB
memory per sequence: 331 MiB
 -> estimated maximum batch size: 11
[CUDNN ERROR] at file llmc/cudnn_att.cpp:120:
[cudnn_frontend] Error: No execution plans support the graph.

./train_gpt2cu  # gives the error above

The Python script, however, runs:

tesla@tesla ~/e/llm.c (master) $ python3 train_gpt2.py
/home/tesla/exp/llm.c/train_gpt2.py:34: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
Running pytorch 2.4.0+cu121
using device: cuda
total desired batch size: 256
=> calculated gradient accumulation steps: 1
wrote gpt2_tokenizer.bin
loading weights from pretrained gpt: gpt2
DataLoader: total number of tokens: 32,768 across 1 files
padded vocab size from 50257 to 50304
wrote gpt2_124M.bin
padded vocab size from 50257 to 50304
wrote gpt2_124M_bf16.bin
padded vocab size in reference grads from 50257 to 50304
wrote gpt2_124M_debug_state.bin
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
using regular AdamW
step    1/10 | train loss 5.270009 | norm 30.5000 | lr 1.00e-04 | (135.69 ms | 1887 tok/s)
step    2/10 | train loss 4.060703 | norm 17.0772 | lr 1.00e-04 | (104.22 ms | 2456 tok/s)
step    3/10 | train loss 3.320115 | norm 14.7840 | lr 1.00e-04 | (96.73 ms | 2647 tok/s)
step    4/10 | train loss 2.717573 | norm 13.1957 | lr 1.00e-04 | (100.47 ms | 2548 tok/s)
step    5/10 | train loss 2.181084 | norm 12.3892 | lr 1.00e-04 | (103.23 ms | 2480 tok/s)
step    6/10 | train loss 1.653934 | norm 10.6317 | lr 1.00e-04 | (97.49 ms | 2626 tok/s)
step    7/10 | train loss 1.168067 | norm 9.7828 | lr 1.00e-04 | (98.37 ms | 2602 tok/s)
step    8/10 | train loss 0.736853 | norm 8.1185 | lr 1.00e-04 | (100.69 ms | 2543 tok/s)
step    9/10 | train loss 0.400987 | norm 6.2682 | lr 1.00e-04 | (104.10 ms | 2459 tok/s)
step   10/10 | train loss 0.187464 | norm 3.6643 | lr 1.00e-04 | (97.34 ms | 2630 tok/s)
final 9 iters avg: 100.293ms
peak memory consumption: 2320 MiB

Is the error due to cudnn-frontend, or something else?
The Python script is able to run the test, while ./train_gpt2cu gives the error.
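
One relevant difference (a guess at the mechanism, not a confirmed diagnosis): train_gpt2.py never touches cudnn-frontend, while train_gpt2cu built with USE_CUDNN=1 is compiled with -DENABLE_CUDNN and routes attention through cuDNN's fused SDPA graph. The compile-time switch looks roughly like the sketch below; the function names are simplified stand-ins, not verbatim from train_gpt2.cu.

/* Illustration only: the cuDNN attention path is opt-in at compile time.
 * These functions are stubs; the real code dispatches between its cuDNN
 * attention and its own CUDA kernels under the same -DENABLE_CUDNN guard. */
#include <stdio.h>

static void attention_forward_cudnn(void) { printf("fused SDPA via cudnn-frontend\n"); }
static void attention_forward(void)       { printf("llm.c's own CUDA attention kernels\n"); }

int main(void) {
#ifdef ENABLE_CUDNN
    attention_forward_cudnn();  /* only compiled in with make ... USE_CUDNN=1 */
#else
    attention_forward();        /* default build: no cudnn-frontend dependency */
#endif
    return 0;
}

So the successful PyTorch run says nothing about whether cuDNN's fused attention works on this GPU; only the USE_CUDNN=1 binary exercises that path.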

I'm hitting this error too, have you solved it? It looks like an error coming from cudnn-frontend.

Does anyone have an idea how to address this? The same problem occurs on Ubuntu with CUDA 12.6 and cuDNN.
(Without cuDNN it is extremely slow.)

cuDNN SDPA doesn't support Turing GPUs.
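
For anyone else hitting this: the RTX 2060 in the logs above is a Turing part (compute capability 7.5), which is why the backend reports that no execution plan supports the graph; the practical workaround is to build without USE_CUDNN=1 and use llm.c's own attention kernels. A quick way to check what your GPU reports is the minimal standalone sketch below against the CUDA runtime API; the file name is made up, and the "major < 8" cut-off treats everything pre-Ampere as unsupported, which is an assumption beyond what this thread confirms about Turing.

// check_gpu.cu (hypothetical helper, not part of llm.c): print the compute
// capability to tell whether the GPU is Turing (sm_75) like the RTX 2060 above.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    if (prop.major < 8) {
        // Pre-Ampere (e.g. Turing sm_75): cuDNN's fused SDPA path is not
        // available here, so build llm.c without USE_CUDNN=1.
        printf("pre-Ampere GPU: rebuild with plain 'make train_gpt2cu'\n");
    }
    return 0;
}

Compile it with nvcc (for example: nvcc check_gpu.cu -o check_gpu) and run it on the machine that fails.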

Dear @mnicely, thanks a lot, this clarifies it!