[cudnn_frontend] Error: No execution plans support the graph.
necktwi opened this issue
necktwi@CheapFellow:~/workspace/llm.c$ make train_gpt2cu USE_CUDNN=1 CUDNN_FRONTEND_PATH="/home/necktwi/workspace/cudnn-frontend/include"
necktwi@CheapFellow:~/workspace/llm.c$ ./train_gpt2cu
Multi-GPU support is disabled. Using a single GPU.
+-----------------------+----------------------------------------------------+
| Parameter | Value |
+-----------------------+----------------------------------------------------+
| train data pattern | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern | dev/data/tinyshakespeare/tiny_shakespeare_val.bin |
| output log dir | NULL |
| checkpoint_every | 0 |
| resume | 0 |
| micro batch size B | 4 |
| sequence length T | 1024 |
| total batch size | 4096 |
| LR scheduler | cosine |
| learning rate (LR) | 3.000000e-04 |
| warmup iterations | 0 |
| final LR fraction | 1.000000e+00 |
| weight decay | 0.000000e+00 |
| skip update lossz | 0.000000 |
| skip update gradz | 0.000000 |
| max_steps | -1 |
| val_loss_every | 20 |
| val_max_steps | 20 |
| sample_every | 20 |
| genT | 64 |
| overfit_single_batch | 0 |
| use_master_weights | enabled |
| gelu_fusion | 0 |
| recompute | 1 |
+-----------------------+----------------------------------------------------+
| device | NVIDIA GeForce RTX 2060 |
| peak TFlops | -1.0 |
| precision | BF16 |
+-----------------------+----------------------------------------------------+
| weight init method | gpt2_124M_bf16.bin |
| max_sequence_length T | 1024 |
| vocab_size V | 50257 |
| padded_vocab_size Vp | 50304 |
| num_layers L | 12 |
| num_heads NH | 12 |
| channels C | 768 |
| num_parameters | 124475904 |
+-----------------------+----------------------------------------------------+
| train_num_batches | 74 |
| val_num_batches | 20 |
+-----------------------+----------------------------------------------------+
| run hellaswag | no |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled |
| num_processes | 1 |
| zero_stage | 0 |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=1 and total_batch_size=4096
=> setting grad_accum_steps=1
allocating 237 MiB for parameter gradients
allocating 1326 MiB for activations
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
device memory usage: 3652 MiB / 5740 MiB
memory per sequence: 331 MiB
-> estimated maximum batch size: 10
[CUDNN ERROR] at file llmc/cudnn_att.cpp:120:
[cudnn_frontend] Error: No execution plans support the graph.
tesla@tesla:~/e/llm.c$ make train_gpt2cu USE_CUDNN=1
---------------------------------------------
✓ cuDNN found, will run with flash-attention
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -c --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DENABLE_CUDNN -DMULTI_GPU -DUSE_MPI -DENABLE_BF16 llmc/cudnn_att.cpp -I/home/tesla/cudnn-frontend/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -o build/cudnn_att.o
/usr/local/cuda/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DENABLE_CUDNN -DMULTI_GPU -DUSE_MPI -DENABLE_BF16 train_gpt2.cu build/cudnn_att.o -lcublas -lcublasLt -lnvidia-ml -lcudnn -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/home/tesla/cudnn-frontend/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -lnccl -lmpi -o train_gpt2cu
tesla@tesla:~/e/llm.c$ ./train_gpt2cu
+-----------------------+----------------------------------------------------+
| Parameter | Value |
+-----------------------+----------------------------------------------------+
| train data pattern | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern | dev/data/tinyshakespeare/tiny_shakespeare_val.bin |
| output log dir | NULL |
| checkpoint_every | 0 |
| resume | 0 |
| micro batch size B | 4 |
| sequence length T | 1024 |
| total batch size | 4096 |
| LR scheduler | cosine |
| learning rate (LR) | 3.000000e-04 |
| warmup iterations | 0 |
| final LR fraction | 1.000000e+00 |
| weight decay | 0.000000e+00 |
| skip update lossz | 0.000000 |
| skip update gradz | 0.000000 |
| max_steps | -1 |
| val_loss_every | 20 |
| val_max_steps | 20 |
| sample_every | 20 |
| genT | 64 |
| overfit_single_batch | 0 |
| use_master_weights | enabled |
| gelu_fusion | 0 |
| recompute | 1 |
+-----------------------+----------------------------------------------------+
| device | NVIDIA GeForce RTX 2060 |
| peak TFlops | -1.0 |
| precision | BF16 |
+-----------------------+----------------------------------------------------+
| weight init method | gpt2_124M_bf16.bin |
| max_sequence_length T | 1024 |
| vocab_size V | 50257 |
| padded_vocab_size Vp | 50304 |
| num_layers L | 12 |
| num_heads NH | 12 |
| channels C | 768 |
| num_parameters | 124475904 |
+-----------------------+----------------------------------------------------+
| train_num_batches | 74 |
| val_num_batches | 20 |
+-----------------------+----------------------------------------------------+
| run hellaswag | no |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled |
| num_processes | 1 |
| zero_stage | 0 |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=1 and total_batch_size=4096
=> setting grad_accum_steps=1
allocating 237 MiB for parameter gradients
allocating 1326 MiB for activations
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
device memory usage: 3566 MiB / 5919 MiB
memory per sequence: 331 MiB
-> estimated maximum batch size: 11
[CUDNN ERROR] at file llmc/cudnn_att.cpp:120:
[cudnn_frontend] Error: No execution plans support the graph.
./train_gpt2cu # gives error
The Python script runs:
tesla@tesla:~/e/llm.c$ python3 train_gpt2.py
/home/tesla/exp/llm.c/train_gpt2.py:34: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
from torch.distributed.optim import ZeroRedundancyOptimizer
Running pytorch 2.4.0+cu121
using device: cuda
total desired batch size: 256
=> calculated gradient accumulation steps: 1
wrote gpt2_tokenizer.bin
loading weights from pretrained gpt: gpt2
DataLoader: total number of tokens: 32,768 across 1 files
padded vocab size from 50257 to 50304
wrote gpt2_124M.bin
padded vocab size from 50257 to 50304
wrote gpt2_124M_bf16.bin
padded vocab size in reference grads from 50257 to 50304
wrote gpt2_124M_debug_state.bin
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
using regular AdamW
step 1/10 | train loss 5.270009 | norm 30.5000 | lr 1.00e-04 | (135.69 ms | 1887 tok/s)
step 2/10 | train loss 4.060703 | norm 17.0772 | lr 1.00e-04 | (104.22 ms | 2456 tok/s)
step 3/10 | train loss 3.320115 | norm 14.7840 | lr 1.00e-04 | (96.73 ms | 2647 tok/s)
step 4/10 | train loss 2.717573 | norm 13.1957 | lr 1.00e-04 | (100.47 ms | 2548 tok/s)
step 5/10 | train loss 2.181084 | norm 12.3892 | lr 1.00e-04 | (103.23 ms | 2480 tok/s)
step 6/10 | train loss 1.653934 | norm 10.6317 | lr 1.00e-04 | (97.49 ms | 2626 tok/s)
step 7/10 | train loss 1.168067 | norm 9.7828 | lr 1.00e-04 | (98.37 ms | 2602 tok/s)
step 8/10 | train loss 0.736853 | norm 8.1185 | lr 1.00e-04 | (100.69 ms | 2543 tok/s)
step 9/10 | train loss 0.400987 | norm 6.2682 | lr 1.00e-04 | (104.10 ms | 2459 tok/s)
step 10/10 | train loss 0.187464 | norm 3.6643 | lr 1.00e-04 | (97.34 ms | 2630 tok/s)
final 9 iters avg: 100.293ms
peak memory consumption: 2320 MiB
Is the error coming from cudnn-frontend, or from something else?
The Python script is able to run the test, while ./train_gpt2cu gives the error.
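One way to narrow this down is to check which compute capability the GPU reports, since the cuDNN attention engines are selected per GPU architecture. Below is a minimal sketch using the standard CUDA runtime API (the file name check_cc.cu is just for illustration, not part of llm.c):

// check_cc.cu -- print the GPU's compute capability
// compile with: nvcc check_cc.cu -o check_cc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        printf("failed to query device %d\n", device);
        return 1;
    }
    // RTX 2060 (Turing) reports 7.5; Ampere and newer report 8.0+
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}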
I'm hitting this error too, have you solved it? It looks like an error from cudnn-frontend.
Does anyone have an idea how to address this? The same problem occurs on Ubuntu with CUDA 12.6 and cuDNN.
(Without cuDNN it is extremely slow.)
cuDNN SDPA doesn't support Turing GPUs.
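For reference, a minimal sketch of how that could be checked at runtime (this is not the actual llm.c code; cudnn_sdpa_supported is a hypothetical helper): skip the cuDNN SDPA path when the device reports a compute capability below 8.0 (Ampere) and stay on the non-cuDNN attention path instead.

// sketch only, not llm.c source: gate the cuDNN SDPA path on Ampere (sm_80) or newer
#include <cuda_runtime.h>

static bool cudnn_sdpa_supported(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;  // if the query fails, stay on the fallback path
    }
    return prop.major >= 8;  // Turing (7.5) has no cuDNN fused SDPA engines
}

On the RTX 2060 above this returns false, which matches the observed failure; building without USE_CUDNN=1 avoids the error at the cost of the slower attention path noted earlier.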
Dear @mnicely, thanks a lot, this clarifies it!