karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from Github: https://github.com/karpathy/llm.c

MPI run error

wzzanthony opened this issue

I tried to run the program, but I ran into the following error:

+-----------------------+----------------------------------------------------+
| Parameter | Value |
+-----------------------+----------------------------------------------------+
| train data pattern | ../fineweb10B/fineweb_train_*.bin |
| val data pattern | ../fineweb10B/fineweb_val_*.bin |
| output log dir | log124M |
| checkpoint_every | 5000 |
| resume | 0 |
| micro batch size B | 64 |
| sequence length T | 1024 |
| total batch size | 524288 |
| LR scheduler | cosine |
| learning rate (LR) | 6.000000e-04 |
| warmup iterations | 700 |
| final LR fraction | 0.000000e+00 |
| weight decay | 1.000000e-01 |
| skip update lossz | 0.000000 |
| skip update gradz | 0.000000 |
| max_steps | -1 |
| val_loss_every | 250 |
| val_max_steps | 20 |
| sample_every | 20000 |
| genT | 64 |
| overfit_single_batch | 0 |
| use_master_weights | enabled |
| gelu_fusion | 0 |
| recompute | 1 |
+-----------------------+----------------------------------------------------+
| device | NVIDIA A100-SXM4-80GB |
| peak TFlops | 312.0 |
| precision | BF16 |
+-----------------------+----------------------------------------------------+
| weight init method | d12 |
| max_sequence_length T | 1024 |
| vocab_size V | 50257 |
| padded_vocab_size Vp | 50304 |
| num_layers L | 12 |
| num_heads NH | 12 |
| channels C | 768 |
| num_parameters | 124475904 |
+-----------------------+----------------------------------------------------+
| train_num_batches | 19560 |
| val_num_batches | 20 |
+-----------------------+----------------------------------------------------+
| run hellaswag | no |
+-----------------------+----------------------------------------------------+
| num_processes | 8 |
| zero_stage | 1 |
+-----------------------+----------------------------------------------------+
HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run python dev/data/hellaswag.py to export and use it with -h 1.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=64 * seq_len T=1024 * num_processes=8 and total_batch_size=524288
=> setting grad_accum_steps=1
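
For reference, the grad_accum_steps value above follows from the batch-size arithmetic printed in the log: gradient accumulation steps are the total token batch size divided by the tokens processed in one micro-step across all ranks. A minimal sketch of that arithmetic (an illustration only, not llm.c source):

    #include <stdio.h>

    int main(void) {
        // Values taken from the log above.
        long long B = 64, T = 1024, num_processes = 8;
        long long total_batch_size = 524288;
        // Tokens processed by one micro-step across all 8 ranks: 64 * 1024 * 8 = 524288.
        long long tokens_per_micro_step = B * T * num_processes;
        // 524288 / 524288 = 1, matching "setting grad_accum_steps=1" in the log.
        printf("grad_accum_steps = %lld\n", total_batch_size / tokens_per_micro_step);
        return 0;
    }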

WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it

allocating 237 MiB for parameter gradients

WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it

allocating 21216 MiB for activations
allocating 59 MiB for AdamW optimizer state m
allocating 59 MiB for AdamW optimizer state v
allocating 59 MiB for master copy of params
device memory usage: 23273 MiB / 81050 MiB
memory per sequence: 331 MiB
-> estimated maximum batch size: 238
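
As an aside, the "estimated maximum batch size: 238" figure appears to follow from the memory numbers just above: the free device memory divided by the per-sequence footprint, added to the current micro batch size. A rough sketch of that estimate (an assumption about how the number is derived, not llm.c source):

    #include <stdio.h>

    int main(void) {
        // Numbers taken from the log above.
        int B = 64;                               // current micro batch size
        int used_mib = 23273, total_mib = 81050;  // "device memory usage" line
        int per_seq_mib = 331;                    // "memory per sequence" line
        int free_mib = total_mib - used_mib;      // 57777 MiB still free
        int extra_seqs = free_mib / per_seq_mib;  // 174 additional sequences fit
        printf("estimated maximum batch size: %d\n", B + extra_seqs);  // prints 238
        return 0;
    }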
val loss 11.009205
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.

My program runs on the school's server, and because I don't have sudo privileges, I can only run it inside the container provided by the school. The CUDA version is 12.3, the cuDNN version is 8.9.7, and cudnn-frontend is installed in my home directory (~/).
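
Since the cudnn-frontend headers live in my home directory while the cuDNN runtime comes from the container, one thing I can double-check is whether the cuDNN version loaded at runtime matches the one the binary was compiled against. A minimal standalone check (the file name is made up; cudnnGetVersion() and CUDNN_VERSION come from cudnn.h), in case a mismatch is what makes cudnn_frontend report that no execution plan supports the attention graph:

    /* check_cudnn_version.c -- hypothetical standalone helper, not part of llm.c. */
    #include <stdio.h>
    #include <cudnn.h>

    int main(void) {
        /* Version of the cuDNN library actually loaded at runtime (from the container). */
        printf("runtime cuDNN version:      %zu\n", (size_t)cudnnGetVersion());
        /* Version of the cuDNN headers this file was compiled against (from ~/). */
        printf("compile-time CUDNN_VERSION: %d\n", (int)CUDNN_VERSION);
        return 0;
    }

If the two disagree, that would at least narrow down where the "No execution plans support the graph" error comes from. Another way to isolate the problem might be to rebuild train_gpt2cu without USE_CUDNN=1, so attention falls back to the non-cuDNN kernels, and see whether training then proceeds.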