karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from Github: https://github.com/karpathy/llm.c

ERROR on the AMD GPU

Lookforworld opened this issue

@anthonix
When following the example code, an error occurred:

root@0a98733c1ebb:/home/llm.c# ./train_gpt2amd 
Multi-GPU support is disabled. Using a single GPU.
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 4096                                               |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | Radeon RX 7900 XTX                                 |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| load_filename         | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled                                              |
| num_processes         | 1                                                  |
| zero_stage            | 0                                                  |
+-----------------------+----------------------------------------------------+
HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run `python dev/data/hellaswag.py` to export and use it with `-h 1`.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=1 and total_batch_size=4096
=> setting grad_accum_steps=1
allocating 2589 MiB for activations
[CUDA ERROR] at file train_gpt2.hip:1753:
the operation cannot be performed in the present state

GPU: 7900 XTX
Env: ROCm Docker

How can I solve this problem?
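For context on the failure point: the `[CUDA ERROR] at file train_gpt2.hip:1753` line comes from the project's error-check wrapper, which wraps each HIP/CUDA runtime call and prints the file and line of the call that returned a non-success status. Below is a minimal sketch of that pattern, assuming the AMD fork mirrors upstream llm.c's `cudaCheck`; the names `hip_check`/`hipCheck` here are illustrative, not necessarily the exact ones in the source.

```c
#include <stdio.h>
#include <stdlib.h>
#include <hip/hip_runtime.h>

// Illustrative error-check wrapper in the style of llm.c's cudaCheck:
// every HIP runtime call is wrapped, and a failing call prints the
// file/line where it was made, then aborts.
static void hip_check(hipError_t error, const char *file, int line) {
    if (error != hipSuccess) {
        printf("[CUDA ERROR] at file %s:%d:\n%s\n",
               file, line, hipGetErrorString(error));
        exit(EXIT_FAILURE);
    }
}
#define hipCheck(err) (hip_check((err), __FILE__, __LINE__))
```

Read this way, "the operation cannot be performed in the present state" is simply the error string for whichever HIP call is wrapped at line 1753 of the generated train_gpt2.hip, so that line pinpoints the failing runtime call.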

Can you please try on bare metal? i.e., don't run inside docker.

@anthonix
I've tested it, and it's the same error when running on the host.
At first I thought some dependencies might be missing on my host, so I switched to Docker, but the error remains.
That's why I'm submitting this issue.

Anything else running on the GPU?
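For what it's worth, besides checking `rocm-smi`, one quick way to see whether another process is holding memory on the device is to query free vs. total VRAM before any allocations. A minimal sketch, assuming a standard ROCm install (compile with `hipcc`):

```c
#include <stdio.h>
#include <hip/hip_runtime.h>

// Print free vs. total VRAM on device 0. If "free" is far below "total"
// on an otherwise idle machine, something else is already using the GPU.
int main(void) {
    size_t free_bytes = 0, total_bytes = 0;
    if (hipSetDevice(0) != hipSuccess ||
        hipMemGetInfo(&free_bytes, &total_bytes) != hipSuccess) {
        fprintf(stderr, "HIP query failed\n");
        return 1;
    }
    printf("free: %zu MiB / total: %zu MiB\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```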

DM me on Discord (@anthonix on Karpathy's server); it will be faster to chat about it there, and then we can loop back here once we've figured it out.

It's probably better if we work this out over on the AMD fork, so as not to distract folks here.

I just enabled issues and discussions over there; can you please close this issue and the related discussion #534, and reopen them over on the AMD fork?