karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from Github: https://github.com/karpathy/llm.c

ERROR on the AMD GPU

Lookforworld opened this issue

@anthonix
When following the example code, an error occurred:

root@0a98733c1ebb:/home/llm.c# ./train_gpt2amd 
Multi-GPU support is disabled. Using a single GPU.
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 4096                                               |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | Radeon RX 7900 XTX                                 |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| load_filename         | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled                                              |
| num_processes         | 1                                                  |
| zero_stage            | 0                                                  |
+-----------------------+----------------------------------------------------+
HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run `python dev/data/hellaswag.py` to export and use it with `-h 1`.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=1 and total_batch_size=4096
=> setting grad_accum_steps=1
allocating 2589 MiB for activations
[CUDA ERROR] at file train_gpt2.hip:1753:
the operation cannot be performed in the present state

GPU: 7900 XTX
Env: ROCm Docker

How can I solve this problem?
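For context on the failure point: the `[CUDA ERROR] at file train_gpt2.hip:1753` line comes from the project's error-check wrapper, which wraps each HIP/CUDA runtime call and prints the file and line of the call that returned a non-success status. Below is a minimal sketch of that pattern, assuming the AMD fork mirrors upstream llm.c's `cudaCheck`; the names `hip_check`/`hipCheck` here are illustrative, not necessarily the exact ones in the source.

```c
#include <stdio.h>
#include <stdlib.h>
#include <hip/hip_runtime.h>

// Illustrative error-check wrapper in the style of llm.c's cudaCheck:
// every HIP runtime call is wrapped, and a failing call prints the
// file/line where it was made, then aborts.
static void hip_check(hipError_t error, const char *file, int line) {
    if (error != hipSuccess) {
        printf("[CUDA ERROR] at file %s:%d:\n%s\n",
               file, line, hipGetErrorString(error));
        exit(EXIT_FAILURE);
    }
}
#define hipCheck(err) (hip_check((err), __FILE__, __LINE__))
```

Read this way, "the operation cannot be performed in the present state" is simply the error string for whichever HIP call is wrapped at line 1753 of the generated train_gpt2.hip, so that line pinpoints the failing runtime call.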

Can you please try on bare metal? i.e., don't run inside docker.

@anthonix
I've tested it, and it's the same error when running on the host.
At first I thought some dependencies might be missing on my host, so I switched to Docker, but the error remains.
That's why I'm submitting this issue.

Anything else running on the GPU?
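For what it's worth, besides checking `rocm-smi`, one quick way to see whether another process is holding memory on the device is to query free vs. total VRAM before any allocations. A minimal sketch, assuming a standard ROCm install (compile with `hipcc`):

```c
#include <stdio.h>
#include <hip/hip_runtime.h>

// Print free vs. total VRAM on device 0. If "free" is far below "total"
// on an otherwise idle machine, something else is already using the GPU.
int main(void) {
    size_t free_bytes = 0, total_bytes = 0;
    if (hipSetDevice(0) != hipSuccess ||
        hipMemGetInfo(&free_bytes, &total_bytes) != hipSuccess) {
        fprintf(stderr, "HIP query failed\n");
        return 1;
    }
    printf("free: %zu MiB / total: %zu MiB\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```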

DM me on Discord (@anthonix on Karpathy's server); it will be faster to chat about it there, and then we can loop back here once we've figured it out.

It's probably better if we work this out over on the AMD fork, so as not to distract folks here.

I just enabled issues and discussions over there; can you please close this issue and the related discussion #534, and reopen them over on the AMD fork?