MPI run with 8 GPUs fails
msharmavikram opened this issue
mpirun -np 8 ./train_gpt2cu
+-----------------------+----------------------------------------------------+
| Parameter | Value |
+-----------------------+----------------------------------------------------+
| train data pattern | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern | dev/data/tinyshakespeare/tiny_shakespeare_val.bin |
| output log dir | NULL |
| checkpoint_every | 0 |
| resume | 0 |
| micro batch size B | 4 |
| sequence length T | 1024 |
| total batch size | 32768 |
| LR scheduler | cosine |
| learning rate (LR) | 3.000000e-04 |
| warmup iterations | 0 |
| final LR fraction | 1.000000e+00 |
| weight decay | 0.000000e+00 |
| skip update lossz | 0.000000 |
| skip update gradz | 0.000000 |
| max_steps | -1 |
| val_loss_every | 20 |
| val_max_steps | 20 |
| sample_every | 20 |
| genT | 64 |
| overfit_single_batch | 0 |
| use_master_weights | enabled |
| gelu_fusion | 0 |
| recompute | 1 |
+-----------------------+----------------------------------------------------+
| device | NVIDIA A100-SXM4-80GB |
| peak TFlops | 312.0 |
| precision | BF16 |
+-----------------------+----------------------------------------------------+
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10951] *** Process received signal ***
[149-130-218-240:10951] Signal: Aborted (6)
[149-130-218-240:10951] Signal code: (-6)
[149-130-218-240:10951] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe612442520]
[149-130-218-240:10951] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe6124969fc]
[149-130-218-240:10951] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe612442476]
[149-130-218-240:10951] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe6124287f3]
[149-130-218-240:10951] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe61242871b]
[149-130-218-240:10951] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe612439e96]
[149-130-218-240:10951] [ 6] ./train_gpt2cu(+0x17762)[0x55f5ea98f762]
[149-130-218-240:10951] [ 7] ./train_gpt2cu(+0xf120)[0x55f5ea987120]
[149-130-218-240:10951] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe612429d90]
[149-130-218-240:10951] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe612429e40]
[149-130-218-240:10951] [10] ./train_gpt2cu(+0x13275)[0x55f5ea98b275]
[149-130-218-240:10951] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10949] *** Process received signal ***
[149-130-218-240:10949] Signal: Aborted (6)
[149-130-218-240:10949] Signal code: (-6)
[149-130-218-240:10949] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4969642520]
[149-130-218-240:10949] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f49696969fc]
[149-130-218-240:10949] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4969642476]
[149-130-218-240:10949] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f49696287f3]
[149-130-218-240:10949] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f496962871b]
[149-130-218-240:10949] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4969639e96]
[149-130-218-240:10949] [ 6] ./train_gpt2cu(+0x17762)[0x55756a4e6762]
[149-130-218-240:10949] [ 7] ./train_gpt2cu(+0xf120)[0x55756a4de120]
[149-130-218-240:10949] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4969629d90]
[149-130-218-240:10949] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4969629e40]
[149-130-218-240:10949] [10] ./train_gpt2cu(+0x13275)[0x55756a4e2275]
[149-130-218-240:10949] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10947] *** Process received signal ***
[149-130-218-240:10947] Signal: Aborted (6)
[149-130-218-240:10947] Signal code: (-6)
[149-130-218-240:10947] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fd0d6042520]
[149-130-218-240:10947] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fd0d60969fc]
[149-130-218-240:10947] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fd0d6042476]
[149-130-218-240:10947] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fd0d60287f3]
[149-130-218-240:10947] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fd0d602871b]
[149-130-218-240:10947] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fd0d6039e96]
[149-130-218-240:10947] [ 6] ./train_gpt2cu(+0x17762)[0x55b68d44b762]
[149-130-218-240:10947] [ 7] ./train_gpt2cu(+0xf120)[0x55b68d443120]
[149-130-218-240:10947] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd0d6029d90]
[149-130-218-240:10947] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd0d6029e40]
[149-130-218-240:10947] [10] ./train_gpt2cu(+0x13275)[0x55b68d447275]
[149-130-218-240:10947] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10948] *** Process received signal ***
[149-130-218-240:10948] Signal: Aborted (6)
[149-130-218-240:10948] Signal code: (-6)
[149-130-218-240:10948] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fcbac242520]
[149-130-218-240:10948] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fcbac2969fc]
[149-130-218-240:10948] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fcbac242476]
[149-130-218-240:10948] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fcbac2287f3]
[149-130-218-240:10948] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fcbac22871b]
[149-130-218-240:10948] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fcbac239e96]
[149-130-218-240:10948] [ 6] ./train_gpt2cu(+0x17762)[0x55c4774ce762]
[149-130-218-240:10948] [ 7] ./train_gpt2cu(+0xf120)[0x55c4774c6120]
[149-130-218-240:10948] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fcbac229d90]
[149-130-218-240:10948] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fcbac229e40]
[149-130-218-240:10948] [10] ./train_gpt2cu(+0x13275)[0x55c4774ca275]
[149-130-218-240:10948] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10950] *** Process received signal ***
[149-130-218-240:10950] Signal: Aborted (6)
[149-130-218-240:10950] Signal code: (-6)
[149-130-218-240:10950] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7faae5a42520]
[149-130-218-240:10950] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7faae5a969fc]
[149-130-218-240:10950] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7faae5a42476]
[149-130-218-240:10950] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7faae5a287f3]
[149-130-218-240:10950] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7faae5a2871b]
[149-130-218-240:10950] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7faae5a39e96]
[149-130-218-240:10950] [ 6] ./train_gpt2cu(+0x17762)[0x562edaec8762]
[149-130-218-240:10950] [ 7] ./train_gpt2cu(+0xf120)[0x562edaec0120]
[149-130-218-240:10950] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7faae5a29d90]
[149-130-218-240:10950] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7faae5a29e40]
[149-130-218-240:10950] [10] ./train_gpt2cu(+0x13275)[0x562edaec4275]
[149-130-218-240:10950] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10945] *** Process received signal ***
[149-130-218-240:10945] Signal: Aborted (6)
[149-130-218-240:10945] Signal code: (-6)
[149-130-218-240:10945] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe034642520]
[149-130-218-240:10945] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe0346969fc]
[149-130-218-240:10945] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe034642476]
[149-130-218-240:10945] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe0346287f3]
[149-130-218-240:10945] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe03462871b]
[149-130-218-240:10945] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe034639e96]
[149-130-218-240:10945] [ 6] ./train_gpt2cu(+0x17762)[0x561977d15762]
[149-130-218-240:10945] [ 7] ./train_gpt2cu(+0xf120)[0x561977d0d120]
[149-130-218-240:10945] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe034629d90]
[149-130-218-240:10945] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe034629e40]
[149-130-218-240:10945] [10] ./train_gpt2cu(+0x13275)[0x561977d11275]
[149-130-218-240:10945] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10946] *** Process received signal ***
[149-130-218-240:10946] Signal: Aborted (6)
[149-130-218-240:10946] Signal code: (-6)
[149-130-218-240:10946] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4bd8842520]
[149-130-218-240:10946] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f4bd88969fc]
[149-130-218-240:10946] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4bd8842476]
[149-130-218-240:10946] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f4bd88287f3]
[149-130-218-240:10946] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f4bd882871b]
[149-130-218-240:10946] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4bd8839e96]
[149-130-218-240:10946] [ 6] ./train_gpt2cu(+0x17762)[0x5637c07ba762]
[149-130-218-240:10946] [ 7] ./train_gpt2cu(+0xf120)[0x5637c07b2120]
[149-130-218-240:10946] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4bd8829d90]
[149-130-218-240:10946] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4bd8829e40]
[149-130-218-240:10946] [10] ./train_gpt2cu(+0x13275)[0x5637c07b6275]
[149-130-218-240:10946] *** End of error message ***
| weight init method | gpt2_124M_bf16.bin |
| max_sequence_length T | 1024 |
| vocab_size V | 50257 |
| padded_vocab_size Vp | 50304 |
| num_layers L | 12 |
| num_heads NH | 12 |
| channels C | 768 |
| num_parameters | 124475904 |
+-----------------------+----------------------------------------------------+
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10944] *** Process received signal ***
[149-130-218-240:10944] Signal: Aborted (6)
[149-130-218-240:10944] Signal code: (-6)
[149-130-218-240:10944] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f21acc42520]
[149-130-218-240:10944] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f21acc969fc]
[149-130-218-240:10944] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f21acc42476]
[149-130-218-240:10944] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f21acc287f3]
[149-130-218-240:10944] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f21acc2871b]
[149-130-218-240:10944] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f21acc39e96]
[149-130-218-240:10944] [ 6] ./train_gpt2cu(+0x17762)[0x55d509142762]
[149-130-218-240:10944] [ 7] ./train_gpt2cu(+0xf120)[0x55d50913a120]
[149-130-218-240:10944] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f21acc29d90]
[149-130-218-240:10944] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f21acc29e40]
[149-130-218-240:10944] [10] ./train_gpt2cu(+0x13275)[0x55d50913e275]
[149-130-218-240:10944] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node 149-130-218-240 exited on signal 6 (Aborted).
MPI runs with 4 or 6 GPUs work just fine.
I am running this on CUDA 12.2, without cuDNN, on Lambda Labs cloud.
This assertion suggests the dataset is too small for this number of processes:
Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
You can confirm this by trying a larger dataset (FineWeb, for example).
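To see why 8 processes fail while 4 or 6 pass, the bound in the assertion can be evaluated for the values printed in the log (B = 4, T = 1024). This is only a sketch of the arithmetic: the helper name `min_shard_tokens` is hypothetical and not part of llm.c.

```python
def min_shard_tokens(num_processes: int, B: int, T: int) -> int:
    """Smallest shard_ntok that satisfies the assertion
    shard_ntok >= num_processes * B * T + 1 (hypothetical helper)."""
    return num_processes * B * T + 1


if __name__ == "__main__":
    B, T = 4, 1024  # micro batch size and sequence length from the log
    for n in (4, 6, 8):
        print(f"{n} processes: each shard needs >= {min_shard_tokens(n, B, T)} tokens")
```

With 8 processes, each shard must hold at least 32,769 tokens, compared with 24,577 for 6 processes and 16,385 for 4. Any shard smaller than that bound trips the assertion, so a larger dataset, or a smaller B or T, should relax it.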