nvproxy: unknown control command 0x3d05
thundergolfer opened this issue · comments
Description
Multi-GPU training on A100s gets stuck when run under gVisor. I tried the program below on the following GPUs within Modal:
- A100 40 GiB (Oracle Cloud) ❌
- H100 (a3-highgpu-8g) ❌
- A10G ✔️
- T4 ✔️
Both the H100 and A100 run into these unknown control commands:
W0509 01:16:28.218428 1772489 frontend.go:521] [ 6: 20] nvproxy: unknown control command 0x3d05 (paramsSize=24)
W0509 01:16:28.218780 1772489 frontend.go:521] [ 5: 22] nvproxy: unknown control command 0x3d05 (paramsSize=24)
This command is NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD:
https://github.com/NVIDIA/open-gpu-kernel-modules/blob/083cd9cf17ab95cd6f9fb50a5349c21eaa2f7d4b/src/common/sdk/nvidia/inc/ctrl/ctrl0000/ctrl0000unix.h#L146-L147
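When triaging warnings like the ones above, it can help to pull every unknown command ID out of a runsc debug log in one pass. The sketch below does that for the log format shown in this issue. The helper names are hypothetical, and the split of the ID into an interface byte and a message byte is a simplifying assumption for illustration, not something confirmed in this thread; check the linked header before relying on it.

```python
import re

# Matches the nvproxy warning format shown in the log lines above.
UNKNOWN_CTRL_RE = re.compile(
    r"nvproxy: unknown control command (0x[0-9a-fA-F]+) \(paramsSize=(\d+)\)"
)


def decode_ctrl_cmd(cmd: int) -> dict:
    """Split a control command ID into fields.

    Assumed layout (hypothetical, for illustration): class in bits 31:16,
    interface in bits 15:8, message in bits 7:0.
    """
    return {
        "class": (cmd >> 16) & 0xFFFF,
        "interface": (cmd >> 8) & 0xFF,
        "message": cmd & 0xFF,
    }


def unknown_ctrl_cmds(log_text: str) -> dict:
    """Map each unknown command ID to the set of paramsSize values seen."""
    seen = {}
    for match in UNKNOWN_CTRL_RE.finditer(log_text):
        cmd = int(match.group(1), 16)
        seen.setdefault(cmd, set()).add(int(match.group(2)))
    return seen


# The two warning lines quoted in this issue.
sample = """\
W0509 01:16:28.218428 1772489 frontend.go:521] [ 6: 20] nvproxy: unknown control command 0x3d05 (paramsSize=24)
W0509 01:16:28.218780 1772489 frontend.go:521] [ 5: 22] nvproxy: unknown control command 0x3d05 (paramsSize=24)
"""

for cmd, sizes in unknown_ctrl_cmds(sample).items():
    print(hex(cmd), decode_ctrl_cmd(cmd), sorted(sizes))
```

Deduplicating by command ID keeps the report short even when every rank logs the same warning.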
Steps to reproduce
FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed torch torchvision lightning numpy memory_profiler
COPY <<EOF repro.py
print("Hello from inside container.")
import psutil
current_process = psutil.Process()
parent_process = current_process.parent()
print(f"Processes: {current_process=} {parent_process=}")
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
from memory_profiler import profile
from torchvision.datasets import CIFAR100
from torchvision import transforms
from torchvision import models
from torch.utils.data import DataLoader
class MagixNet(L.LightningModule):
    def __init__(self, nbr_cat):
        super().__init__()
        module = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        module.fc = nn.Linear(2048, nbr_cat)
        self.module = module
    def forward(self, x):
        return self.module(x)
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)
def prepare_data():
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_ds = CIFAR100('data', train=True, download=True, transform=pipeline)
    train_dl = DataLoader(train_ds, batch_size=128, num_workers=4)
    val_ds = CIFAR100('data', train=False, download=True, transform=pipeline)
    val_dl = DataLoader(val_ds, batch_size=128, num_workers=4)
    return train_dl, val_dl
torch.set_float32_matmul_precision('medium')
train_dl, val_dl = prepare_data()
model = MagixNet(100)
trainer = L.Trainer(max_epochs=1, strategy="ddp_notebook")
start = time.time()
trainer.fit(model, train_dl, val_dl)
print(f"Training duration (seconds): {time.time() - start:.2f}")
EOF
ENTRYPOINT ["python3", "repro.py"]
runsc version
`runsc version 6e61813c1b37
spec: 1.1.0-rc.1`
docker version (if using docker)
N/A
uname
No response
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
The reproduction program is almost identical to the one in #9827, which is why I revisited that issue's test.
This seems to run fine for me on an A100-40GB machine in GCE with driver version 535.104.05:
(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='15:24:33') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9193099.59it/s]
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 156MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M Trainable params
0 Non-trainable params
23.7 M Total params
94.852 Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00, 5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00, 5.62it/s, v_num=0]
-------------------------------------------------------------------------------
repro.py 63 <module>
print(f"Training duration (seconds): {time.time() - start:2.f}")
ValueError:
Format specifier missing precision
(base) ayushranjan_google_com@a100:~/issue10413$ nvidia-smi
Thu May 9 15:27:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 35C P0 49W / 400W | 4MiB / 40960MiB | 27% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Please note:
- There seems to be an issue with the last print statement in repro.py. Other than that, the application seems to work fine.
- I am using `--shm-size=128g` as per #9827 (comment).
- The debug logs don't have any `nvproxy: unknown` lines.
So maybe you are using a different driver version? Or maybe something to do with the Oracle Cloud environment?
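For reference, the print failure noted in the first bullet comes from the format specifier `2.f` shown in the traceback. A minimal sketch of the difference between that and `.2f`, using a made-up `elapsed` value:

```python
elapsed = 72.349  # hypothetical duration, for illustration only

# "2.f" parses as width=2 followed by a "." with no precision digits,
# so Python rejects it at format time.
try:
    print(f"Training duration (seconds): {elapsed:2.f}")
except ValueError as err:
    print(f"ValueError: {err}")

# ".2f" means fixed-point with two digits after the decimal point.
message = f"Training duration (seconds): {elapsed:.2f}"
print(message)
```

Because the specifier is only evaluated when the f-string runs, the error surfaces after training finishes rather than at import time.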
- Oh yep, fixed that in the original description.
- Our `--shm-size` is also set very large. On Oracle workers it's around 1657GB.
We have `Driver Version: 535.129.03 CUDA Version: 12.2`. Sorry, I should have included that in the issue originally!
On H100 worker:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:04:00.0 Off | 0 |
| N/A 36C P0 113W / 700W | 72459MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:05:00.0 Off | 0 |
| N/A 34C P0 117W / 700W | 72507MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:0A:00.0 Off | 0 |
| N/A 35C P0 114W / 700W | 72507MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:0B:00.0 Off | 0 |
| N/A 33C P0 111W / 700W | 72587MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:84:00.0 Off | 0 |
| N/A 60C P0 578W / 700W | 71533MiB / 81559MiB | 95% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:85:00.0 Off | 0 |
| N/A 34C P0 112W / 700W | 841MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:8A:00.0 Off | 0 |
| N/A 34C P0 114W / 700W | 16463MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:8B:00.0 Off | 0 |
| N/A 34C P0 111W / 700W | 2405MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 759790 C /opt/conda/bin/python3.10 72446MiB |
We use the same driver version across all GPU workers.
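Since the two environments in this thread report different driver versions (535.104.05 on the GCE VM, 535.129.03 on the Modal workers), a small helper for comparing them from captured `nvidia-smi` output may be handy. This is a sketch that assumes the header format shown above; `driver_version` is a hypothetical helper name.

```python
import re

# Matches the "Driver Version: X.Y.Z" field in the nvidia-smi header.
DRIVER_RE = re.compile(r"Driver Version:\s*(\d+(?:\.\d+)+)")


def driver_version(smi_text: str):
    """Return the driver version as a tuple of ints, or None if absent."""
    match = DRIVER_RE.search(smi_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Header lines copied from the two nvidia-smi outputs in this issue.
gce_header = "| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |"
modal_header = "| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |"

gce = driver_version(gce_header)
modal = driver_version(modal_header)
print(gce, modal, "match" if gce == modal else "differ")
```

Comparing tuples rather than strings avoids surprises from zero-padded components such as `05` and `03`.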
I updated the driver version and still cannot reproduce the failure on my GCE VM:
(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='16:01:41') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9140159.09it/s]
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 74.1MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M Trainable params
0 Non-trainable params
23.7 M Total params
94.852 Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00, 5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00, 5.62it/s, v_num=0]
Training duration (seconds): 72.35
Surprisingly, this workload gets stuck without gVisor. I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though; hopefully it resolves whatever failure you are seeing.
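Both run logs above include a Lightning line of the form `Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1`. When comparing runs across environments, it can be useful to extract how many ranks each run actually started. The sketch below assumes that log format; `world_size` is a hypothetical helper name.

```python
import re

# Matches the Lightning "Initializing distributed" line format seen above.
MEMBER_RE = re.compile(r"GLOBAL_RANK: (\d+), MEMBER: (\d+)/(\d+)")


def world_size(log_text: str):
    """Return the largest world size reported in the log, or None."""
    sizes = [int(m.group(3)) for m in MEMBER_RE.finditer(log_text)]
    return max(sizes) if sizes else None


# Lines copied from the run output in this issue.
sample_log = """\
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
"""

ranks = world_size(sample_log)
print(f"world size: {ranks}")
```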
> Surprisingly, this workload gets stuck without gVisor.
Interesting. This may be the same problem as in #9827, where the test got stuck on `runc`.
The program doesn't get stuck on `runc` in Modal. It completes in around 60s. A 72.35 second completion for gVisor lines up with that.
> I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though, hopefully it resolves whatever failure you are seeing.
🙏
@thundergolfer Let me know if e9b3218 fixes the issue. If so, please close this.