google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

nvproxy: unknown control command 0x3d05

thundergolfer opened this issue

Description

We're doing multi-GPU training on A100s and seeing that it gets stuck under gVisor. I tried the program below on the following GPUs within Modal:

  • A100 40 GiB (Oracle Cloud) ❌
  • H100 (a3-highgpu-8g) ❌
  • A10G ✔️
  • T4 ✔️

Both the H100 and A100 run into these unknown control commands:

W0509 01:16:28.218428  1772489 frontend.go:521] [   6:  20] nvproxy: unknown control command 0x3d05 (paramsSize=24)
W0509 01:16:28.218780  1772489 frontend.go:521] [   5:  22] nvproxy: unknown control command 0x3d05 (paramsSize=24)

That command is NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/083cd9cf17ab95cd6f9fb50a5349c21eaa2f7d4b/src/common/sdk/nvidia/inc/ctrl/ctrl0000/ctrl0000unix.h#L146-L147
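
For reference, NV0000-class RM control command numbers pack an interface/category ID in the high byte and a message ID in the low byte (NV0000_CTRL_OS_UNIX is 0x3D in NVIDIA's ctrl0000base.h), so the number alone identifies the command; a minimal decoding sketch in Python:

# Sketch: decode an NV0000-class RM control command number.
NV0000_CTRL_OS_UNIX = 0x3D     # "OS unix" interface ID from ctrl0000base.h

cmd = 0x3D05
category = (cmd >> 8) & 0xFF   # interface/category ID (high byte)
message_id = cmd & 0xFF        # message ID within that interface (low byte)

assert category == NV0000_CTRL_OS_UNIX
print(f"category=0x{category:02X} message_id=0x{message_id:02X}")
# category=0x3D message_id=0x05 -> NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD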

Steps to reproduce

FROM nvidia/cuda:12.2.0-devel-ubuntu20.04

RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed torch torchvision lightning numpy memory_profiler

COPY <<EOF repro.py
print("Hello from inside container.")
import psutil
current_process = psutil.Process()
parent_process = current_process.parent()
print(f"Processes: {current_process=} {parent_process=}")

import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L

from memory_profiler import profile

from torchvision.datasets import CIFAR100
from torchvision import transforms
from torchvision import models
from torch.utils.data import DataLoader

class MagixNet(L.LightningModule):
    def __init__(self, nbr_cat):
        super().__init__()

        module = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        module.fc = nn.Linear(2048, nbr_cat)

        self.module = module

    def forward(self, x):
        return self.module(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

def prepare_data():
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    train_ds = CIFAR100('data', train=True, download=True, transform=pipeline)
    train_dl = DataLoader(train_ds, batch_size=128, num_workers=4)

    val_ds = CIFAR100('data', train=False, download=True, transform=pipeline)
    val_dl = DataLoader(val_ds, batch_size=128, num_workers=4)

    return train_dl, val_dl

torch.set_float32_matmul_precision('medium')
train_dl, val_dl = prepare_data()
model = MagixNet(100)
trainer = L.Trainer(max_epochs=1, strategy="ddp_notebook")

start  = time.time()
trainer.fit(model, train_dl, val_dl)
print(f"Training duration (seconds): {time.time() - start:.2f}")
EOF

ENTRYPOINT ["python3", "repro.py"]
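
Not part of the original repro, but a small hypothetical pre-flight script (run the same way as repro.py) can confirm that the container sees every GPU and the NCCL backend before the full DDP run:

# Hypothetical pre-flight check (not in the original image): confirm GPU and
# NCCL visibility inside the container before the multi-GPU training run.
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")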

runsc version

runsc version 6e61813c1b37
spec: 1.1.0-rc.1

docker version (if using docker)

N/A

uname

No response

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

The reproduction program is almost identical to the one in #9827, which is why I revisited that issue's test.

This seems to be running fine for me on an A100-40GB machine in GCE on driver version 535.104.05:

(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='15:24:33') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9193099.59it/s] 
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 156MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type   | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M    Trainable params
0         Non-trainable params
23.7 M    Total params
94.852    Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00,  5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00,  5.62it/s, v_num=0]

-------------------------------------------------------------------------------
repro.py 63 <module>
print(f"Training duration (seconds): {time.time() - start:2.f}")

ValueError:
Format specifier missing precision
(base) ayushranjan_google_com@a100:~/issue10413$ nvidia-smi
Thu May  9 15:27:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0              49W / 400W |      4MiB / 40960MiB |     27%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Please note:

  • There seems to be an issue with the last print statement in repro.py. Other than that, the application seems to work fine.
  • I am using --shm-size=128g as per #9827 (comment).
  • The debug logs don't have any `nvproxy: unknown` lines.

So maybe you are using a different driver version? Or maybe it's something to do with the Oracle Cloud environment?

  • Oh yep, fixed that in the original description.
  • Our --shm-size is also set very large. On Oracle workers it's around 1657GB.

We have Driver Version: 535.129.03, CUDA Version: 12.2. Sorry, I should have included that in the issue originally!

On H100 worker:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:04:00.0 Off |                    0 |
| N/A   36C    P0             113W / 700W |  72459MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:05:00.0 Off |                    0 |
| N/A   34C    P0             117W / 700W |  72507MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0             114W / 700W |  72507MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   33C    P0             111W / 700W |  72587MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:84:00.0 Off |                    0 |
| N/A   60C    P0             578W / 700W |  71533MiB / 81559MiB |     95%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0             112W / 700W |    841MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0             114W / 700W |  16463MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:8B:00.0 Off |                    0 |
| N/A   34C    P0             111W / 700W |   2405MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    759790      C   /opt/conda/bin/python3.10                 72446MiB |

We use the same driver version across all GPU workers.

Updated the driver version and still cannot repro the failure on my GCE VM:

(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='16:01:41') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9140159.09it/s] 
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 74.1MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type   | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M    Trainable params
0         Non-trainable params
23.7 M    Total params
94.852    Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00,  5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00,  5.62it/s, v_num=0]
Training duration (seconds): 72.35

Surprisingly, this workload gets stuck without gVisor. I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though; hopefully it resolves whatever failure you are seeing.

> Surprisingly, this workload gets stuck without gVisor.

Interesting. This may be the same problem as in #9827 where the test got stuck on runc.

The program doesn't get stuck on runc in Modal; it completes in around 60s. The 72.35-second completion under gVisor lines up with that.

> I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though; hopefully it resolves whatever failure you are seeing.

🙏

@thundergolfer Let me know if e9b3218 fixes the issue. If so, please close this.