google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

nvproxy: unknown control command 0x3d05

thundergolfer opened this issue

Description

We're doing multi-GPU training on A100s and seeing that it gets stuck under gVisor. I tried the program below on the following GPUs within Modal:

  • A100 40 GiB (Oracle Cloud) ❌
  • H100 (a3-highgpu-8g) ❌
  • A10G ✔️
  • T4 ✔️

Both the H100 and A100 run into these unknown control commands:

W0509 01:16:28.218428  1772489 frontend.go:521] [   6:  20] nvproxy: unknown control command 0x3d05 (paramsSize=24)
W0509 01:16:28.218780  1772489 frontend.go:521] [   5:  22] nvproxy: unknown control command 0x3d05 (paramsSize=24)

That command is NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/083cd9cf17ab95cd6f9fb50a5349c21eaa2f7d4b/src/common/sdk/nvidia/inc/ctrl/ctrl0000/ctrl0000unix.h#L146-L147
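
For reference, NV0000-class RM control command numbers pack an interface/category ID in the high byte and a message ID in the low byte (NV0000_CTRL_OS_UNIX is 0x3D in NVIDIA's ctrl0000base.h), so the number alone identifies the command; a minimal decoding sketch in Python:

# Sketch: decode an NV0000-class RM control command number.
NV0000_CTRL_OS_UNIX = 0x3D     # "OS unix" interface ID from ctrl0000base.h

cmd = 0x3D05
category = (cmd >> 8) & 0xFF   # interface/category ID (high byte)
message_id = cmd & 0xFF        # message ID within that interface (low byte)

assert category == NV0000_CTRL_OS_UNIX
print(f"category=0x{category:02X} message_id=0x{message_id:02X}")
# category=0x3D message_id=0x05 -> NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD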

Steps to reproduce

FROM nvidia/cuda:12.2.0-devel-ubuntu20.04

RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed torch torchvision lightning numpy memory_profiler

COPY <<EOF repro.py
print("Hello from inside container.")
import psutil
current_process = psutil.Process()
parent_process = current_process.parent()
print(f"Processes: {current_process=} {parent_process=}")

import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L

from memory_profiler import profile

from torchvision.datasets import CIFAR100
from torchvision import transforms
from torchvision import models
from torch.utils.data import DataLoader

class MagixNet(L.LightningModule):
    def __init__(self, nbr_cat):
        super().__init__()

        module = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        module.fc = nn.Linear(2048, nbr_cat)

        self.module = module

    def forward(self, x):
        return self.module(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

def prepare_data():
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    train_ds = CIFAR100('data', train=True, download=True, transform=pipeline)
    train_dl = DataLoader(train_ds, batch_size=128, num_workers=4)

    val_ds = CIFAR100('data', train=False, download=True, transform=pipeline)
    val_dl = DataLoader(val_ds, batch_size=128, num_workers=4)

    return train_dl, val_dl

torch.set_float32_matmul_precision('medium')
train_dl, val_dl = prepare_data()
model = MagixNet(100)
trainer = L.Trainer(max_epochs=1, strategy="ddp_notebook")

start  = time.time()
trainer.fit(model, train_dl, val_dl)
print(f"Training duration (seconds): {time.time() - start:.2f}")
EOF

ENTRYPOINT ["python3", "repro.py"]
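
Not part of the original repro, but a small hypothetical pre-flight script (run the same way as repro.py) can confirm that the container sees every GPU and the NCCL backend before the full DDP run:

# Hypothetical pre-flight check (not in the original image): confirm GPU and
# NCCL visibility inside the container before the multi-GPU training run.
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")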

runsc version

runsc version 6e61813c1b37
spec: 1.1.0-rc.1

docker version (if using docker)

N/A

uname

No response

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

The reproduction program is almost identical to the one in #9827, which is why I revisited that issue's test.

This seems to be running fine for me on an A100-40GB machine in GCE on driver version 535.104.05:

(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='15:24:33') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9193099.59it/s] 
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 156MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type   | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M    Trainable params
0         Non-trainable params
23.7 M    Total params
94.852    Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00,  5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00,  5.62it/s, v_num=0]

-------------------------------------------------------------------------------
repro.py 63 <module>
print(f"Training duration (seconds): {time.time() - start:2.f}")

ValueError:
Format specifier missing precision
(base) ayushranjan_google_com@a100:~/issue10413$ nvidia-smi
Thu May  9 15:27:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0              49W / 400W |      4MiB / 40960MiB |     27%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Please note:

  • There seems to be an issue with the last print statement in repro.py. Other than that, the application seems to work fine.
  • I am using --shm-size=128g as per #9827 (comment).
  • The debug logs don't have any `nvproxy: unknown` lines.

So maybe you are using a different driver version? Or maybe it's something to do with the Oracle Cloud environment?

  • Oh yep, fixed that in the original description.
  • Our --shm-size is also set very large. On Oracle workers it's around 1657GB.

We have Driver Version: 535.129.03, CUDA Version: 12.2. Sorry, I should have included that in the issue originally!

On H100 worker:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:04:00.0 Off |                    0 |
| N/A   36C    P0             113W / 700W |  72459MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:05:00.0 Off |                    0 |
| N/A   34C    P0             117W / 700W |  72507MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0             114W / 700W |  72507MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   33C    P0             111W / 700W |  72587MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:84:00.0 Off |                    0 |
| N/A   60C    P0             578W / 700W |  71533MiB / 81559MiB |     95%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0             112W / 700W |    841MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0             114W / 700W |  16463MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:8B:00.0 Off |                    0 |
| N/A   34C    P0             111W / 700W |   2405MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    759790      C   /opt/conda/bin/python3.10                 72446MiB |

We use the same driver version across all GPU workers.

Updated the driver version and still cannot repro the failure on my GCE VM:

(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='16:01:41') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9140159.09it/s] 
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 74.1MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type   | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M    Trainable params
0         Non-trainable params
23.7 M    Total params
94.852    Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00,  5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00,  5.62it/s, v_num=0]
Training duration (seconds): 72.35

Surprisingly, this workload gets stuck without gVisor. I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though; hopefully it resolves whatever failure you are seeing.

> Surprisingly, this workload gets stuck without gVisor.

Interesting. This may be the same problem as in #9827 where the test got stuck on runc.

The program doesn't get stuck on runc in Modal; it completes in around 60s. The 72.35-second completion under gVisor lines up with that.

> I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though; hopefully it resolves whatever failure you are seeing.

🙏

@thundergolfer Let me know if e9b3218 fixes the issue. If so, please close this.