google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

'nvproxy: unknown frontend ioctl 212 == 0xd4' when doing DDP training on H100

thundergolfer opened this issue

Description

Just ran into this and want to get it out in the open. Haven't spent time seeing if it's quick to fix myself.

grep -C 3 "nvproxy: unknow"  runsc.log.20240412-155449.022921.boot.txt
I0412 15:55:45.690845  1008391 strace.go:605] [  13:  35] python X accept(0x3b socket:[79], 0x7f4f1a5df1e0, 0x7f4f1a5df148 {length=28}) = 0 (0x0) errno=11 (request would block) (477ns)
D0412 15:55:45.690841  1008391 frontend.go:154] [  15:  42] nvproxy: frontend ioctl: nr = 0x000000d4, argSize = 0x00000080
I0412 15:55:45.690855  1008391 strace.go:570] [  16:  16] python E clock_nanosleep(0x0, 0x0, 0x7fff54247100 {sec=0 nsec=1000}, 0x0)
W0412 15:55:45.690854  1008391 frontend.go:180] [  15:  42] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)
I0412 15:55:45.690861  1008391 strace.go:567] [  15:  36] python E accept(0x3c socket:[71], 0x7f5c57ddf1e0, 0x7f5c57ddf148 {length=28})
I0412 15:55:45.690863  1008391 strace.go:605] [  15:  42] python X ioctl(0x44 /dev/nvidiactl, 0xc08046d4, 0x7f5c573fcb90) = 0 (0x0) errno=22 (invalid argument) (21.739µs)
I0412 15:55:45.690863  1008391 strace.go:576] [  16:  38] python E recvfrom(0x40 socket:[109], 0x7ff887ddf150, 0x10, 0x40, 0x0, null)
--
I0412 15:55:45.719945  1008391 strace.go:605] [  14:  40] python X ioctl(0x3 /dev/nvidiactl, 0xc020462a, 0x7f0e269fb970) = 0 (0x0) (14.718µs)
D0412 15:55:45.719945  1008391 frontend.go:154] [  16:  41] nvproxy: frontend ioctl: nr = 0x000000d4, argSize = 0x00000080
I0412 15:55:45.719947  1008391 strace.go:608] [  14:  14] python X clock_nanosleep(0x0, 0x0, 0x7f0aaaa20100 {sec=0 nsec=1000}, null) = 0 (0x0) (4.897µs)
W0412 15:55:45.719951  1008391 frontend.go:180] [  16:  41] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)
I0412 15:55:45.719936  1008391 strace.go:570] [  15:  15] python E clock_nanosleep(0x0, 0x0, 0x7f11361fe100 {sec=0 nsec=1000}, 0x0)
I0412 15:55:45.719946  1008391 strace.go:605] [  13:  35] python X recvmsg(0x40 socket:[135], 0x7f4f1a5df070 {name=0x0, namelen=0, iovecs=0x7f4f1a5df040 {base=0x7f4f1a5df03f, len=1}, control=0x7f4f1a5df050, control_len=24, flags=0}, 0x0) = 0 (0x0) errno=11 (request would block) (1.456µs)
I0412 15:55:45.719952  1008391 strace.go:608] [  16:  16] python X clock_nanosleep(0x0, 0x0, 0x7fff54247100 {sec=0 nsec=1000}, null) = 0 (0x0) (11.151µs)
--
I0412 15:55:45.720114  1008391 strace.go:570] [  15:  15] python E clock_nanosleep(0x0, 0x0, 0x7f11361fe100 {sec=0 nsec=1000}, 0x0)
D0412 15:55:45.720087  1008391 frontend.go:154] [  14:  40] nvproxy: frontend ioctl: nr = 0x000000d4, argSize = 0x00000080
I0412 15:55:45.720139  1008391 strace.go:608] [  14:  14] python X clock_nanosleep(0x0, 0x0, 0x7f0aaaa20100 {sec=0 nsec=1000}, null) = 0 (0x0) (4.953µs)
W0412 15:55:45.720140  1008391 frontend.go:180] [  14:  40] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)
I0412 15:55:45.720143  1008391 strace.go:608] [  13:  13] python X clock_nanosleep(0x0, 0x0, 0x7f18483c8120 {sec=0 nsec=1000}, null) = 0 (0x0) (10.749µs)
I0412 15:55:45.720145  1008391 strace.go:605] [  14:  40] python X ioctl(0x45 /dev/nvidiactl, 0xc08046d4, 0x7f0e269fcb90) = 0 (0x0) errno=22 (invalid argument) (57.454µs)
I0412 15:55:45.720143  1008391 strace.go:614] [  16:  38] python X sendto(0x42 socket:[136], 0x7ff887ddf08c, 0x4, 0x4040, null, 0x0) = 4 (0x4) (14.349µs)
--
I0412 15:55:45.722291  1008391 strace.go:567] [  13:  39] python E ioctl(0x42 /dev/nvidiactl, 0xc08046d4, 0x7f4f1affcb90)
D0412 15:55:45.722326  1008391 frontend.go:154] [  13:  39] nvproxy: frontend ioctl: nr = 0x000000d4, argSize = 0x00000080
I0412 15:55:45.722326  1008391 strace.go:608] [  15:  15] python X clock_nanosleep(0x0, 0x0, 0x7f11361fe100 {sec=0 nsec=1000}, null) = 0 (0x0) (14.84µs)
W0412 15:55:45.722330  1008391 frontend.go:180] [  13:  39] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)
I0412 15:55:45.722320  1008391 strace.go:605] [  15:  36] python X recvmsg(0x45 socket:[134], 0x7f5c57ddf070 {name=0x0, namelen=0, iovecs=0x7f5c57ddf040 {base=0x7f5c57ddf03f, len=1}, control=0x7f5c57ddf050, control_len=24, flags=0}, 0x0) = 0 (0x0) errno=11 (request would block) (1.129µs)
I0412 15:55:45.722328  1008391 strace.go:608] [  16:  16] python X clock_nanosleep(0x0, 0x0, 0x7fff54247100 {sec=0 nsec=1000}, null) = 0 (0x0) (5.468µs)
I0412 15:55:45.722337  1008391 strace.go:567] [  13:  35] python E recvmsg(0x40 socket:[135], 0x7f4f1a5df070 {name=0x0, namelen=0, iovecs=0x7f4f1a5df040 {base=0x7f4f1a5df03f, len=1}, control=0x7f4f1a5df050, control_len=24, flags=0}, 0x0)
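For anyone triaging, the cmd value in the warning can be unpacked with the generic Linux ioctl bit layout (direction, size, type, number). The sketch below assumes the asm-generic encoding used on x86-64; decode_ioctl is a hypothetical helper name, not part of gVisor or the NVIDIA driver. It only confirms that the fields agree with what nvproxy logged (nr = 0xd4, argSize = 0x80) and says nothing about what the ioctl does.

```python
# Generic Linux _IOC layout (asm-generic, as used on x86-64):
#   bits 0-7   nr
#   bits 8-15  type
#   bits 16-29 size
#   bits 30-31 direction
IOC_NRBITS, IOC_TYPEBITS, IOC_SIZEBITS = 8, 8, 14
IOC_NRSHIFT = 0
IOC_TYPESHIFT = IOC_NRSHIFT + IOC_NRBITS
IOC_SIZESHIFT = IOC_TYPESHIFT + IOC_TYPEBITS
IOC_DIRSHIFT = IOC_SIZESHIFT + IOC_SIZEBITS

# Direction values in the generic encoding.
IOC_DIRS = {0: "none", 1: "write", 2: "read", 3: "read|write"}


def decode_ioctl(cmd: int) -> dict:
    """Split a 32-bit ioctl command into its four encoded fields."""
    return {
        "dir": IOC_DIRS[(cmd >> IOC_DIRSHIFT) & 0x3],
        "size": (cmd >> IOC_SIZESHIFT) & ((1 << IOC_SIZEBITS) - 1),
        "type": chr((cmd >> IOC_TYPESHIFT) & 0xFF),
        "nr": (cmd >> IOC_NRSHIFT) & 0xFF,
    }


# cmd value taken from the nvproxy warning in the log above.
decoded = decode_ioctl(0xC08046D4)
print(decoded)
```

The decoded nr (212) and size (128) match the "unknown frontend ioctl 212 == 0xd4 (argSize=128, ...)" warning, so the rejected call is the same one strace reports failing with errno=22 on /dev/nvidiactl.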

Steps to reproduce

I unfortunately don't have much of a chance of getting a devbox with an H100 on it. But we have an open-source reproduction of the issue on Modal:

git clone git@github.com:amaciaszek-dsai/super-gradients.git
cd super-gradients
git checkout -t origin/modal_ddp
pip install .
modal run src/super_gradients/train_from_recipe_on_modal.py

On Modal, this program completes if gpu = modal.gpu.A10G(count=4) is used instead of gpu = modal.gpu.H100(count=4).

On H100s, it gets stuck.

runsc version

runsc version 6b93f104576f
spec: 1.1.0-rc.1

6b93f104576f1240590714d0058bec3cb9d738ef

docker version (if using docker)

No response

uname

Linux gcp-h100-us-east4-a-0-a9450787-286a-48a9-8eb1-cce882bbf0a7 5.15.0-204.147.6.3.el9uek.x86_64 #2 SMP Mon Apr 1 10:28:43 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

The full .boot.txt file is over 7GiB. I can upload it to S3 if needed, but the nvproxy log snippet shown above should be sufficient.
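Since the full log is too large to share comfortably, a streaming pass can at least report which unknown frontend ioctls appear and how often. This is a minimal sketch, assuming the warning format matches the lines quoted in the description; summarize_unknown_ioctls is a hypothetical helper, and the sample lines are copied from the snippet above rather than read from the real file.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the warning emitted at frontend.go:180 in the quoted log.
UNKNOWN_IOCTL = re.compile(
    r"nvproxy: unknown frontend ioctl (\d+) == 0x([0-9a-f]+) "
    r"\(argSize=(\d+), cmd=0x([0-9a-f]+)\)"
)


def summarize_unknown_ioctls(lines: Iterable[str]) -> Counter:
    """Count occurrences of each (nr, argSize, cmd) triple, one line at a time."""
    counts = Counter()
    for line in lines:
        m = UNKNOWN_IOCTL.search(line)
        if m:
            key = (int(m.group(1)), int(m.group(3)), int(m.group(4), 16))
            counts[key] += 1
    return counts


# Sample lines copied from the log snippet in the description.
sample = [
    "D0412 15:55:45.690841  1008391 frontend.go:154] [  15:  42] nvproxy: frontend ioctl: nr = 0x000000d4, argSize = 0x00000080",
    "W0412 15:55:45.690854  1008391 frontend.go:180] [  15:  42] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)",
    "W0412 15:55:45.719951  1008391 frontend.go:180] [  16:  41] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)",
]
summary = summarize_unknown_ioctls(sample)
for (nr, arg_size, cmd), n in summary.items():
    print(f"nr={nr} argSize={arg_size} cmd={cmd:#x} count={n}")
```

To run it against the real file, pass an open file object instead of the sample list, for example open(path, errors="replace"), so the 7GiB log is read line by line and never loaded into memory at once.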