'nvproxy: unknown frontend ioctl 212 == 0xd4' when doing DDP training on H100
thundergolfer opened this issue · comments
Description
Just ran into this and want to get it out in the open. Haven't spent time seeing if it's quick to fix myself.
grep -C 3 "nvproxy: unknow" runsc.log.20240412-155449.022921.boot.txt
I0412 15:55:45.690845 1008391 strace.go:605] [ 13: 35] python X accept(0x3b socket:[79], 0x7f4f1a5df1e0, 0x7f4f1a5df148 {length=28}) = 0 (0x0) errno=11 (request would block) (477ns)
D0412 15:55:45.690841 1008391 frontend.go:154] [ 15: 42] nvproxy: frontend ioctl: nr = 0x000000d4, argSize = 0x00000080
I0412 15:55:45.690855 1008391 strace.go:570] [ 16: 16] python E clock_nanosleep(0x0, 0x0, 0x7fff54247100 {sec=0 nsec=1000}, 0x0)
W0412 15:55:45.690854 1008391 frontend.go:180] [ 15: 42] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)
I0412 15:55:45.690861 1008391 strace.go:567] [ 15: 36] python E accept(0x3c socket:[71], 0x7f5c57ddf1e0, 0x7f5c57ddf148 {length=28})
I0412 15:55:45.690863 1008391 strace.go:605] [ 15: 42] python X ioctl(0x44 /dev/nvidiactl, 0xc08046d4, 0x7f5c573fcb90) = 0 (0x0) errno=22 (invalid argument) (21.739µs)
I0412 15:55:45.690863 1008391 strace.go:576] [ 16: 38] python E recvfrom(0x40 socket:[109], 0x7ff887ddf150, 0x10, 0x40, 0x0, null)
--
I0412 15:55:45.719945 1008391 strace.go:605] [ 14: 40] python X ioctl(0x3 /dev/nvidiactl, 0xc020462a, 0x7f0e269fb970) = 0 (0x0) (14.718µs)
D0412 15:55:45.719945 1008391 frontend.go:154] [ 16: 41] nvproxy: frontend ioctl: nr = 0x000000d4, argSize = 0x00000080
I0412 15:55:45.719947 1008391 strace.go:608] [ 14: 14] python X clock_nanosleep(0x0, 0x0, 0x7f0aaaa20100 {sec=0 nsec=1000}, null) = 0 (0x0) (4.897µs)
W0412 15:55:45.719951 1008391 frontend.go:180] [ 16: 41] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)
I0412 15:55:45.719936 1008391 strace.go:570] [ 15: 15] python E clock_nanosleep(0x0, 0x0, 0x7f11361fe100 {sec=0 nsec=1000}, 0x0)
I0412 15:55:45.719946 1008391 strace.go:605] [ 13: 35] python X recvmsg(0x40 socket:[135], 0x7f4f1a5df070 {name=0x0, namelen=0, iovecs=0x7f4f1a5df040 {base=0x7f4f1a5df03f, len=1}, control=0x7f4f1a5df050, control_len=24, flags=0}, 0x0) = 0 (0x0) errno=11 (request would block) (1.456µs)
I0412 15:55:45.719952 1008391 strace.go:608] [ 16: 16] python X clock_nanosleep(0x0, 0x0, 0x7fff54247100 {sec=0 nsec=1000}, null) = 0 (0x0) (11.151µs)
--
I0412 15:55:45.720114 1008391 strace.go:570] [ 15: 15] python E clock_nanosleep(0x0, 0x0, 0x7f11361fe100 {sec=0 nsec=1000}, 0x0)
D0412 15:55:45.720087 1008391 frontend.go:154] [ 14: 40] nvproxy: frontend ioctl: nr = 0x000000d4, argSize = 0x00000080
I0412 15:55:45.720139 1008391 strace.go:608] [ 14: 14] python X clock_nanosleep(0x0, 0x0, 0x7f0aaaa20100 {sec=0 nsec=1000}, null) = 0 (0x0) (4.953µs)
W0412 15:55:45.720140 1008391 frontend.go:180] [ 14: 40] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)
I0412 15:55:45.720143 1008391 strace.go:608] [ 13: 13] python X clock_nanosleep(0x0, 0x0, 0x7f18483c8120 {sec=0 nsec=1000}, null) = 0 (0x0) (10.749µs)
I0412 15:55:45.720145 1008391 strace.go:605] [ 14: 40] python X ioctl(0x45 /dev/nvidiactl, 0xc08046d4, 0x7f0e269fcb90) = 0 (0x0) errno=22 (invalid argument) (57.454µs)
I0412 15:55:45.720143 1008391 strace.go:614] [ 16: 38] python X sendto(0x42 socket:[136], 0x7ff887ddf08c, 0x4, 0x4040, null, 0x0) = 4 (0x4) (14.349µs)
--
I0412 15:55:45.722291 1008391 strace.go:567] [ 13: 39] python E ioctl(0x42 /dev/nvidiactl, 0xc08046d4, 0x7f4f1affcb90)
D0412 15:55:45.722326 1008391 frontend.go:154] [ 13: 39] nvproxy: frontend ioctl: nr = 0x000000d4, argSize = 0x00000080
I0412 15:55:45.722326 1008391 strace.go:608] [ 15: 15] python X clock_nanosleep(0x0, 0x0, 0x7f11361fe100 {sec=0 nsec=1000}, null) = 0 (0x0) (14.84µs)
W0412 15:55:45.722330 1008391 frontend.go:180] [ 13: 39] nvproxy: unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)
I0412 15:55:45.722320 1008391 strace.go:605] [ 15: 36] python X recvmsg(0x45 socket:[134], 0x7f5c57ddf070 {name=0x0, namelen=0, iovecs=0x7f5c57ddf040 {base=0x7f5c57ddf03f, len=1}, control=0x7f5c57ddf050, control_len=24, flags=0}, 0x0) = 0 (0x0) errno=11 (request would block) (1.129µs)
I0412 15:55:45.722328 1008391 strace.go:608] [ 16: 16] python X clock_nanosleep(0x0, 0x0, 0x7fff54247100 {sec=0 nsec=1000}, null) = 0 (0x0) (5.468µs)
I0412 15:55:45.722337 1008391 strace.go:567] [ 13: 35] python E recvmsg(0x40 socket:[135], 0x7f4f1a5df070 {name=0x0, namelen=0, iovecs=0x7f4f1a5df040 {base=0x7f4f1a5df03f, len=1}, control=0x7f4f1a5df050, control_len=24, flags=0}, 0x0)
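For reference, the rejected request number 0xc08046d4 can be split into its fields using the generic Linux _IOC bit layout (8-bit nr, 8-bit type, 14-bit size, 2-bit direction). The sketch below assumes that layout, which applies on x86_64; the helper name decode_ioctl is made up for illustration and is not part of gVisor or the NVIDIA driver.

```python
def decode_ioctl(cmd: int) -> dict:
    """Split a Linux ioctl request number into its _IOC fields.

    Assumes the asm-generic layout used on x86_64:
    bits 0-7 nr, bits 8-15 type, bits 16-29 size, bits 30-31 direction.
    """
    return {
        "nr": cmd & 0xFF,              # command number within the driver
        "type": (cmd >> 8) & 0xFF,     # driver "magic" byte
        "size": (cmd >> 16) & 0x3FFF,  # argument struct size in bytes
        "dir": (cmd >> 30) & 0x3,      # 1 = write, 2 = read, 3 = read and write
    }


fields = decode_ioctl(0xC08046D4)
print(fields)
# nr=212 (0xd4), type=0x46 ('F'), size=128, dir=3
```

The decoded nr and size agree with the nr = 0x000000d4 and argSize = 0x00000080 values that nvproxy logs for each failing call.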
Steps to reproduce
I unfortunately don't have much of a chance of getting a devbox with an H100 on it. But we have an open-source reproduction of the issue on Modal:
git clone git@github.com:amaciaszek-dsai/super-gradients.git
cd super-gradients
git checkout -t origin/modal_ddp
pip install .
modal run src/super_gradients/train_from_recipe_on_modal.py
On Modal this program will complete if gpu = modal.gpu.A10G(count=4) is used instead of gpu = modal.gpu.H100(count=4).
But on H100s, it gets stuck.
runsc version
runsc version 6b93f104576f
spec: 1.1.0-rc.1
6b93f104576f1240590714d0058bec3cb9d738ef
docker version (if using docker)
No response
uname
Linux gcp-h100-us-east4-a-0-a9450787-286a-48a9-8eb1-cce882bbf0a7 5.15.0-204.147.6.3.el9uek.x86_64 #2 SMP Mon Apr 1 10:28:43 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
The full .boot.txt file is over 7GiB. I can upload it to S3 if you want, but it should be sufficient to have just the nvproxy logs snippet shown above.
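Since the boot log is too large to share comfortably, a streaming pass can summarize which ioctls nvproxy rejected without loading the file into memory. This is a minimal sketch: the regular expression is inferred from the warning lines quoted in this issue, and count_unknown_ioctls is a hypothetical helper, not a gVisor tool.

```python
import re
from collections import Counter

# Pattern inferred from the "unknown frontend ioctl" warnings quoted in this issue.
UNKNOWN_IOCTL = re.compile(
    r"nvproxy: unknown frontend ioctl (\d+) == 0x[0-9a-f]+ "
    r"\(argSize=(\d+), cmd=(0x[0-9a-f]+)\)"
)


def count_unknown_ioctls(lines):
    """Count (nr, arg_size, cmd) tuples over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = UNKNOWN_IOCTL.search(line)
        if match:
            key = (int(match.group(1)), int(match.group(2)), match.group(3))
            counts[key] += 1
    return counts


# Two lines copied from the log excerpt in this issue.
sample = [
    "W0412 15:55:45.690854 1008391 frontend.go:180] [ 15: 42] nvproxy: "
    "unknown frontend ioctl 212 == 0xd4 (argSize=128, cmd=0xc08046d4)",
    "I0412 15:55:45.690861 1008391 strace.go:567] [ 15: 36] python E "
    "accept(0x3c socket:[71], 0x7f5c57ddf1e0, 0x7f5c57ddf148 {length=28})",
]
print(count_unknown_ioctls(sample))
```

To run it against the real log, pass an open file object, for example open("runsc.log.20240412-155449.022921.boot.txt", errors="replace"), so lines are read one at a time.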