any1 / wlvncc

A Wayland Native VNC Client

dmabuf issue with multiple GPUs

retrotails opened this issue · comments

As best as I can tell, this is caused by multiple GPUs, but I'm not certain.
On my system with a dedicated GPU plus an iGPU (AMD + Intel), wlvncc crashes before showing anything on screen.
It works when I have just my integrated GPU, and it works on another system with no iGPU and just the dedicated GPU.
Software rendering with "-s" makes it work.
(Side note: why does the server seem to stop using VAAPI to encode when the client uses software decoding? This is because software mode doesn't support h264, which forces the server to use something else like "tight".)
I tried running with WAYLAND_DEBUG=client; unfortunately that doesn't look very helpful, since the output is nearly identical to the output on my machine that works. Here's the end of the log:

03/11/2022 01:38:50 VNC server supports protocol version 3.8 (viewer 3.8)
03/11/2022 01:38:50 We have 1 security types to read
03/11/2022 01:38:50 0) Received security type 1
03/11/2022 01:38:50 Selecting security type 1 (0/1 in the list)
03/11/2022 01:38:50 Selected Security Scheme 1
03/11/2022 01:38:50 No authentication needed
03/11/2022 01:38:50 VNC authentication succeeded
03/11/2022 01:38:50 Desktop name "WayVNC"
03/11/2022 01:38:50 Connected to VNC server, using protocol version 3.8
03/11/2022 01:38:50 VNC server default format:
03/11/2022 01:38:50   32 bits per pixel.
03/11/2022 01:38:50   Least significant byte first in each pixel.
03/11/2022 01:38:50   TRUE colour: max red 255 green 255 blue 255, shift red 16 green 8 blue 0
[1597053.041]  -> wl_compositor@6.create_surface(new id wl_surface@3)
[1597053.045]  -> xdg_wm_base@7.get_xdg_surface(new id xdg_surface@13, wl_surface@3)
[1597053.048]  -> xdg_surface@13.get_toplevel(new id xdg_toplevel@14)
[1597053.050]  -> xdg_toplevel@14.set_app_id("wlvncc")
[1597053.052]  -> xdg_toplevel@14.set_title("WayVNC")
[1597053.054]  -> wl_surface@3.commit()
[1597078.251]  -> zwp_linux_dmabuf_v1@5.create_params(new id zwp_linux_buffer_params_v1@15)
[1597078.311]  -> zwp_linux_buffer_params_v1@15.add(fd 19, 0, 0, 30720, 16777216, 2)
[1597078.315]  -> zwp_linux_buffer_params_v1@15.create_immed(new id wl_buffer@16, 7680, 4200, 875713112, 0)
[1597078.318]  -> zwp_linux_buffer_params_v1@15.destroy()
[1597105.807]  -> zwp_linux_dmabuf_v1@5.create_params(new id zwp_linux_buffer_params_v1@17)
[1597105.837]  -> zwp_linux_buffer_params_v1@17.add(fd 20, 0, 0, 30720, 16777216, 2)
[1597105.840]  -> zwp_linux_buffer_params_v1@17.create_immed(new id wl_buffer@18, 7680, 4200, 875713112, 0)
[1597105.842]  -> zwp_linux_buffer_params_v1@17.destroy()
[1597123.747]  -> zwp_linux_dmabuf_v1@5.create_params(new id zwp_linux_buffer_params_v1@19)
[1597123.765]  -> zwp_linux_buffer_params_v1@19.add(fd 21, 0, 0, 30720, 16777216, 2)
[1597123.767]  -> zwp_linux_buffer_params_v1@19.create_immed(new id wl_buffer@20, 7680, 4200, 875713112, 0)
[1597123.770]  -> zwp_linux_buffer_params_v1@19.destroy()
[1597124.074] wl_display@1.error(nil, 7, "importing the supplied dmabufs failed")
[destroyed object]: error 7: importing the supplied dmabufs failed
wlvncc: ../src/main.c:230: on_wayland_event: Assertion `rc == 0' failed.
fish: Job 1, 'env WAYLAND_DEBUG=client ./buil…' terminated by signal SIGABRT (Abort)

I tried editing main.c to force several other DRM_FORMAT_* options, and while the WAYLAND_DEBUG output confirmed those changes took effect ("875713112" became "875709016", for example), the error remained identical, so I don't think it's related to the pixel format. I also tried setting the server resolution smaller (1920x1080), disabling the client's display scaling, blacklisting i915, reverting to older wlvncc commits, etc. So far, only enabling software rendering gets it to work.
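For reference, those numbers are DRM fourcc codes: 875713112 is "XR24" (DRM_FORMAT_XRGB8888) and 875709016 is "XB24" (DRM_FORMAT_XBGR8888). A tiny standalone sketch (not wlvncc code) to decode such values:

#include <stdint.h>
#include <stdio.h>

/* DRM format codes are little-endian fourcc values; print the four characters. */
static void print_drm_fourcc(uint32_t fmt)
{
	printf("%u = \"%c%c%c%c\"\n", fmt,
			(char)(fmt & 0xff), (char)((fmt >> 8) & 0xff),
			(char)((fmt >> 16) & 0xff), (char)((fmt >> 24) & 0xff));
}

int main(void)
{
	print_drm_fourcc(875713112); /* "XR24" = DRM_FORMAT_XRGB8888 */
	print_drm_fourcc(875709016); /* "XB24" = DRM_FORMAT_XBGR8888 */
	return 0;
}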

Render node gets chosen here: https://github.com/any1/wlvncc/blob/master/src/main.c#L707

It's probably just choosing the wrong GPU. There's a recent protocol extension which lets the compositor tell the client which render node to choose, or rather, it just passes the fd to the client.

The hw video decoder needs GL rendering. Without it, we would have to download whole frames from the GPU and dump them into SHM buffers which isn't very efficient.

I tried a hack to use the other device

-for (int i = 0; i < n; ++i) {
+for (int i = 1; i < n; ++i) {

(also, n == 2)
With this, wlvncc starts, but the window is completely black. Input works and there are no errors in the console.
I thought maybe it was trying to use the hardware decoder from the wrong GPU, so I also tried explicitly setting the VAAPI device to both options I have, and that seems to have no effect: it doesn't crash, but the window is still black no matter which renderD* I pick.

// open-h264.c:96
if (av_hwdevice_ctx_create(&context->hwctx_ref, AV_HWDEVICE_TYPE_VAAPI,
			"/dev/dri/renderD129", NULL, 0) != 0)

Does choosing a different render node work if you use a different encoding method such as "tight"?

no, still black. I also tried "raw"

That rules out the h264 decoder as the source of black frames.

Trying to look into it again with WAYLAND_DEBUG enabled and the render node hack applied, the last line here stands out:

[1158330.898]  -> xdg_toplevel@15.set_app_id("wlvncc")
[1158330.900]  -> xdg_toplevel@15.set_title("WayVNC")
[1158330.902]  -> wl_surface@3.commit()
[1158331.386]  -> zwp_linux_dmabuf_v1@5.create_params(new id zwp_linux_buffer_params_v1@16)
[1158331.412]  -> zwp_linux_buffer_params_v1@16.add(fd 21, 0, 0, 15360, 16777215, 4294967295)

compared to output on my device that works:

[1373309.858]  -> zwp_linux_buffer_params_v1@15.add(fd 19, 0, 0, 15360, 16777216, 2)

That modifier appears to be "DRM_FORMAT_MOD_INVALID":
https://wayland.app/protocols/linux-dmabuf-unstable-v1#zwp_linux_buffer_params_v1:request:add
https://github.com/any1/wlvncc/blob/master/protocols/linux-dmabuf-unstable-v1.xml#L142
When I don't have the render node hack enabled (and thus wlvncc crashes), the values are "16777216, 2", just like on the working machine.
It's difficult for me to search further; the Wayland documentation is not easy to follow.
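For what it's worth, the last two arguments to zwp_linux_buffer_params_v1.add are the high and low 32 bits of the 64-bit format modifier, so the broken run can be checked against libdrm's constant with a quick standalone sketch (build with pkg-config --cflags libdrm):

#include <stdint.h>
#include <stdio.h>
#include <drm_fourcc.h> /* DRM_FORMAT_MOD_INVALID */

int main(void)
{
	/* broken run:  add(..., 16777215, 4294967295) */
	uint64_t bad = ((uint64_t)16777215 << 32) | 4294967295u;
	/* working run: add(..., 16777216, 2) */
	uint64_t good = ((uint64_t)16777216 << 32) | 2u;

	printf("bad  = 0x%016llx (%s)\n", (unsigned long long)bad,
			bad == DRM_FORMAT_MOD_INVALID ? "DRM_FORMAT_MOD_INVALID" : "valid modifier");
	printf("good = 0x%016llx\n", (unsigned long long)good);
	return 0;
}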

Edit: also, the render node hack chooses renderD129, which is the integrated graphics and probably not the correct card, while the default behavior chooses renderD128, which is likely the correct card, yet that crashes. My guess is that forcing the wrong render node causes the "DRM_FORMAT_MOD_INVALID", which somehow lets the wlvncc process skip the code that crashes and continue on to create an (albeit black) window. If it were possible to select a completely invalid render node, that would probably also create a black window rather than crash.

I managed to hack things well enough to work for my specific use case.
First I disabled my iGPU:
echo -n "0000:00:02.0" > /sys/bus/pci/drivers/i915/unbind
After that, wlvncc would still crash, with a bizarre error that I don't think is helpful:

06/11/2022 21:18:14 Unknown rect encoding 50
Exiting...

The above crash was fixed when I explicitly set the render device to /dev/dri/renderD129 in open-h264.c, as in my comment above, and wlvncc then works as normal. When I disable my iGPU, renderD129 is the only device that shows up, skipping renderD128, which apparently trips up something and causes the above error.

These hacks get my setup functional, but they don't help anyone who doesn't want to disable one of their GPUs. The other downside is that the hardware decoder in my iGPU is better than the one in my dedicated GPU, so performance would be better if it were possible to use it.
Also, my above comment's edit was partially wrong: renderD128 is the iGPU and renderD129 is the dedicated GPU. That continues to be the case after the iGPU is disabled, with renderD128/card0 disappearing. I was misled by "vainfo", which will happily give you info for the first card it finds, and doesn't tell you when your command-line arguments have a typo...
This does at least tell me that something is trying to use the Intel GPU when it shouldn't be, because explicitly setting everything I know of to use the dedicated GPU fails when the iGPU is visible and works when the iGPU is disabled.

It is worth noting that the decoder hardware is chosen internally by libavcodec. This is likely to cause problems. I think the API makes it possible to choose, so we should look into that.

is that not what this does?

if (av_hwdevice_ctx_create(&context->hwctx_ref, AV_HWDEVICE_TYPE_VAAPI,
-			NULL, NULL, 0) != 0)
+			"/dev/dri/renderD129", NULL, 0) != 0)

This hack is necessary in addition to disabling my iGPU.
With both GPUs enabled, I've tried all four combinations of render node and av_hwdevice_ctx_create device, and none of them work.

Yeah, that's how you do it.

I suppose av_hwframe_map might be failing. You could try replacing AV_HWFRAME_MAP_DIRECT with 0.
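In other words, something along these lines (a sketch, not the exact code in open-h264.c; dst and src are placeholder frames):

#include <libavutil/frame.h>
#include <libavutil/hwcontext.h>

/* Map a decoded hardware (VAAPI) frame for the renderer. Passing 0 instead of
 * AV_HWFRAME_MAP_DIRECT lets libavutil fall back to a copy when a direct
 * mapping isn't supported. */
static int map_hw_frame(AVFrame *dst, const AVFrame *src)
{
	return av_hwframe_map(dst, src, 0 /* instead of AV_HWFRAME_MAP_DIRECT */);
}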

There are a few other things that can fail. Some trace logging from ffmpeg might help. You can try adding av_log_set_level(AV_LOG_TRACE) to open_h264_create. If that doesn't tell you anything useful, printing out error messages when errors occur in open-h264.c will at least tell you which step failed.
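For example (a sketch; the helper name here is illustrative, the real av_hwdevice_ctx_create call is the one quoted from open-h264.c above):

#include <stdio.h>
#include <libavutil/error.h>
#include <libavutil/hwcontext.h>
#include <libavutil/log.h>

/* Bump ffmpeg's log level and report why VAAPI device creation failed. */
static int create_vaapi_device(AVBufferRef **hwctx_ref, const char *device)
{
	av_log_set_level(AV_LOG_TRACE);

	int rc = av_hwdevice_ctx_create(hwctx_ref, AV_HWDEVICE_TYPE_VAAPI,
			device, NULL, 0);
	if (rc != 0)
		fprintf(stderr, "av_hwdevice_ctx_create(%s) failed: %s\n",
				device ? device : "(default)", av_err2str(rc));
	return rc;
}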

I enabled that logging; comparing a successful run against a black-screen run, the output looks completely identical. This leads me to think there are precisely two issues:

  1. The wrong render device is often chosen, which can be caused by the first card not being renderD128, or by the first card not being the active GPU.
  2. With problem 1 fixed by a hack, h264 appears to work fine and the window gets created with no errors, but the screen is entirely black. After fixing problem 1, software rendering works, which I believe narrows the remaining problem down to the EGL renderer.

I did also try messing with the fragment shaders, setting everything to magenta to see if it made any difference. it does not.

gl_FragColor = vec4(1.0,0.0,1.0,1.0);

this still results in the window being black.

Edit: when the window is black, intel_gpu_top shows "3D" utilization (and no "video" utilization, meaning the video decoder isn't being used) on my integrated card, so the iGPU is probably doing the OpenGL rendering when it shouldn't be.
Edit 2: and yes, the wrong GPU shows 3D utilization with the render node hack enabled as well. Without the render node hack, it doesn't even show a window; it just crashes.

I was able to mess around with the EGL setup enough to get it working on my machine, by copying some code from here: https://stackoverflow.com/a/66110209
Also note how EGL_PLATFORM_SURFACELESS_MESA was changed to EGL_PLATFORM_DEVICE_EXT; I'm not even sure what that does, but it's needed.
This is the complete diff of hacks that fixes everything for me:

diff --git a/src/main.c b/src/main.c
index 82b0688..31695b9 100644
--- a/src/main.c
+++ b/src/main.c
@@ -709,7 +709,7 @@ static int find_render_node(char *node, size_t maxlen) {
 	drmDevice *devices[64];
 
 	int n = drmGetDevices2(0, devices, sizeof(devices) / sizeof(devices[0]));
-	for (int i = 0; i < n; ++i) {
+	for (int i = 1; i < n; ++i) {
 		drmDevice *dev = devices[i];
 		if (!(dev->available_nodes & (1 << DRM_NODE_RENDER)))
 			continue;
diff --git a/src/open-h264.c b/src/open-h264.c
index 662210d..fcfe07a 100644
--- a/src/open-h264.c
+++ b/src/open-h264.c
@@ -94,7 +94,7 @@ static struct open_h264_context* open_h264_context_create(
 		goto failure;
 
 	if (av_hwdevice_ctx_create(&context->hwctx_ref, AV_HWDEVICE_TYPE_VAAPI,
-				NULL, NULL, 0) != 0)
+				"/dev/dri/renderD129", NULL, 0) != 0)
 		goto failure;
 
 	context->codec_ctx->hw_device_ctx = av_buffer_ref(context->hwctx_ref);
diff --git a/src/renderer-egl.c b/src/renderer-egl.c
index 125bbb4..7b9d2bc 100644
--- a/src/renderer-egl.c
+++ b/src/renderer-egl.c
@@ -177,8 +177,14 @@ int egl_init(void)
 	if (egl_load_egl_ext() < 0)
 		return -1;
 
-	egl_display = eglGetPlatformDisplayEXT(EGL_PLATFORM_SURFACELESS_MESA,
-			EGL_DEFAULT_DISPLAY, NULL);
+	EGLDeviceEXT eglDevs[32];
+	EGLint numDevices;
+	PFNEGLQUERYDEVICESEXTPROC eglQueryDevicesEXT = (PFNEGLQUERYDEVICESEXTPROC)
+	eglGetProcAddress("eglQueryDevicesEXT");
+	eglQueryDevicesEXT(32, eglDevs, &numDevices);
+
+	egl_display = eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT,
+			eglDevs[1], NULL);
 	if (egl_display == EGL_NO_DISPLAY)
 		return -1;

Very good.

A complete solution would allow selecting the render node via a command line argument and otherwise fall back to the result of find_render_node for all of the above.
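Something like this, as a sketch (the "-d" option letter and helper name are made up and would need to fit the existing option parsing in main.c):

#include <getopt.h>
#include <stdio.h>

/* Already exists in main.c (see the diff above); auto-detects a node. */
int find_render_node(char *node, size_t maxlen);

/* Sketch: "wlvncc -d /dev/dri/renderD129 <host>" overrides the automatic choice;
 * without -d we fall back to find_render_node() as before. */
static int choose_render_node(int argc, char *argv[], char *node, size_t maxlen)
{
	int opt;
	node[0] = '\0';

	while ((opt = getopt(argc, argv, "d:")) != -1)
		if (opt == 'd')
			snprintf(node, maxlen, "%s", optarg);

	if (node[0] == '\0')
		return find_render_node(node, maxlen);

	return 0;
}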

Do you want to make a PR for this?

I don't think I know enough about how all of this works to make a PR.
What I do know:
find_render_node can likely be automated: if the wrong node is chosen, it fails later down the line in on_wayland_event:

[destroyed object]: error 7: importing the supplied dmabufs failed
wlvncc: ../src/main.c:230: on_wayland_event: Assertion `rc == 0' failed.

I guess you could just pick a render node by trial and error this way. You'd probably want to catch this error earlier, though I'm not sure how.
The node selected in find_render_node also needs to be passed to open-h264.c for av_hwdevice_ctx_create().
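I haven't traced the plumbing, but conceptually it could be something like this (a sketch; open_h264_set_render_node is a made-up setter that main.c would call after picking a node, and the creation helper mirrors the existing call in open_h264_context_create):

#include <stdio.h>
#include <libavutil/hwcontext.h>

static char open_h264_render_node[256] = "";

/* Hypothetical: called from main.c with the node chosen by find_render_node
 * (or a command line override), before the first frame is decoded. */
void open_h264_set_render_node(const char *node)
{
	snprintf(open_h264_render_node, sizeof(open_h264_render_node), "%s", node);
}

/* Inside open-h264.c, instead of passing NULL and letting libavcodec guess: */
static int create_hw_device(AVBufferRef **hwctx_ref)
{
	const char *dev = open_h264_render_node[0] ? open_h264_render_node : NULL;
	return av_hwdevice_ctx_create(hwctx_ref, AV_HWDEVICE_TYPE_VAAPI,
			dev, NULL, 0);
}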

The hard part seems to be the EGL device. Even if the correct /dev/dri/renderD* node is chosen above, I have no idea how to use that information to select the correct EGL device. On my system, eglQueryDevicesEXT() reports 3 devices: one crashes, one works, and one gives the black screen. As far as I know, they can only be indexed by an integer, and the order doesn't necessarily match /dev/dri/renderD*.

So instead I tried trial and error, checking which EGL devices work.
For me, EGL_DEFAULT_DISPLAY becomes EGL_NO_DISPLAY, which is easy to catch, so I wrote some code to try the other EGL devices after that failure:

	egl_display = eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT,
			EGL_DEFAULT_DISPLAY, NULL);
	
	if (egl_display == EGL_NO_DISPLAY) {
		EGLDeviceEXT eglDevs[32];
		EGLint numDevices;
		PFNEGLQUERYDEVICESEXTPROC eglQueryDevicesEXT = (PFNEGLQUERYDEVICESEXTPROC)
		eglGetProcAddress("eglQueryDevicesEXT");
		eglQueryDevicesEXT(32, eglDevs, &numDevices);

		for (EGLint i = 0; i < numDevices; ++i) {
			egl_display = eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT,
				eglDevs[i], NULL);
			if (egl_display != EGL_NO_DISPLAY)
				break;
		}

		/* none of the enumerated devices gave a usable display */
		if (egl_display == EGL_NO_DISPLAY)
			return -1;
	}

The problem is, on my machine this selects eglDevs[0], which doesn't give an error message but leaves the screen completely black. If I manually select eglDevs[1], it works. But I don't know of any way to detect (in code) when there's a black screen, since everything functions without any error messages. I tried adding this (which I found online somewhere):

if (eglGetError() != EGL_SUCCESS)
		goto failure;

but I get EGL_SUCCESS even with the black screen.
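One thing that might avoid the guessing entirely, if the EGL_EXT_device_drm extension is available (I'm not sure it is on every driver): each EGL device can be asked which DRM device file it corresponds to, so it could be matched against the render node chosen earlier. A sketch:

#include <string.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

/* Find the EGLDeviceEXT whose DRM file matches the node we already picked,
 * e.g. "/dev/dri/renderD129". Needs EGL_EXT_device_enumeration,
 * EGL_EXT_device_query and EGL_EXT_device_drm. */
static EGLDeviceEXT find_egl_device_for_node(const char *node)
{
	PFNEGLQUERYDEVICESEXTPROC query_devices = (PFNEGLQUERYDEVICESEXTPROC)
		eglGetProcAddress("eglQueryDevicesEXT");
	PFNEGLQUERYDEVICESTRINGEXTPROC query_device_string = (PFNEGLQUERYDEVICESTRINGEXTPROC)
		eglGetProcAddress("eglQueryDeviceStringEXT");
	if (!query_devices || !query_device_string)
		return EGL_NO_DEVICE_EXT;

	EGLDeviceEXT devices[32];
	EGLint n = 0;
	if (!query_devices(32, devices, &n))
		return EGL_NO_DEVICE_EXT;

	for (EGLint i = 0; i < n; ++i) {
		const char *file = NULL;
#ifdef EGL_DRM_RENDER_NODE_FILE_EXT
		/* newer drivers expose the render node directly */
		file = query_device_string(devices[i], EGL_DRM_RENDER_NODE_FILE_EXT);
#endif
		if (!file)
			/* older path: usually the primary (card*) node */
			file = query_device_string(devices[i], EGL_DRM_DEVICE_FILE_EXT);
		if (file && strcmp(file, node) == 0)
			return devices[i];
	}

	return EGL_NO_DEVICE_EXT;
}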

The "correct" way is to use this to get the DRM render node: https://wayland.app/protocols/linux-dmabuf-unstable-v1#zwp_linux_dmabuf_feedback_v1:event:main_device

However, an intermediate step would be to allow the user to choose the render node. It's simple and doesn't require much work.