Mesh shader payload causes crashes, GPU resets and system freezes

Question

Firestar99 opened this issue 21 days ago

Description

Using the payload of Vulkan mesh shaders has been quite troublesome, both with RenderDoc and with some drivers. There seems to be disagreement on how the payload should be declared in SPIR-V: specifically, whether it must be a struct (similar to a buffer block) or can also be a plain uint.

I can't find anything specified about it in the SPIR-V spec.
The Vulkan spec states:

Task shader payloads can be declared in task and mesh shaders using the new taskPayloadSharedEXT storage qualifier as follows:

taskPayloadSharedEXT MyPayloadStruct {
	...
} payload;

Note the "can": it is never stated as a firm requirement. Also, this particular code does not actually compile with glslc due to a missing struct.

glslc happily compiles the payload as either a struct or a plain uint, and generates SPIR-V that matches the source code:

payload as struct:

taskPayloadSharedEXT struct Payload {
	uint id;
} payload;

%Payload = OpTypeStruct %uint
%_ptr_TaskPayloadWorkgroupEXT_Payload = OpTypePointer TaskPayloadWorkgroupEXT %Payload
%payload = OpVariable %_ptr_TaskPayloadWorkgroupEXT_Payload TaskPayloadWorkgroupEXT

payload as uint:

taskPayloadSharedEXT uint payload;

%_ptr_TaskPayloadWorkgroupEXT_uint = OpTypePointer TaskPayloadWorkgroupEXT %uint
%payload = OpVariable %_ptr_TaskPayloadWorkgroupEXT_uint TaskPayloadWorkgroupEXT
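
For context, here is a minimal sketch of how the struct variant could be written by the task stage and read by the mesh stage. It is not the code from my app: the TASK_STAGE macro is a hypothetical define used only to keep both stages in one listing, and the triangle is placeholder geometry. It assumes the GL_EXT_mesh_shader GLSL extension.

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

// Both stages declare the payload with the same type and name.
struct Payload {
	uint id;
};
taskPayloadSharedEXT Payload payload;

layout(local_size_x = 1) in;

#ifdef TASK_STAGE // hypothetical macro: define it when compiling the task stage

void main() {
	// The task stage writes the payload, then launches one mesh workgroup.
	payload.id = gl_WorkGroupID.x;
	EmitMeshTasksEXT(1, 1, 1);
}

#else // mesh stage

layout(triangles, max_vertices = 3, max_primitives = 1) out;

void main() {
	SetMeshOutputsEXT(3, 1);

	// The payload is read-only in the mesh stage.
	// Placeholder geometry: shift one triangle by the payload id.
	float x = float(payload.id) * 0.1;
	gl_MeshVerticesEXT[0].gl_Position = vec4(x - 0.5, -0.5, 0.0, 1.0);
	gl_MeshVerticesEXT[1].gl_Position = vec4(x + 0.5, -0.5, 0.0, 1.0);
	gl_MeshVerticesEXT[2].gl_Position = vec4(x, 0.5, 0.0, 1.0);
	gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2);
}

#endif
```

The uint variant would differ only in the declaration (taskPayloadSharedEXT uint payload;) and in accessing payload directly instead of payload.id.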

=> Thus I assume it's fine to declare the payload either as a struct or as a plain uint.

However, RenderDoc and a few drivers do not seem to agree with this assumption:

Driver	NO payload app	NO payload RenderDoc	struct payload app	struct payload RenderDoc	uint payload app	uint payload RenderDoc
Linux RADV 23.2.1	✓	✓	✓	GPU reset: ring comp_1.2.0 timeout, capture	✓	GPU reset: ring comp_1.2.0 timeout, capture
Linux RADV 24.0.7	✓	✓	✓	GPU reset: ring comp_1.0.1 timeout, capture	✓	GPU reset: ring comp_1.1.0 timeout, capture
Linux AMDVLK	??	??	✓	full system freeze when selecting draw, capture	✓	mesh viewer: Invalid task payload, likely generated by dxc bug, capture
Windows AMD	✓	✓	✓	RenderDoc crash, capture, dump	Driver segfault on pipeline creation	N/A
Windows Nvidia	✓	✓	✓	✓ capture	✓	mesh viewer: Invalid task payload, likely generated by dxc bug, capture

Explanations:

app: just starting my app, with or without RenderDoc attached
RenderDoc: loading a capture of my app in RenderDoc
NO payload: the payload between the task and mesh shaders was not declared or used in the shaders
struct payload: declared the payload as a struct { uint id; }, see above for code
uint payload: declared the payload as a uint, see above for code
mesh viewer: Invalid task payload, likely generated by dxc bug: That's what's displayed in the "vertex" table of RenderDoc's mesh viewer when inspecting a mesh shader draw.
GPU reset: ring comp_1.2.0 timeout: log from journalctl contains something like:

kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.2.0 timeout, signaled seq=7079, emitted seq=7080
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process qrenderdoc pid 11365 thread ReplayManager pid 11718

Could you also help me with where I should report the Windows AMD driver bug? Thanks :D

Steps to reproduce

Download any of the captures above
Load them in RenderDoc
Observe behaviour as specified in the RenderDoc columns

Environment

RenderDoc version: 1.32
Operating System: Windows / Linux, see table
Graphics API: Vulkan

Baldur Karlsson · Answer 1 · Sun May 12 2024 20:40:46 GMT+0800 (China Standard Time)

I don't think a non-struct payload is intended to work, especially since it is always a struct in HLSL/D3D12, where the feature originated, and since there is a VU requiring only one variable per entry point, which makes a non-struct a somewhat degenerate case. I'll see about getting clarification. The language you quoted is in the proposal document, not in the VUs, and I expect it's intended just to say that payloads are not required.

It looks like your radv version, 23.2.1, is very old. Have you tried testing on an up-to-date driver?

Firestar99 · Answer 2 · Mon May 13 2024 18:35:41 GMT+0800 (China Standard Time)

Updated my radv version to 24.0.7 (device), but I still get the very same error: capture_struct_payload capture_uint_payload

kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.0 timeout, signaled seq=30249, emitted seq=30250
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process qrenderdoc pid 26623 thread ReplayManager pid 26660

Seems like the comp rings can differ: previously I've only seen comp_1.2.0, but now I'm seeing comp_1.1.0 and comp_1.0.1. I don't know if that's significant, though.

Anyhow, I also completed the table with more testing results on Windows. Notably, I encountered a RenderDoc crash on AMD with payload structs when selecting the mesh shader draw cmd iirc; see capture and dump in the table.

Baldur Karlsson · Answer 3 · Mon May 13 2024 20:18:28 GMT+0800 (China Standard Time)

The struct capture works fine for me on radv 24.0.7.

I'm running quite a different GPU to you so this might be something GPU-specific, but I don't see any validation warnings on the mesh output fetch and since it works for me I think you would need to report this to mesa. As far as I'm aware RenderDoc's mesh shader support is working OK for other people on mesa so this may be a device-specific bug. The mesa folks will be better placed to diagnose the problem and report back to me if it's a RenderDoc bug after all.

I get a similarly successful result on amdvlk, so it may be something in common between the two.

The Windows AMD bug looks like a crash I have previously reported to them, though it may be different, so it may be worth reporting to them yourself.

Firestar99 · Answer 4 · Tue May 14 2024 17:19:21 GMT+0800 (China Standard Time)

Mesa bug report: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11156

~~When debugging with RADV_DEBUG=hang it interestingly states it's this pipeline, so most likely not even a RenderDoc bug [...]~~

Firestar99 · Answer 5 · Thu May 16 2024 22:24:50 GMT+0800 (China Standard Time)

For further investigation I got myself a clean system and managed to reproduce the bug there as well. However, some special conditions seem to be required for it to trigger. Could you please try to reproduce it again with these new repro instructions?

Open the RADV capture and observe RenderDoc working as expected
Install the AMDVLK deb package on your system
Delete /etc/vulkan/implicit_layer.d/amd_icd64.json to remove the VK_LAYER_AMD_switchable_graphics_64 implicit layer, which would otherwise force you to always use the amdvlk driver
Verify that vulkanCapsViewer can see both drivers: RADV as AMD Radeon Graphics (RADV REMBRANDT) and amdvlk as AMD Radeon Graphics (what stupid naming)
Open the same RADV capture again, but this time observe RenderDoc freezing, likely followed by a GPU reset or system freeze

My current conclusion is that an amdvlk device being available, even though it is unused, is enough to cause RenderDoc to freeze. I wanted to run that by you, in case it has something to do with RenderDoc's device selection, before going back to the RADV team.

Here's a log of "Open Capture with Options" with the RADV device explicitly selected and API validation turned on:
RenderDoc_2024.05.16_16.07.32.log

Baldur Karlsson · Answer 6 · Fri May 17 2024 01:14:51 GMT+0800 (China Standard Time)

There's no way I can see for just having a driver installed to cause a crash because of a RenderDoc bug. RenderDoc by default selects the closest matching physical device by hardware and driver. Overriding it is not recommended, but in either case the presence of other drivers won't cause a problem there. Unless you can find specific evidence of a RenderDoc bug, I don't think that is possible.

I see the mesa issue has been closed so I will close this one as well.