baldurk / renderdoc

RenderDoc is a stand-alone graphics debugging tool.

Home Page: https://renderdoc.org

Mesh shader payload causes crashes, GPU resets and system freezes

Firestar99 opened this issue

Description

Using the payload of Vulkan mesh shaders has been quite troublesome, both with RenderDoc and with some drivers. There seems to be real disagreement on how the payload should be declared in SPIR-V: specifically, whether it must be a struct (similar to a buffer block) or can also be a plain uint.

  1. In the SPIR-V spec I can't find anything specified about it.

  2. The Vulkan spec states:

Task shader payloads can be declared in task and mesh shaders using the new taskPayloadSharedEXT storage qualifier as follows:

taskPayloadSharedEXT MyPayloadStruct {
	...
} payload;

Note the "can": it is never stated as a firm requirement. Also, this particular code does not actually compile with glslc due to a missing struct.

  3. glslc happily compiles the payload as either a struct or a plain uint, and generates SPIR-V that matches the source code:

payload as struct:

taskPayloadSharedEXT struct Payload {
	uint id;
} payload;

%Payload = OpTypeStruct %uint
%_ptr_TaskPayloadWorkgroupEXT_Payload = OpTypePointer TaskPayloadWorkgroupEXT %Payload
%payload = OpVariable %_ptr_TaskPayloadWorkgroupEXT_Payload TaskPayloadWorkgroupEXT

payload as uint:

taskPayloadSharedEXT uint payload;

%_ptr_TaskPayloadWorkgroupEXT_uint = OpTypePointer TaskPayloadWorkgroupEXT %uint
%payload = OpVariable %_ptr_TaskPayloadWorkgroupEXT_uint TaskPayloadWorkgroupEXT

=> Thus I assume it's fine to declare payloads either as a struct or as a plain uint.
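To check which of the two forms a given shader module actually uses, the disassembly can be scanned for the payload variable and its pointee type. The following is only a minimal sketch: it assumes textual disassembly in the style shown above, and the function name `payload_pointee_kinds` is hypothetical, not part of any SPIR-V tool.

```python
import re

def payload_pointee_kinds(disassembly):
    """Map each TaskPayloadWorkgroupEXT variable to 'struct' or 'non-struct'.

    Assumes textual SPIR-V disassembly with named ids, as in the snippets above.
    """
    structs = set()   # ids declared with OpTypeStruct
    pointers = {}     # pointer type id -> pointee type id
    variables = {}    # variable id -> pointer type id
    for line in disassembly.splitlines():
        line = line.strip()
        m = re.match(r"(%\w+) = OpTypeStruct\b", line)
        if m:
            structs.add(m.group(1))
            continue
        m = re.match(r"(%\w+) = OpTypePointer TaskPayloadWorkgroupEXT (%\w+)", line)
        if m:
            pointers[m.group(1)] = m.group(2)
            continue
        m = re.match(r"(%\w+) = OpVariable (%\w+) TaskPayloadWorkgroupEXT", line)
        if m:
            variables[m.group(1)] = m.group(2)
    # A variable whose pointer type was not found is reported as 'non-struct'.
    return {
        var: "struct" if pointers.get(ptr) in structs else "non-struct"
        for var, ptr in variables.items()
    }
```

Run on the two disassemblies above, this would report the first payload as a struct and the second as a non-struct, which makes it easy to tell which case a capture exercises.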

However, RenderDoc and a few drivers do not seem to comply with this statement:

Columns: NO payload app | NO payload RenderDoc | struct payload app | struct payload RenderDoc | uint payload app | uint payload RenderDoc

Linux RADV 23.2.1: GPU reset: ring comp_1.2.0 timeout, capture; GPU reset: ring comp_1.2.0 timeout, capture
Linux RADV 24.0.7: GPU reset: ring comp_1.0.1 timeout, capture; GPU reset: ring comp_1.1.0 timeout, capture
Linux AMDVLK: ??; ??; full system freeze when selecting draw, capture; mesh viewer: Invalid task payload, likely generated by dxc bug, capture
Windows AMD: RenderDoc crash, capture, dump; Driver segfault on pipeline creation; N/A
Windows Nvidia: capture; mesh viewer: Invalid task payload, likely generated by dxc bug, capture

Explanations:

  • app: just starting my app, with or without RenderDoc attached
  • RenderDoc: loading a capture of my app in RenderDoc
  • NO payload: no payload between the task and mesh shaders was declared or used
  • struct payload: declared the payload as a struct { uint id; }, see above for code
  • uint payload: declared the payload as a uint, see above for code
  • mesh viewer: Invalid task payload, likely generated by dxc bug: That's what's displayed in the "vertex" table of RenderDoc's mesh viewer when inspecting a mesh shader draw.
  • GPU reset: ring comp_1.2.0 timeout: log from journalctl contains something like:
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.2.0 timeout, signaled seq=7079, emitted seq=7080
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process qrenderdoc pid 11365 thread ReplayManager pid 11718
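
When comparing several of these resets, it can help to pull the ring name and sequence numbers out of the kernel log mechanically. Below is a minimal sketch that assumes the message format matches the lines quoted above; the helper name `parse_ring_timeouts` is hypothetical.

```python
import re

# Matches the "ring <name> timeout" message format quoted in the log above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeouts(log_text):
    """Return (ring, signaled_seq, emitted_seq) for every timeout line found."""
    results = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            results.append(
                (m.group("ring"), int(m.group("signaled")), int(m.group("emitted")))
            )
    return results
```

Feeding it the output of journalctl for the affected boot would give one tuple per timeout, so differing rings across runs are easy to spot.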

Could you also help me with where I should report the Windows AMD driver bug? Thanks :D

Steps to reproduce

  1. Download any of the captures above
  2. Load them in RenderDoc
  3. Observe behaviour as specified in the RenderDoc columns

Environment

  • RenderDoc version: 1.32
  • Operating System: Windows / Linux, see table
  • Graphics API: Vulkan

I don't think a non-struct payload is intended to work, especially since it is always a struct in HLSL/D3D12, where the feature originated, and since there is a VU requiring only one variable per entry point, which makes a non-struct a somewhat degenerate case. I'll see about getting clarification. The language you quoted is in the proposal document, not the VUs, and I expect it's intended just to say that payloads are not required.

It looks like your radv version, 23.2.1, is very old. Have you tried testing on an up-to-date driver?

Updated my radv version to 24.0.7 (device), but I still get the very same error: capture_struct_payload, capture_uint_payload

kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.0 timeout, signaled seq=30249, emitted seq=30250
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process qrenderdoc pid 26623 thread ReplayManager pid 26660

Seems like the comp rings can differ: previously I've only seen comp_1.2.0, but now I'm seeing comp_1.1.0 and comp_1.0.1. I don't know if that's significant, though.

Anyhow, I also completed the table with more testing results on Windows. Notably encountered a RenderDoc crash on AMD with payload structs when selecting the mesh shader draw cmd iirc, see capture and dump in table.

The struct capture works fine for me on radv 24.0.7:

[image]

I'm running quite a different GPU to you so this might be something GPU-specific, but I don't see any validation warnings on the mesh output fetch and since it works for me I think you would need to report this to mesa. As far as I'm aware RenderDoc's mesh shader support is working OK for other people on mesa so this may be a device-specific bug. The mesa folks will be better placed to diagnose the problem and report back to me if it's a RenderDoc bug after all.

I get a similarly successful result on amdvlk, so it may be something in common between the two.

The Windows AMD bug looks like a crash I have previously reported to them, though it may be different, so it may be worth reporting to them yourself.

Mesa bug report: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11156

When debugging with RADV_DEBUG=hang it interestingly states it's this pipeline, so most likely not even a RenderDoc bug [...]

For further investigation I got myself a clean system and managed to reproduce the bug there as well. However, some special conditions seem to be required for it to trigger. Could you please try to reproduce it again with these new repro instructions?

  1. Open the RADV capture and observe RenderDoc working as expected
  2. Install the AMDVLK deb package on your system
  3. Delete /etc/vulkan/implicit_layer.d/amd_icd64.json to remove the VK_LAYER_AMD_switchable_graphics_64 implicit layer, which forces you to always use the amdvlk driver
  4. Verify that vulkanCapsViewer can see both drivers, RADV with AMD Radeon Graphics (RADV REMBRANDT) and amdvlk with AMD Radeon Graphics (what a stupid naming)
  5. Open the same RADV capture again, but this time observe RenderDoc freezing, likely followed by a GPU reset or system freeze
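
As a complement to step 4, the installed driver manifests can also be listed directly. This is only a rough sketch: the two directories are common Vulkan loader search locations on Linux and are an assumption here (the real search path depends on the distribution and on environment variables such as VK_DRIVER_FILES), and the function name `list_icd_manifests` is hypothetical.

```python
import json
from pathlib import Path

# Assumed default locations; the loader may search other directories too.
DEFAULT_ICD_DIRS = ("/etc/vulkan/icd.d", "/usr/share/vulkan/icd.d")

def list_icd_manifests(dirs=DEFAULT_ICD_DIRS):
    """Return (manifest_path, library_path) for each readable ICD manifest."""
    found = []
    for d in dirs:
        directory = Path(d)
        if not directory.is_dir():
            continue
        for manifest in sorted(directory.glob("*.json")):
            try:
                data = json.loads(manifest.read_text())
            except (OSError, ValueError):
                continue  # skip unreadable or malformed manifests
            icd = data.get("ICD", {}) if isinstance(data, dict) else {}
            found.append((str(manifest), icd.get("library_path")))
    return found
```

If both the RADV and the amdvlk manifests show up in the result, both drivers are at least registered with the loader, which is the precondition the steps above rely on.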

My current conclusion is that an amdvlk device being available, even though it is unused, is enough to cause RenderDoc to freeze. I wanted to run that by you, in case it has something to do with RenderDoc's device selection, before going back to the RADV team.

Here's a log of "Open Capture with Options" with the RADV device explicitly selected and API validation turned on:
RenderDoc_2024.05.16_16.07.32.log

There's no way I can see for just having a driver installed to cause a crash because of a RenderDoc bug. By default RenderDoc selects the closest matching physical device by hardware and driver. Overriding that is not recommended, but in either case the presence of other drivers won't cause a problem there. Unless you can find specific evidence of a RenderDoc bug, I don't think it's possible.

I see the mesa issue has been closed so I will close this one as well.