GPUOpen-Drivers / AMDVLK

AMD Open Source Driver For Vulkan

System freeze and/or crash in response to invalid sampler array access

amini-allight opened this issue

Hi,

I've discovered a kind of invalid shader operation that can have very disruptive effects. If you have an array of sampler2D objects and access it with an index past the end of the array, the system either freezes for a prolonged period or undergoes a full GPU reset with the kernel message [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!. The bad index usually has to be passed in from another shader or provided by a uniform; otherwise glslangValidator will catch the mistake at GLSL → SPIR-V conversion time. I don't believe this behavior violates any standard, since accessing past the end of a sampler2D array is undefined behavior, but it makes development unnecessarily difficult if your machine crashes whenever you make a mistake and it could cause ungraceful failure states in gaming applications as well.

I've created a program that demonstrates this issue, accessible on my GitLab here. Run it at your own risk: due to the nature of the bug being demonstrated, it may cause your system to crash.

  • This demo achieves the crash by having the vertex shader provide a bad index to the fragment shader, which then uses it to try to read texture data. If you change the index to a reasonable value like 0, the issue no longer occurs.
  • A partially bound array is used because those were the conditions under which I found this bug, but I am unsure whether it is necessary for the bug to occur.
  • This demo seems to only cause the freezing failure mode rather than the full GPU reset one I have observed previously. I do not understand why this is.

Platform Info:

  • The issue is present in AMDVLK version 2023.Q3.1 with Linux kernel version 6.4.12. Earlier versions are likely affected as well, since I discovered this a few months ago but haven't had time to report it until now.
  • This issue was detected on an RX 6900 XT. Other cards were not tested.
  • The issue is not present under RADV. With RADV the system freezes only briefly before the program crashes with VK_ERROR_DEVICE_LOST and the system recovers. Other platforms and runtimes were not tested.

Hi @amini-allight, I use the latest RADV driver, but I also observe the same hang.

Hi @amini-allight, under both the AMDVLK and RADV drivers the demo triggers a GPU hang, and the system soft-recovers after a while.
Judging from dmesg, the hangs under AMDVLK and RADV have the same cause.

Do you think I should report this as an issue with the kernel instead?

Hi @amini-allight, I don't think this is a driver (umd/kmd) issue.
In my humble opinion, you can't specify an invalid index that is out of the sampler2D object array's range. This is against the spec.

You think it's an inherent behavior of the hardware?

In my humble opinion, you can't specify an invalid index that is out of the sampler2D object array's range. This is against the spec.

Of course, I mentioned that in my original issue. Crashing in response to undefined behavior doesn't violate any specification but it does make development work on the platform significantly harder than it needs to be and that's a problem.

@amini-allight Yes, an invalid index causes undefined behavior according to the GLSL spec.

There is currently no extension that supports this kind of out-of-bounds checking, so a better approach is to do more checking at the application level or to propose a new extension.
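For readers wondering what an application-level check might look like, here is a minimal sketch in C++. It assumes the index originates on the host (for example, written into a uniform or push constant) and that the application tracks how many array elements it has actually bound. The names `checkedTextureIndex`, `textureIndexOrFallback`, and `boundCount` are hypothetical and are not part of AMDVLK or the Vulkan API.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not part of AMDVLK or Vulkan): validates a texture
// index on the host before it is written into a uniform or push constant.
// `boundCount` is assumed to be the number of contiguous array elements,
// starting at 0, that the application has actually bound.
std::optional<std::uint32_t> checkedTextureIndex(std::int64_t requested,
                                                 std::uint32_t boundCount) {
    if (requested < 0 || requested >= static_cast<std::int64_t>(boundCount)) {
        return std::nullopt;  // reject instead of handing a bad index to the GPU
    }
    return static_cast<std::uint32_t>(requested);
}

// Alternative policy: substitute a known-good slot (for example a placeholder
// texture assumed to be bound at index 0) instead of rejecting the request.
std::uint32_t textureIndexOrFallback(std::int64_t requested,
                                     std::uint32_t boundCount,
                                     std::uint32_t fallback = 0) {
    return checkedTextureIndex(requested, boundCount).value_or(fallback);
}
```

A check like this only covers indices that pass through host code. In the demo from the original report the bad index is produced by the vertex shader, so an equivalent range check would have to live in the shader itself.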

I already addressed this in my original issue: there is more to software than conformance to the specification.

and it could cause ungraceful failure states in gaming applications as well.

yea I’m here exactly for this xD

the "hanging issue" can crash your video output when you play video games, forcing you to restart your PC.