microsoft / DirectXShaderCompiler

This repo hosts the source for the DirectX Shader Compiler which is based on LLVM/Clang.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Incorrect control flow modifications in ScopeNestedCFG pass cause WARP to have wrong shader behavior

jenatali opened this issue · comments

Description
A customer reported incorrect results when running a shader on WARP. Arbitrary tweaks to the source HLSL cause DXC to produce drastically different control flow (due to fewer static analysis opportunities), which can work around the problem. The root cause of WARP's incorrect behavior is due to a bug introduced by the ScopeNestedCFG pass that WARP uses to structurize the CFG before translating into its own IR.

Steps to Reproduce
I was given the attached PIX capture. That PIX capture contains HLSL sources (if you have issues viewing source from the capture, I can extract it and upload it separately). Running that PIX capture on a debug version of WARP, I was able to extract the IR before and after the ScopeNestedCFG passes. See before and after. Note that the before has already had the CFG simplification, loop simplification, and createDemoteRegisterToMemoryHlslPass (including breaking critical edges) passes run on it. The original is also available: original.

The key parts of these dumps to focus on are the end of the second nested loop:

Original:

; <label>:70                                      ; preds = %52, %51
  %71 = phi i32 [ %48, %51 ], [ %46, %52 ]
  %72 = getelementptr inbounds [32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@1@A.3.1", i32 0, i32 %48
  %73 = load i32, i32* %72, align 4, !tbaa !10
  %74 = icmp eq i32 %73, 2
  br i1 %74, label %76, label %75

; <label>:75                                      ; preds = %70
  store i32 4, i32* %72, align 4, !tbaa !10
  br label %63

; <label>:76                                      ; preds = %70
  store i32 3, i32* %72, align 4, !tbaa !10
  store i32 2, i32* getelementptr inbounds ([32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@1@A.3.1", i32 0, i32 2), align 4, !tbaa !10
  %77 = add nsw i32 %49, 1
  br label %45

Note that %72 is produced in one block, and then used in both the if/else blocks that succeed it. The IR prior to scope nested CFG looks similar:

; <label>:71                                      ; preds = %._crit_edge.9, %51
  %72 = load i32, i32* %.reg2mem87
  store i32 %72, i32* %.reg2mem21
  %.reload37 = load i32, i32* %.reg2mem34
  %73 = getelementptr inbounds [32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@1@A.3.1", i32 0, i32 %.reload37
  %74 = load i32, i32* %73, align 4, !tbaa !10
  %75 = icmp eq i32 %74, 2
  br i1 %75, label %77, label %76

; <label>:76                                      ; preds = %71
  store i32 4, i32* %73, align 4, !tbaa !10
  %.reload23 = load i32, i32* %.reg2mem21
  %.reload35 = load i32, i32* %.reg2mem34
  store i32 %.reload35, i32* %.reg2mem88
  store i32 %.reload23, i32* %.reg2mem89
  br label %64

; <label>:77                                      ; preds = %71
  store i32 3, i32* %73, align 4, !tbaa !10
  store i32 2, i32* getelementptr inbounds ([32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@1@A.3.1", i32 0, i32 2), align 4, !tbaa !10
  %.reload32 = load i32, i32* %.reg2mem30
  %78 = add nsw i32 %.reload32, 1
  store i32 %78, i32* %.reg2mem19
  %.reload20 = load i32, i32* %.reg2mem19
  %.reload22 = load i32, i32* %.reg2mem21
  store i32 %.reload20, i32* %.reg2mem91
  store i32 2, i32* %.reg2mem92
  store i32 2, i32* %.reg2mem93
  store i32 %.reload22, i32* %.reg2mem94
  br label %45

The pointer value is now %73 but the structure is the same.

The problem arises from the scope nested CFG pass, which starts to duplicate some of these basic blocks so that they can be appropriately nested inside each other for control flow reasons. It produces the following blocks:

; <label>:121                                     ; preds = %._crit_edge.9
  %122 = load i32, i32* %.reg2mem87
  store i32 %122, i32* %.reg2mem21
  %.reload37 = load i32, i32* %.reg2mem34
  %123 = getelementptr inbounds [32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@1@A.3.1", i32 0, i32 %.reload37
  %124 = load i32, i32* %123, align 4, !tbaa !12
  %125 = icmp eq i32 %124, 2
  br i1 %125, label %132, label %dx.LoopExitHelper.123, !dx.BranchKind !18

; <label>:126                                     ; preds = %dx.EndIfScope.142
  %127 = load i32, i32* %.reg2mem87
  store i32 %127, i32* %.reg2mem21
  %.reload37.143 = load i32, i32* %.reg2mem34
  %128 = getelementptr inbounds [32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@1@A.3.1", i32 0, i32 %.reload37.143
  %129 = load i32, i32* %128, align 4, !tbaa !12
  %130 = icmp eq i32 %129, 2
  br i1 %130, label %134, label %dx.LoopExitHelper.123.148, !dx.BranchKind !18

; <label>:131                                     ; preds = %dx.LoopExitHelperIf.125
  store i32 4, i32* %123, align 4, !tbaa !12
  %.reload23 = load i32, i32* %.reg2mem21
  %.reload35 = load i32, i32* %.reg2mem34
  store i32 %.reload35, i32* %.reg2mem88
  store i32 %.reload23, i32* %.reg2mem89
  br label %dx.EndIfScope.179

; <label>:132                                     ; preds = %121
  store i32 3, i32* %123, align 4, !tbaa !12
  store i32 2, i32* getelementptr inbounds ([32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@1@A.3.1", i32 0, i32 2), align 4, !tbaa !12
  %.reload32 = load i32, i32* %.reg2mem30
  %133 = add nsw i32 %.reload32, 1
  store i32 %133, i32* %.reg2mem19
  %.reload20 = load i32, i32* %.reg2mem19
  %.reload22 = load i32, i32* %.reg2mem21
  store i32 %.reload20, i32* %.reg2mem91
  store i32 2, i32* %.reg2mem92
  store i32 2, i32* %.reg2mem93
  store i32 %.reload22, i32* %.reg2mem94
  br label %dx.LoopContinue.132

; <label>:134                                     ; preds = %126
  store i32 3, i32* %128, align 4, !tbaa !12
  store i32 2, i32* getelementptr inbounds ([32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@1@A.3.1", i32 0, i32 2), align 4, !tbaa !12
  %.reload32.144 = load i32, i32* %.reg2mem30
  %135 = add nsw i32 %.reload32.144, 1
  store i32 %135, i32* %.reg2mem19
  %.reload20.145 = load i32, i32* %.reg2mem19
  %.reload22.146 = load i32, i32* %.reg2mem21
  store i32 %.reload20.145, i32* %.reg2mem91
  store i32 2, i32* %.reg2mem92
  store i32 2, i32* %.reg2mem93
  store i32 %.reload22.146, i32* %.reg2mem94
  br label %dx.LoopContinue.132.147

The whole CFG is too complicated to paste here, but you'll note that now there are two pointers, %123 produced in block 121, and %128 produced in block 126. There are also now 3 consumers, in blocks 131, 132, and 134, two of which are consuming %123 and one consuming %128. The problem is that there exists a path through the control flow which enters block 126 producing %128, and then block 131 consuming %123, but %123 wasn't produced.

The result is that WARP ends up writing the value 4 to the wrong location in memory, and when the logic in the outer loop attempts to read it, it reads the wrong value, resulting in the outer loop terminating too early.

And just for my own record before I lose context on this, the path through the final CFG that's broken is:

  • 126
  • dx.LoopExitHelper.123.148
  • dx.LoopExit.118
  • dx.LoopExitHelperIf.131
  • dx.LoopExitHelperIf.129
  • dx.LoopExitHelperIf.127
  • dx.LoopExitHelperIf.125
  • 131

@jenatali - we're not going to schedule time to investigate/fix this issue, but we would happily review and accept a PR addressing it (which is why it is on the dormant milestone).