AcademySoftwareFoundation / OpenShadingLanguage

Advanced shading language for production GI renderers

Crash assigning boolean expression to `int` in batched mode

johnhaddon opened this issue · comments

Problem

OSL crashes when running expressions of the form int = float == float in batched mode, but not in non-batched mode. This was originally reported as a Gaffer bug in GafferHQ/gaffer#5430, but is reproducible using testshade alone (see below). The original bug report suggests the crash may occur only on certain CPUs. We've seen crashes on Intel(R) Xeon(R) W-2145 and Intel(R) Xeon(R) Silver 4216 CPU.

Steps to Reproduce

  1. Compile test.osl (see below) with oslc test.osl
  2. Run testshade --batched --res 128 128 -o result testout.txt test
  3. Observe stack trace (see below)
test.osl
shader test(
        float a1 = u,
        float a2 = v,
        output int result = 0
)
{
        result = a1 == a2;
}
Stack trace
new_val type=<16 x i1> dest_ptr type=<16 x i32>*
/home/john/dev/gafferDependencies/OpenShadingLanguage/working/OpenShadingLanguage-1.12.9.0/src/liboslexec/batched_backendllvm.cpp:1234: llvm_store_value: Assertion 'll.type_ptr(ll.llvm_typeof(new_val)) == ll.llvm_typeof(dst_ptr)' failed.
/home/john/dev/gafferDependencies/OpenShadingLanguage/working/OpenShadingLanguage-1.12.9.0/src/liboslexec/llvm_util.cpp:2958: native_to_llvm_mask: Assertion 'native_mask->getType() == type_native_mask()' failed.
Invalid operands for select instruction!
 %31 = select <16 x i1> %10, <16 x i1> %26, <16 x i32> %30, !dbg !12
Stored value type does not match pointer operand type!
 store <16 x i1> %31, <16 x i32>* %29, align 16, !dbg !12
<16 x i32>LLVM ERROR: Broken module found, compilation aborted!
0# OpenImageIO_v2_4::Sysutil::stacktrace() in /home/john/dev/build/gaffer-1.3/lib/libOpenImageIO_Util.so.2.4
1# 0x00007F9C23E6A79B in /home/john/dev/build/gaffer-1.3/lib/libOpenImageIO_Util.so.2.4
2# 0x00007F9C1BA2E400 in /lib64/libc.so.6
3# gsignal in /lib64/libc.so.6
4# abort in /lib64/libc.so.6
5# 0x00007F9C216680E5 in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
6# 0x00007F9C21668208 in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
7# 0x00007F9C2130F54F in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
8# 0x00007F9C212C1E97 in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
9# 0x00007F9C2141D26B in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
10# OSL_v1_12::pvt::LLVM_Util::prune_and_internalize_module(std::unordered_set<llvm::Function*, std::hash<llvm::Function*>, std::equal_to<llvm::Function*>, std::allocator<llvm::Function*> >, OSL_v1_12::pvt::LLVM_Util::Linkage, std::string*) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
11# OSL_v1_12::pvt::BatchedBackendLLVM::run() in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
12# OSL_v1_12::pvt::ShadingSystemImpl::Batched<16>::jit_group(OSL_v1_12::ShaderGroup&, OSL_v1_12::ShadingContext*) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
13# OSL_v1_12::ShadingContext::Batched<16>::execute_init(OSL_v1_12::ShaderGroup&, int, OSL_v1_12::Wide<int const, 16>, OSL_v1_12::BatchedShaderGlobals<16>&, void*, void*, bool) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
14# OSL_v1_12::ShadingContext::Batched<16>::execute(OSL_v1_12::ShaderGroup&, int, OSL_v1_12::Wide<int const, 16>, OSL_v1_12::BatchedShaderGlobals<16>&, void*, void*, bool) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
15# OSL_v1_12::ShadingSystem::BatchedExecutor<16>::execute(OSL_v1_12::ShadingContext&, OSL_v1_12::ShaderGroup&, int, OSL_v1_12::Wide<int const, 16>, OSL_v1_12::BatchedShaderGlobals<16>&, void*, void*, bool) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12

Versions

  • OSL branch/version: 1.12.9
  • OS: CentOS 7
  • C++ compiler: GCC 9.3.1
  • LLVM version: 11.1.0
  • OIIO version: 2.4.11

Can you please check this out, @AlexMWells?

Duped:

testshade --batched --res 128 128 -o result testout.txt test

Output result to testout.txt
/nfs/site/home/amwells/OSL_Dev/github/OpenShadingLanguage/src/liboslexec/llvm_util.cpp:4511: op_scatter: Assertion 'wide_val->getType() == type_wide_int()' failed.
testshade: /nfs/pdx/home/amwells/Pixar/OSL/llvm-11.1.0.src/lib/IR/Instructions.cpp:1423: void llvm::StoreInst::AssertOK(): Assertion `getOperand(0)->getType() == cast<PointerType>(getOperand(1)->getType())->getElementType() && "Ptr must be a pointer to Val type!"' failed.

BatchedAnalysis is responsible for determining which symbols can be forced to be represented as llvm booleans internally.
This keeps vectors of boolean results inside mask registers instead of expanding/compressing them back and forth to an array of integers on the stack.

BatchedAnalysis identifies all symbols written to by operations whose result will always logically be boolean.
The result of op_eq is always logically bool, so it is tracked in m_symbols_logically_bool.
When the result of any operation could NOT always be logically boolean, that symbol is tracked in m_symbols_disqualified_from_bool.

Afterwards any m_symbols_logically_bool that don't exist in m_symbols_disqualified_from_bool are marked as Symbol::forced_llvm_bool(true)
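
To illustrate the two-set bookkeeping described above, here is a minimal standalone sketch. It is not the OSL implementation: `Sym` is a hypothetical stand-in for OSL's Symbol class and `force_bools` is a made-up helper name; only the set-difference idea is taken from the description.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical stand-in for OSL's Symbol; only the fields needed here.
struct Sym {
    std::string name;
    bool forced_llvm_bool = false;
};

// Simplified model of the analysis: a symbol is forced to bool only if it
// is in `logically_bool` and was never disqualified by an operation that
// could write a non-boolean value to it.
inline void force_bools(const std::unordered_set<Sym*>& logically_bool,
                        const std::unordered_set<Sym*>& disqualified)
{
    for (Sym* s : logically_bool) {
        if (disqualified.count(s) == 0)
            s->forced_llvm_bool = true;
    }
}
```

In this model, a symbol that is only ever written by a comparison ends up forced to bool, while one that is also written by, say, an integer add would be placed in the disqualified set and left alone.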

In this case,
result = a1 == a2;
the op_eq(a1,a2) result is always logically boolean, so the symbol "result" is tracked in m_symbols_logically_bool.
No other operations use the symbol "result", so nothing disqualified it and it ended up forced to boolean.

Adding #define OSL_DEV to batched_analysis.cpp dumps the analysis results:

Emit Symbols forced to llvm bool
--->0x31ba9b0 result is forced_llvm_bool
done with Symbols forced to llvm bool

At the end of shader execution, the final step is the output placement copy, where symbol values are copied out to memory. That is where we end up loading a vector of 16 bools (the forced_llvm_bool result) and trying to store it as 16 integers, and that is where it crashes.
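
The type mismatch can be pictured with a scalar model of the missing conversion. The sketch below is only an illustration, not how OSL emits LLVM IR: `kWidth` and `expand_mask_to_ints` are hypothetical names, and a `uint16_t` bitmask stands in for the `<16 x i1>` value that would have to be widened to `<16 x i32>` before it can be stored to an int output.

```cpp
#include <array>
#include <cstdint>

// Batch width, matching the <16 x ...> vector types in the trace.
constexpr int kWidth = 16;

// Expand a per-lane boolean mask (one bit per lane, like <16 x i1>) into
// one 32-bit int per lane (like <16 x i32>), so that it could be stored
// into an int output buffer.
inline std::array<int32_t, kWidth> expand_mask_to_ints(uint16_t mask)
{
    std::array<int32_t, kWidth> lanes{};
    for (int i = 0; i < kWidth; ++i)
        lanes[i] = (mask >> i) & 1;  // 1 where the lane's bit is set, else 0
    return lanes;
}
```

Storing the mask directly, without a step like this, corresponds to the "Stored value type does not match pointer operand type!" error in the trace.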

Original intent was for shader outputs connected between shader layers to be left as boolean if they could be.
But nothing was disqualifying renderer outputs from being forced to be boolean.

Modifying BatchedAnalysis::establish_symbols_forced_llvm_bool to exclude renderer outputs should fix the issue.

    void establish_symbols_forced_llvm_bool()
    {
        for (Symbol* logical_bool_sym : m_symbols_logically_bool) {
            if (m_symbols_disqualified_from_bool.find(logical_bool_sym)
                == m_symbols_disqualified_from_bool.end()) {
                // Do not allow outputs that will be pulled by the renderer to 
                // be forced to boolean as it would complicate accessing or 
                // copying them out.  This could be relaxed if only OSL 
                // controlled copy placement were allowed as it "could" handle 
                // the required bool->int expansion.
                if ((logical_bool_sym->symtype() == SymTypeOutputParam) && 
                    logical_bool_sym->renderer_output())
                    continue;
                logical_bool_sym->forced_llvm_bool(true);
            }
        }
    }

With this change, my local test.osl doesn't crash and now reports that no symbols are forced to llvm bool:

Emit Symbols forced to llvm bool
done with Symbols forced to llvm bool

If you could, please try this change and see if it resolves your issue.

Put the fix into a PR
#1717

Thanks for the speedy response! I can confirm that the PR fixes the exact testshade crash I reported here, but unfortunately it doesn't fix the original problem we distilled it from. Here's a slightly less distilled version that still crashes for me :

testshade --batched --res 128 128 -layer testLayer test --layer constantLayer constant --connect testLayer result constantLayer x

And here's the source for the constant shader used in the new test :

surface constant
(
	float x = 1
)
{
	Ci = x * emission();
}

And the stack trace :

new_val type=<16 x i1> dest_ptr type=<16 x i32>*
/home/john/dev/gafferDependencies/OpenShadingLanguage/working/OpenShadingLanguage-1.12.9.0/src/liboslexec/batched_backendllvm.cpp:1234: llvm_store_value: Assertion 'll.type_ptr(ll.llvm_typeof(new_val)) == ll.llvm_typeof(dst_ptr)' failed.
/home/john/dev/gafferDependencies/OpenShadingLanguage/working/OpenShadingLanguage-1.12.9.0/src/liboslexec/llvm_util.cpp:2958: native_to_llvm_mask: Assertion 'native_mask->getType() == type_native_mask()' failed.
/home/john/dev/gafferDependencies/OpenShadingLanguage/working/OpenShadingLanguage-1.12.9.0/src/liboslexec/llvm_util.cpp:5567: op_bool_to_float: Assertion '0 && "Op has bad value type combination"' failed.
 0# OpenImageIO_v2_4::Sysutil::stacktrace() in /home/john/dev/build/gaffer-1.3/lib/libOpenImageIO_Util.so.2.4
 1# 0x00007F070680B79B in /home/john/dev/build/gaffer-1.3/lib/libOpenImageIO_Util.so.2.4
 2# 0x00007F06FE0A0400 in /lib64/libc.so.6
 3# OSL_v1_12::pvt::LLVM_Util::op_int_to_float(llvm::Value*) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
 4# OSL_v1_12::pvt::BatchedBackendLLVM::llvm_load_value(llvm::Value*, OSL_v1_12::pvt::TypeSpec const&, int, llvm::Value*, int, OpenImageIO_v2_4::TypeDesc, bool, bool, bool) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
 5# OSL_v1_12::pvt::BatchedBackendLLVM::llvm_load_value(OSL_v1_12::pvt::Symbol const&, int, llvm::Value*, int, OpenImageIO_v2_4::TypeDesc, bool, bool) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
 6# OSL_v1_12::pvt::BatchedBackendLLVM::llvm_assign_impl(OSL_v1_12::pvt::Symbol const&, OSL_v1_12::pvt::Symbol const&, int, int, int) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
 7# OSL_v1_12::pvt::BatchedBackendLLVM::build_llvm_instance(bool) in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12
 8# OSL_v1_12::pvt::BatchedBackendLLVM::run() in /home/john/dev/build/gaffer-1.3/lib/liboslexec.so.1.12

Duped that issue. The key was to build and run for AVX512, which uses those native masks. I'll look into it.

@johnhaddon
I'm working on a more comprehensive solution, but in the meantime we can just disallow any params from being forced to be boolean in llvm.
Can you try this (to get you unblocked):

    void establish_symbols_forced_llvm_bool()
    {
        for (Symbol* logical_bool_sym : m_symbols_logically_bool) {
            if (m_symbols_disqualified_from_bool.find(logical_bool_sym)
                == m_symbols_disqualified_from_bool.end()) {
                // Do not allow params to be forced to boolean, as they are
                // stored in the GroupData structure and supporting native
                // llvm booleans there is a bit more work to do.
                if ((logical_bool_sym->symtype() == SymTypeOutputParam)
                    || (logical_bool_sym->symtype() == SymTypeParam))
                    continue;
                logical_bool_sym->forced_llvm_bool(true);
            }
        }
    }

> I'm working on a more comprehensive solution

Thanks Alex, that's great to know!

> Can you try this (to get you unblocked):

We've worked around the issue by modifying shaders for now (and crossing our fingers that nobody enters equivalent code in the various places Gaffer accepts OSL source code directly). So our plan for now at least is to sit tight and wait for the comprehensive solution. Thanks for suggesting the short-term workaround though.

#1717 has been updated to a comprehensive solution that rearchitects BatchedAnalysis::establish_symbols_forced_llvm_bool and fixes some other issues as well (read its comments for a full description).

Please try it and let us know if it fixes the issue(s).