AcademySoftwareFoundation / OpenShadingLanguage

Advanced shading language for production GI renderers

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Crash creating and using OSLCompiler instances in mutliple threads

brechtvl opened this issue · comments

Problem

In the Cycles renderer we are getting crashes doing different renders in different threads, each creating their own OSLCompiler instances. From #654 it appears that this is supposed to work.

I think I understand the cause and am willing to contribute a fix, but looking for advice on the best way to solve this.

What happens is that OSLCompiler ends up calling this in its destructor:

void
SymbolTable::delete_syms()
{
    for (auto& sym : m_allsyms)
        delete sym;
    m_allsyms.clear();
    TypeSpec::struct_list().clear();
}

This TypeSpec::struct_list() is a global variable used by OSLCompiler, and also ShadingSystem. So fully clearing it here is not safe.

I can think of a few solutions:

  1. Never clear this list and accept the memory usage. ShadingSystem does not clear it as far as I can tell, so you already have this when not compiling shaders.
  2. Add reference counting to only clear this list when there are zero classes using it. This also involves adding mutex protection in a bunch of places.
  3. Add a struct list per OSLCompiler and ShadingSystem, instead of making it global.

Solution (3) seems like the best to me, does that seem reasonable? Or is there some performance reason that this should be reused.

Versions

  • OSL branch/version: master branch, 6b6637e
  • OS: Ubuntu Linux
  • C++ compiler: Clang

1 - seems reasonable to me; I don't imagine that there are enough unique struct definitions to worry about the memory "leak."

2 - seems like unnecessary work, and TypeSpec only stores the index (as a uint16!), not a pointer, so that would have to change to accommodate a ref-counted pointer.

3 - the problem here is that it's referenced by the TypeSpec (by index, as I said), and that's the whole reason why the struct table is a global singleton -- because the TypeSpec doesn't have a back-pointer to which OSLCompiler or ShadingSystem it belongs to. Having each TypeSpec be specific to one instance of a compiler or shading system doesn't seem fatal, but it does seem potentially error-prone for a renderer that's juggling multiple compilers, because the StructSpec that it refers to also contains a scope ID, which is specific to the compiler, ick.

I would be inclined to favor choice (1) as the short term solution to your dilemma and just get you unstuck. There's no down side except for a very small amount of leaked memory, I think. And then longer term, think about how we can disentangle it all without needing either ref-counted pointers or back-pointers to shading systems.

But before we move on...

Can you explain a bit about WHY you are using different compilers? I wonder if the correct long-term solution is just to fix OSLCompiler to be re-entrant for whatever operation you need to be doing in parallel.

We have a compiler as part of rendering, and multiple renders might be running in parallel (for example a viewport render, and a small preview render for a material). Within each render we might also be compiling multiple shaders in parallel. We also have a button to compile a shader from the text editor. Maybe in the future we use OSL for texture patterns for other parts of the application, like texture painting. Those might all run in parallel.

We're using multiple OSL compilers mainly because the code is simpler that way. We could have some mechanism share an OSLCompiler globally across the application, but it's extra work. The other reason is that OSLCompiler::compile is not re-entrant so it didn't work anyway.

We've been using multiple instances ofOSLCompiler and ShadingSystem in parallel for a long time, and it has been stable as far as I know. I think we just have not hit cases where multiple shaders are using structs much.

Making OSLCompiler re-entrant seems fine. But from my point of view as a user of the OSL API, it's easier to just create an OSLCompiler instance on the stack whenever I need it. So it's not clear to me that's actually worth spending time on.

I'm fine contributing a patch for (1).

I'm rereading this all now, and I'm worried about solution (1) because it would also need additional mutex operations to make that singleton thread-safe, and that could be complicated.

Now I'm tempted to favor (3): make the struct_list be part of the SymbolTable (owned by the compiler) so it's owned and used by one and only one compiler at a time. So TypeSpec::structspec() would need a second parameter, a reference to the SymbolTable. That seems logical, I'm just not sure I know all the places that will in turn need to be modified. Maybe it's not so bad?

I'm tempted to give it a try and see how far I get. Unless somebody else has time and wants to try it.

Update: I've been trying this and boy is it tricky. I've done like 3 rewrites so far and keep running into corners that are hard to back out of. It's kind of amazing how much rests on the assumption of a single structure table. But I'm still working at it.

You can also not compile twice with the same OSLCompiler object.
It seems like anything in an #include gets included twice when you do?
Do you want me to make a separate GitHub issue about this, or can we keep it here @lgritz ?

Update: I've been trying this and boy is it tricky. I've done like 3 rewrites so far and keep running into corners that are hard to back out of. It's kind of amazing how much rests on the assumption of a single structure table. But I'm still working at it.

We tested the 1 fix suggested by Brecht and it "seems to work". Isn't that the simplest?

Also, I am wondering, all these talk about a global struct list, does it mean that you can't have two different structs with the same name? :)