VirtualGL / virtualgl

Main VirtualGL repository

Home Page:https://VirtualGL.org


safeExit() is not safe

tsondergaard opened this issue · comments

safeExit() in server/faker.cpp can call std::exit(). Calling std::exit() is only safe in single-threaded applications, and VirtualGL can be, and almost certainly typically is, used in multi-threaded applications. The reason std::exit() is unsafe to call in a multi-threaded application is that it destroys objects with static storage duration before the process terminates. Other threads can still be executing while this happens, and if they touch static-storage-duration data that has been, or is being, destroyed, a crash can easily occur.

safeExit() looks like this:

void safeExit(int retcode)
{
	bool shutdown;

	globalMutex.lock(false);
	shutdown = deadYet;
	if(!deadYet)
	{
		// First thread to get here cleans up VGL's global resources ...
		deadYet = true;
		cleanup();
		fconfig_deleteinstance();
	}
	globalMutex.unlock(false);
	// ... and calls exit(); subsequent threads just terminate themselves.
	if(!shutdown) exit(retcode);
	else pthread_exit(0);
}

std::exit() should never be used. I think the options are std::quick_exit(), to exit reliably with an error code, or abort(), to get a crash dump with a relevant stack.

commented

Bearing in mind that many of these hacks in VGL are nearly 20 years old and were necessitated by supporting certain legacy platforms (<cough> Solaris 8) that we no longer need to support, I am open to modernizing how we handle global object creation/destruction. In fact, I have it on my to-do list to revisit all of that stuff.

However, playing devil's advocate, why is the specific implementation of safeExit() unsafe? The first thing it does is lock a global mutex, then examine the state of the global deadYet boolean. If VGL is shutting down unexpectedly, due to an error, then that boolean will be false. That means that the first thread to call safeExit() will lock the mutex, set deadYet to true, clean up all VGL-allocated global resources, then call exit(). VGL is written such that all of the entry points (interposed functions) and thread functions check the value of deadYet before trying to access any of the global resources. Thus, once the first thread sets deadYet and cleans up the global resources, the rest of the threads will immediately stop doing VGL stuff and will call safeExit() as soon as possible, and since deadYet is true at that point, safeExit() will call pthread_exit() for those threads. If there is a specific circumstance under which that does not happen, then that is a legitimate bug.

quick_exit() is only available in C++11 and later, so that's a non-starter, but any other suggestions are appreciated. Modern versions of VGL try their best to act like a legitimate GLX implementation, communicating non-recoverable GLX errors (such as bad arguments, etc.) using the X11 error handling mechanism. That mechanism will call exit() unless the 3D application has installed its own X11 error handler. The 3D application may also call exit() on its own. Thus, even if safeExit() never calls exit(), there are cases in which we can't prevent exit() from being called. VGL ideally needs to safely shut down all of its global resources any time that happens. At this point, we're only supporting ELF systems, so we could conceivably use DSO constructors/destructors to manage this. (Such wasn't possible 20 years ago.) However, I don't know if that would be any safer. There are still some circumstances whereby VGL will need to shut down both itself and the 3D application while multiple threads are in flight.

One of the reasons why VGL's global resource management is such a mess is MainWin, a Windows-on-Linux emulation layer used by certain proprietary CAD applications. MainWin uses DSO constructors/destructors and calls X11 functions from its destructor, which is called after the destructor for the VGL faker DSO is called. That's why GlobalCleanup exists in the VGL faker. It ensures that, after the VGL faker's destructor is called, deadYet will be true, and after that point, VGL will immediately hand off any interposed function calls to the underlying libraries. MainWin is also why DeferredCS exists.

However, playing devil's advocate, why is the specific implementation of safeExit() unsafe? The first thing it does is lock a global mutex, then examine the state of the global deadYet boolean. If VGL is shutting down unexpectedly, due to an error, then that boolean will be false. That means that the first thread to call safeExit() will lock the mutex, set deadYet to true, clean up all VGL-allocated global resources, then call exit().

That just means that VGL itself will be fine. Things don't go wrong until exit() is called. When that happens, global objects with static storage duration are destroyed. Let's say a thread has been created and its lifetime is tied to an object on the stack. Since the stack is not unwound, that thread will remain alive while global objects are being destroyed. Calling exit() in a multi-threaded program is not generally safe. It is only safe if no thread touches globals or objects whose lifetimes are managed by objects with static storage duration.

quick_exit() is only available in C++11 and later, so that's a non-starter, but any other suggestions are appreciated.

std::quick_exit() is nearly the same as _exit() from &lt;cstdlib&gt;, so perhaps that can be used instead.

commented

So, just to clarify, you're saying that the danger is that the calling program (the 3D application) could create global resources, and those resources won't be cleaned up properly if VirtualGL calls exit()? _exit() only seems to differ from exit() in that it doesn't call any functions registered with atexit() or on_exit(), nor does it flush stdio streams. Presumably the global resources are eventually cleaned up between the time that _exit() or exit() is called and the process terminates, so I need to understand how the cleanup differs between the two functions and why the cleanup of those global resources is safer with _exit().

Hypothetically, VGL could always call the X11 error handler, which means that the 3D application would have to install its own X11 error handler in order to deal with X11 errors in a thread-safe manner (since, per above, the default X11 error handler calls exit().) In other words, VirtualGL wouldn't introduce any calls to exit() that aren't already introduced by the GLX and X11 APIs. However, VirtualGL will generally not call safeExit() unless there is a very serious problem, a problem that would create much more serious problems if it allowed the 3D application to treat the error as recoverable (which the 3D application would be free to do if it installs its own X11 error handler.)

At some point, I have to fall back upon the fact that no one has reported this as an actual bug in the 18-year history of VirtualGL, so in order to change VGL's behavior, I need to be confident that I am not introducing real bugs in the name of eliminating hypothetical ones.

One of the reasons why VGL's global resource management is such a mess is MainWin, a Windows-on-Linux emulation layer used by certain proprietary CAD applications.

If it makes you feel any better, MainWin's demise was announced in Aug 2021 and scheduled for Dec 31, 2023. However, it appears that unsupported use of its binaries will still be possible after that time.

commented

Yes, I suspect that the applications that used MainWin have long since moved to something else, but the fact remains that I have to be really careful about modifying how VirtualGL constructs and deconstructs things at the global level. #214 is an example of unforeseen proprietary application breakage resulting from the elimination of VGL's global X11 display hash in VGL 3.0, and I'm not yet sure how to fix it. Such changes to VGL should ideally be accompanied by additional unit tests that simulate the types of workflows that would fail without the changes.

I suspect that the applications that used MainWin have long since moved to something else

That would be an incorrect supposition, sorry to say. :-/

So, just to clarify, you're saying that the danger is that the calling program (the 3D application) could create global resources, and those resources won't be cleaned up properly if VirtualGL calls exit()?

Nope, I am saying this: A library that can be used by a multi-threaded application should never call exit().

Here is a demo. The example program starts a thread that does stuff, and then the main thread calls libraryFunctionThatCallsExitOnSomeUnrecoverableError():

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>

using namespace std::chrono_literals;

struct SlowToDie {
    ~SlowToDie() { std::this_thread::sleep_for(500ms); }
};

struct Logger {
    ~Logger() { std::cout << "****Logger destroyed*****\n"; }
    void log(const std::string &s) {
        std::cout << s;
        std::this_thread::sleep_for(100ms);
    }
};

SlowToDie slow_to_die;
Logger logger;

void libraryFunctionThatCallsExitOnSomeUnrecoverableError()
{
    std::this_thread::sleep_for(500ms);
    // Oops, something bad happened; let's write a message and call exit()
    std::cout << "Something bad happened. Calling exit()\n";
    exit(1);
}

int main(int argc, char **argv)
{
    std::thread t([] {
        for (int i = 0; i < 100; ++i)
        {
            logger.log("hello\n");
        }
    });

    libraryFunctionThatCallsExitOnSomeUnrecoverableError();

    t.join();
    return 0;
}

Running this doesn't crash on my machine, but it could, as logger.log() is called after the logger object has been destroyed:

$ g++ -Wall -g -o example example.cc && ./example
hello
hello
hello
hello
hello
Something bad happened. Calling exit()
****Logger destroyed*****
hello
hello
hello
hello
hello

The actual situation I have observed is that the host runs out of GPU memory and then virtualgl prints this message:

[VGL] ERROR: glCheckFramebufferStatus() error 0x8cdd
[VGL] ERROR: in createBuffer
[VGL] 160: FBO is incomplete

and that is followed by a crash in some arbitrary thread because it touches a global object or something with a lifetime managed by a global object.

commented

The actual situation I have observed is that the host runs out of GPU memory and then virtualgl prints this message:

[VGL] ERROR: glCheckFramebufferStatus() error 0x8cdd
[VGL] ERROR: in createBuffer
[VGL] 160: FBO is incomplete

and that is followed by a crash in some arbitrary thread because it touches a global object or something with a lifetime managed by a global object.

That would have been a good thing to mention in the initial issue report, rather than describing the issue as if it were purely hypothetical. Now please answer the specific questions I posed above regarding why and how _exit() improves upon that. Surely the global resources will eventually be cleaned up as well if _exit() is called? So the issue is a matter of when those resources are cleaned up? I am trying to understand how best to address this in a way that won't break VirtualGL.

That would have been a good thing to mention in the initial issue report, rather than describing the issue as if it were purely hypothetical.

Sorry, you are right; that was a bad omission on my part. I forgot to carry that over from the issue in our Jira.

Now please answer the specific questions I posed above regarding why and how _exit() improves upon that. Surely the global resources will eventually be cleaned up as well if _exit() is called?

Excerpt from man 3 exit: "All functions registered with atexit(3) and on_exit(3) are called, in the reverse order of their registration." Normally this is done at the end of main(), but when exit() is called by a library function it is possible for the program to be in a state where it is not safe to exit. I have provided an example where threads can crash, but it is also possible that the main thread or another thread is holding a resource that is needed by a function registered with atexit() in which case the program may hang in a deadlock.

Note that man 3 exit doesn't mention it, but for C++ programs all objects with static storage duration (global variables) are also destroyed by calling exit().

When std::quick_exit() or _exit() is used, the program doesn't destroy objects with static storage duration or call functions registered with atexit(). Instead, the process exits immediately. The kernel will of course close all file descriptors and release all other resources that are associated with the process and managed by the kernel.

commented

Note that man 3 exit doesn't mention it, but for C++ programs all objects with static storage duration (global variables) are also destroyed by calling exit().

OK, but when are objects with static storage duration destroyed if an application (or VGL) calls _exit()? Surely they are destroyed eventually. When and how does that happen?

commented

To clarify: all dynamically allocated memory is eventually freed when an application exits. Presumably that happens within the body of _exit(), but what you said above suggests that the memory may be freed haphazardly rather than invoking the global destructors to allow those objects to shut down properly. That seems to me just as wrong an approach as what VGL is currently doing.

_exit() does not conduct an orderly shutdown, but neither does exit(). Neither unwinds the stack, and lots of things have stack-managed lifetimes, hopefully more than are managed by global variables. With _exit(), the program pretty much just sets the exit status and returns control to the kernel, which cleans things up.

commented

Well, since we're discussing hypothetical problems, consider a hypothetical single-threaded 3D application that creates a shared memory segment or a temporary file via a global static class instance. When VGL calls exit(), the destructor for that global static class instance will be called, giving the application an opportunity to destroy the shared memory segment or temporary file. If VGL called _exit() instead, then the shared memory segment or temporary file would remain, and the kernel would not clean it up.

I frankly don't think that either approach (the existing approach or using _exit()) is the "right thing to do", but I'm not sure what the right thing is. Any other GLX implementation would simply invoke the X11 error handler and return. I suppose that VGL could, upon encountering a fatal faker error,

  • Clean up its own resources.
  • Invoke the X11 error handler. (BadImplementation seems like the only X11 error code that might fit.)
  • Set an error flag so that, if the 3D application ignores the X11 error or treats it as a non-fatal warning, any attempt to invoke an OpenGL, X11, XCB, OpenCL, or EGL function after that point would result in another X11 error.

With the default X11 error handler, this would not change VGL's behavior. (It would still invoke exit().) But it would at least give 3D applications the option of setting up an X11 error handler and dying gracefully. Of course, if a 3D application chose to ignore the X11 error or treat it as a non-fatal warning, then the application would probably go into an endless loop of spewing X11 errors, which isn't particularly friendly either.

Well, since we're discussing hypothetical problems, consider a hypothetical single-threaded 3D application that creates a shared memory segment or a temporary file via a global static class instance. When VGL calls exit(), the destructor for that global static class instance will be called, giving the application an opportunity to destroy the shared memory segment or temporary file. If VGL called _exit() instead, then the shared memory segment or temporary file would remain, and the kernel would not clean it up.

Yep. On the other hand, the application may look like this:

int main() {
    // TemporaryFile's destructor will clean up
    TemporaryFile temp_file;

    exit(1);
}

That temporary file will not be cleaned up. The example above is trivial, but in non-trivial programs there will be lots of resources that are built such that they are released when scopes are exited as the stack is unwound.

I think reporting errors back to the application via the API at hand (X11/GLX) is the right approach, especially if the error handler is synchronous. That handler may call exit() if that suits the application in question, but I'd call std::quick_exit() in mine to avoid a crash moments later.

I don't know how difficult that would be to do. If it is difficult, perhaps there could just be an environment variable, VGL_FATAL_ERROR_HANDLER, which can be set to "exit", "_exit", or "abort"?

commented

It wouldn't be terribly difficult, since VGL already has the infrastructure in place to generate its own X11 errors. The error handler will be synchronous, in the sense that it is guaranteed to be invoked before the interposed function returns. (More specifically, VGL will call _XError() from Xlib, and _XError() will call the X11 error handler, if one is in place.)

Concerns:

  • I would also need to figure out how best to handle fatal faker errors that occur in the body of interposed OpenGL, OpenCL, XCB, and EGL functions.
    • OpenGL provides no way for calling applications to set the OpenGL error (the error can only come from inside the OpenGL implementation), and the vast majority of OpenGL applications treat OpenGL errors as non-fatal. I suppose that I could do something artificial such as calling an OpenGL function with a bad argument, thus leaving a GL_INVALID_OPERATION error in the OpenGL error queue, then I could set the aforementioned faker error flag to ensure that any subsequent calls to interposed GLX and X11 functions generate an X11 error (so eventually the X11 error handler would be invoked.)
    • With EGL functions, I could communicate the error back to the application via the function's return value, but the EGL API lacks a suitable error code.
    • Not all XCB functions have a return value, and some of those functions are not expected to generate errors at all. Some XCB functions are also expected to always return a valid pointer, as opposed to returning NULL to indicate an error condition.
  • In essence, anything I do in this regard is going to be a violation of the API specs. Throwing an X11 BadImplementation error seems OK to me, because that error code is supposed to indicate that the X server doesn't support the X client's request. (A fatal faker error resulting in a faker shutdown effectively disconnects the X client from the 3D X server or the EGL-emulated 3D X server, so beyond that point, the "X server" won't support any request from a VGL-interposed function.) However, EGL, XCB, and OpenGL are very specific in regard to which errors can be generated by which functions.

commented

Unfortunately the idea of invoking API-specific error handlers, instead of exit(), for critical faker errors has some nasty implications. There are some low-level errors, such as the inability to find necessary symbols in the underlying libraries, that simply can't be handled, because those errors would prevent the APIs from being interposed by VGL at all. Also, there is a high likelihood that 3D applications would ignore API errors (particularly from OpenGL and XCB and maybe EGL as well), so there is a high likelihood that the changes I proposed above would break more than they would fix.

Potentially acceptable solutions:

  1. the solution you proposed above, whereby an environment variable is used to specify that VirtualGL should call _exit() instead of exit() for critical faker errors (is abort() really necessary?)
  2. a more generic solution whereby an environment variable is used to specify the name of an application-specific error callback function that VirtualGL should load (via dlsym()) and invoke instead of exit().
  3. both of the above

I like both of those solutions, because they don't change the default behavior of VirtualGL (which works fine for the vast majority of applications.) Solution 1 would allow users to work around issues with specific 3D applications without changing the application source code. Solution 2 would allow developers to handle VGL errors and shut down their own 3D applications cleanly. I could see the need for both, but Solution 1 is a lot more foolproof. I can foresee several issues with Solution 2, including:

  • determining an appropriate callback API (and handling the potential need to change that API in the future), although maybe it's sufficient to just require that the callback have the same function signature as exit()
  • the fact that a callback won't really let applications unwind the stack (but neither would the X11 error handler, which also uses a callback)

I could also implement Solution 1 now and allow for it to be expanded (via the use of the same environment variable) to encompass Solution 2 later on.

I agree with all your points above. Keeping it simple is very attractive. I think abort() is a very nice option for unrecoverable errors as it allows you to get a coredump for later inspection. This can be helpful in development, but essential for errors that are only observed in production.

commented

It occurs to me that Solution 2 is unnecessary, because if an application is going to go to the trouble of explicitly registering an error handler for VGL, it could simply register an exit handler via atexit() in order to accomplish the same thing. I recognize that Solution 1 is still useful, because it changes VGL's exit behavior without requiring applications to modify their source code, but I also wonder aloud why you can't register an exit handler in your application. It is perfectly legal to call _exit() from within an exit handler.

commented

Solution 1 has been implemented via a new environment variable (VGL_EXITFUNCTION.)

Tested while testing #218. It works like a charm, and "abort" was very useful for getting a stack trace where VGL raises the error.