VirtualGL / virtualgl

Main VirtualGL repository

Home Page: https://VirtualGL.org


Usage of EGL API while faking GLX - the GLX context gets lost

peci1 opened this issue

Gazebo is an app that allows the user to choose between GLX- and EGL-based rendering. For some use cases, however, it would be more practical to use GLX faked by VirtualGL's EGL back end.

In issue gazebosim/gz-rendering#526 we are debugging why the GLX backend loses the current GL context in some setups.

I've traced it down to Gazebo first calling glXMakeCurrent() for the GLX context and then probing EGL availability by calling eglMakeCurrent() on all available EGL devices (this is not intentional; it's the way the OGRE rendering framework tests EGL Pbuffer support, and there's probably no easy way around it). One of the eglMakeCurrent() calls, however, resets the GLX context, which is thus lost.

The workaround on Gazebo's side is straightforward: just store the GLX context before the EGL probing and restore it afterwards.
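
For reference, the workaround boils down to something like this around the probing code (a minimal sketch; it mirrors the two // FIX lines in the MWE below):

// Save whatever GLX state is current before the EGL probing runs.
GLXContext savedCtx = glXGetCurrentContext();
Display *savedDpy = glXGetCurrentDisplay();
GLXDrawable savedDrw = glXGetCurrentDrawable();

// ... EGL device probing (eglMakeCurrent() etc.) happens here ...

// Restore the GLX context that the probing clobbered.
if (savedCtx)
    glXMakeCurrent(savedDpy, savedDrw, savedCtx);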

Thinking about a proper solution, I first thought VirtualGL could do nothing about this. But then it occurred to me that it could actually report some part of the EGL API as unavailable for the card/display that is used for GLX faking. Would that make sense (at least as a configurable option)? I'm not, however, skillful enough to tell which part of the EGL API would need to be disabled. Actually, it seems to me that some avoidance (or even EGL emulation) is already attempted, but it is probably not enough in this case?

I've assembled an MWE (minimal working example) showing the behavior. With the two lines marked with the comment // FIX, everything works as expected; that's the store-and-restore context workaround. If you comment out these two lines, the current GLX context is lost after the first eglMakeCurrent() call.

Compile with g++ -o mwe mwe.cpp -lGL -lGLU -lX11 -lEGL.

#include <GL/glx.h>
#include <X11/Xlib.h>
#include <X11/Xutil.h>
#include <iostream>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <vector>

int main(int argc, char** argv)
{

// Open the X display and choose a double-buffered RGBA FBConfig.
auto dummyDisplay = XOpenDisplay(0);
Display *x11Display = static_cast<Display*>(dummyDisplay);
int screenId = DefaultScreen(x11Display);
int attributeList[] = {
    GLX_RENDER_TYPE, GLX_RGBA_BIT,
    GLX_DOUBLEBUFFER, True,
    GLX_DEPTH_SIZE, 16,
    GLX_STENCIL_SIZE, 8,
    None
};
int nelements = 0;
auto dummyFBConfigs = glXChooseFBConfig(x11Display, screenId, attributeList, &nelements);
auto dummyWindowId = XCreateSimpleWindow(x11Display, RootWindow(dummyDisplay, screenId), 0, 0, 1, 1, 0, 0, 0);
PFNGLXCREATECONTEXTATTRIBSARBPROC glXCreateContextAttribsARB = 0;
glXCreateContextAttribsARB = (PFNGLXCREATECONTEXTATTRIBSARBPROC)glXGetProcAddress((const GLubyte *)"glXCreateContextAttribsARB");
int contextAttribs[] = {
        GLX_CONTEXT_MAJOR_VERSION_ARB, 3,  //
        GLX_CONTEXT_MINOR_VERSION_ARB, 3,  //
        None                               //
};
auto dummyContext = glXCreateContextAttribsARB(x11Display, dummyFBConfigs[0], nullptr, 1, contextAttribs);
// Create the GLX context and set it as current
GLXContext x11Context = static_cast<GLXContext>(dummyContext);
glXMakeCurrent(x11Display, dummyWindowId, x11Context);
std::cerr << glXGetCurrentContext() << " " << glXGetCurrentDrawable() << " " << glXGetCurrentDisplay() << std::endl;


// Probe all EGL devices, mimicking OGRE's EGL Pbuffer support check.
typedef EGLBoolean ( *EGLQueryDevicesType )( EGLint, EGLDeviceEXT *, EGLint * );
auto eglQueryDevices = (EGLQueryDevicesType)eglGetProcAddress( "eglQueryDevicesEXT" );
auto eglQueryDeviceStringEXT = (PFNEGLQUERYDEVICESTRINGEXTPROC)eglGetProcAddress( "eglQueryDeviceStringEXT" );
EGLint numDevices = 0;
eglQueryDevices( 0, 0, &numDevices );

std::vector<EGLDeviceEXT> mDevices;
if( numDevices > 0 )
{
      mDevices.resize( static_cast<size_t>( numDevices ) );
      eglQueryDevices( numDevices, mDevices.data(), &numDevices );
}

for( int i = 0u; i < numDevices; ++i )
{
   EGLDeviceEXT device = mDevices[size_t( i )];
   auto name = std::string(eglQueryDeviceStringEXT( device, EGL_EXTENSIONS ));

   const char *gpuCard = eglQueryDeviceStringEXT( device, EGL_DRM_DEVICE_FILE_EXT );
   if( gpuCard ) name += std::string(" ") + gpuCard;

   std::cerr << i << " " << name << std::endl;

   EGLAttrib attribs[] = { EGL_NONE };
   auto eglDisplay = eglGetPlatformDisplay( EGL_PLATFORM_DEVICE_EXT, mDevices[i], attribs );
   EGLint major = 0, minor = 0;
   eglInitialize( eglDisplay, &major, &minor );

   const EGLint configAttribs[] = {
            EGL_SURFACE_TYPE,    EGL_PBUFFER_BIT, EGL_BLUE_SIZE, 8, EGL_GREEN_SIZE, 8, EGL_RED_SIZE, 8,
            EGL_RENDERABLE_TYPE, EGL_OPENGL_BIT,  EGL_NONE
   };

   EGLint numConfigs;
   EGLConfig eglCfg;
   eglChooseConfig( eglDisplay, configAttribs, &eglCfg, 1, &numConfigs );
   const EGLint pbufferAttribs[] = {
            EGL_WIDTH, 1, EGL_HEIGHT, 1, EGL_NONE,
   };
   auto eglSurf = eglCreatePbufferSurface( eglDisplay, eglCfg, pbufferAttribs );
   eglBindAPI( EGL_OPENGL_API );
   EGLint contextAttrs[] = {
            EGL_CONTEXT_MAJOR_VERSION,
            4,
            EGL_CONTEXT_MINOR_VERSION,
            5,
            EGL_CONTEXT_OPENGL_PROFILE_MASK,
            EGL_CONTEXT_OPENGL_CORE_PROFILE_BIT_KHR,
            EGL_NONE
   };

   // Create the EGL context and make it current
   auto eglCtx = eglCreateContext( eglDisplay, eglCfg, 0, contextAttrs );
   std::cerr << glXGetCurrentContext() << " " << glXGetCurrentDrawable() << " " << glXGetCurrentDisplay() << std::endl;
   auto ctx = glXGetCurrentContext(); auto dpy = glXGetCurrentDisplay(); auto drw = glXGetCurrentDrawable();  // FIX
   eglMakeCurrent( eglDisplay, eglSurf, eglSurf, eglCtx );
   glXMakeCurrent(dpy, drw, ctx);  // FIX
   std::cerr << glXGetCurrentContext() << " " << glXGetCurrentDrawable() << " " << glXGetCurrentDisplay() << std::endl;
}

}
commented

Thinking about a proper solution, I first thought VirtualGL could do nothing about this. But then it occurred to me that it could actually report some part of the EGL API as unavailable for the card/display that is used for GLX faking. Would that make sense (at least as a configurable option)?

Probably the best that VGL could do is as follows:

  • Return EGL_FALSE and set the EGL error to EGL_BAD_CONTEXT if eglMakeCurrent() is passed an EGL context that is really an emulated GLX context (in other words, if the EGL context is registered with ContextHashEGL).
  • Return 0 from eglGetCurrentDisplay(), eglGetCurrentSurface(), and eglGetCurrentContext() if the current EGL context is really an emulated GLX context. (Note that this would require interposing eglGetCurrentContext(), which we don't currently do.) Both checks are sketched below.
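
The following is only a hypothetical sketch, not actual VirtualGL code; isEmulatedGLXContext() stands in for a ContextHashEGL lookup, and the real*() functions stand in for calls down into the underlying EGL library.

EGLBoolean eglMakeCurrent(EGLDisplay dpy, EGLSurface draw, EGLSurface read,
                          EGLContext ctx)
{
    // Refuse to bind an EGL context that is really an emulated GLX context.
    if (isEmulatedGLXContext(ctx))
    {
        setEGLError(EGL_BAD_CONTEXT);  // hypothetical helper
        return EGL_FALSE;
    }
    return realEGLMakeCurrent(dpy, draw, read, ctx);
}

EGLContext eglGetCurrentContext(void)
{
    EGLContext ctx = realEGLGetCurrentContext();
    // Hide emulated GLX contexts from the EGL side of the API.
    // (eglGetCurrentDisplay() and eglGetCurrentSurface() would behave analogously.)
    return isEmulatedGLXContext(ctx) ? EGL_NO_CONTEXT : ctx;
}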

Would that be sufficient to fix the problem from Gazebo's point of view?

commented

Never mind. I see now that that wouldn't fix the problem, and I unfortunately can't see any way to fix it. The application seems to be relying on a clean separation of OpenGL states between the two APIs, but I'm not sure if that's a valid assumption in general or whether it's implementation-specific. In any case, VirtualGL has no way to implement that separation when using the EGL back end.

commented

Also, I don't see why disabling the EGL API while a GLX context is current would be the right approach either, since that would cause the aforementioned EGL Pbuffer test to fail artificially.

Thank you for your ideas. I was not sure whether I was requesting a sane thing or not.

Gazebo will always use either GLX or EGL, never both. However, at startup it does the probing, which scrambles the already created GLX context (as GLX is probed first). So for Gazebo, reporting no EGL support when GLX is already being interposed would make sense.

However, I'm not sure how mixing the two together stands in general: whether it is something valid, or whether it is always nonsense. Do you know of any example where both GLX and EGL would be used in a single app (other than Gazebo)?

commented

I can't think of many reasons why an application would want to use both GLX and EGL, and I can't think of any reasons why an application would want to bind both types of contexts simultaneously. I'm not sure if that behavior is even explicitly defined, particularly if the EGL API is bound to the desktop OpenGL API. Only one OpenGL context can be current in the same thread at the same time, so after the call to eglMakeCurrent(), all desktop OpenGL commands should be directed to the EGL context and not the GLX context. Thus, even if glXGetCurrent*() returned values other than 0, those values would be meaningless. The OpenGL commands would not actually be directed to the display or drawables returned by glXGetCurrent*(), and the GLX context returned by glXGetCurrentContext() would not actually be current from the point of view of OpenGL. This is even more true because, in this specific case, device-based EGL is used. Thus, there is no guarantee that the EGL and GLX APIs are even addressing the same GPU or vendor libraries.
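
One way to see this directly is to query the renderer string around the eglMakeCurrent() call in the MWE above. (This is a minimal sketch that assumes the MWE's variable names; glGetString() always reports on whichever context is current in the calling thread.)

const GLubyte *before = glGetString(GL_RENDERER);  // the GLX context is current here
std::cerr << "Before eglMakeCurrent(): "
          << (before ? reinterpret_cast<const char *>(before) : "(no context)") << std::endl;

eglMakeCurrent(eglDisplay, eglSurf, eglSurf, eglCtx);

const GLubyte *after = glGetString(GL_RENDERER);  // now reports on the EGL context
std::cerr << "After eglMakeCurrent(): "
          << (after ? reinterpret_cast<const char *>(after) : "(no context)") << std::endl;

If the EGL display was obtained from a different device than the one driving the GLX context, the two strings will generally differ, which illustrates that desktop OpenGL commands are now directed to the EGL context.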

When using the EGL back end, what happens in the example is that, because eglMakeCurrent() is passed a display handle that doesn't correspond to an X11 display, VGL disables all of its interposers in that thread for as long as that context is current. (Only the EGL/X11 API needs to be emulated in a remote display environment. Device-based EGL works fine and needs no emulation/interposition.) Thus, the subsequent calls to glXGetCurrent*() are passed through to the underlying GLX implementation rather than emulated. If you are using an X proxy, then the underlying GLX implementation is Mesa, which means that it won't know about the GLX context at all (since the prior call to glXMakeCurrent() was emulated using device-based EGL and probably directed to a different set of vendor libraries). Thus, the glXGetCurrent*() calls will return 0.

I think I can make VGL behave as Gazebo expects by:

  • maintaining a separate thread-local exclusion variable for OpenGL, GLX, and EGL
  • storing the emulated GLX context handle in a thread-local variable in the EGL back end, rather than relying on eglGetCurrentContext()

Thus, if eglMakeCurrent() were passed a non-X11 display handle, the glXGetCurrent*() functions would be unaffected. A rough sketch of the idea follows.
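
This is a simplified, hypothetical sketch rather than the actual patch (the real interposers track considerably more state):

// Thread-local interposer state (hypothetical):
thread_local bool excludeOpenGL = false, excludeGLX = false, excludeEGL = false;
thread_local GLXContext emulatedGLXContext = nullptr;

GLXContext glXGetCurrentContext(void)
{
    // Answer from our own bookkeeping rather than asking the underlying
    // implementation, so device-based EGL usage cannot perturb the result.
    if (excludeGLX)
        return realGLXGetCurrentContext();  // hypothetical pass-through
    return emulatedGLXContext;
}

// In the eglMakeCurrent() interposer, a non-X11 (device-based) display handle
// would set only excludeOpenGL and excludeEGL for this thread and pass the call
// through; excludeGLX and emulatedGLXContext would be left untouched.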

Here is a patch that attempts to accomplish this.

However, before I am comfortable committing the patch, I need to understand a couple of things:

  1. Is the patch sufficient to fix the problem from Gazebo's point of view?
  2. Is there any documentation that suggests what the correct behavior is in this case? I don't want to emulate behavior that is in fact incorrect, and I think I made a pretty good case above that it is more correct for glXGetCurrent*() to return 0 when an EGL context bound to the OpenGL API is current.

I would also need to perform my own testing to make sure that the patch has no unforeseen negative consequences. That won't happen until I get answers to the two questions above, and because of the holidays, it probably won't happen this week in any case.

Thank you for the patch.

I verified all combinations of faking via EGL/GLX and forcing Gazebo to use either GLX- or EGL-based rendering, and all possible combinations work. So even the EGL probing process is not disrupted by this patch.

Regarding question 2: could we say the correct behavior is whatever happens on a system with non-faked rendering? If so, then the MWE I've provided demonstrates exactly what should happen (the GLX context is unaffected by the EGL calls). I'm not sure, though, whether the behavior I'm seeing is platform-specific or whether it behaves the same on all GPUs. My initial tests were done on a notebook with an AMD Ryzen iGPU. Now I have tested it on a desktop with an NVIDIA 3090, and it behaves the same.

commented

I verified all combinations of faking via EGL/GLX and forcing Gazebo to use either GLX- or EGL-based rendering, and all possible combinations work. So even the EGL probing process is not disrupted by this patch.

With the patch, does the EGL probing process produce the same results as it would on a local machine without VGL?

Regarding question 2: could we say the correct behavior is whatever happens on a system with non-faked rendering? If so, then the MWE I've provided demonstrates exactly what should happen (the GLX context is unaffected by the EGL calls). I'm not sure, though, whether the behavior I'm seeing is platform-specific or whether it behaves the same on all GPUs. My initial tests were done on a notebook with an AMD Ryzen iGPU. Now I have tested it on a desktop with an NVIDIA 3090, and it behaves the same.

I observed the same thing with my Quadros and a FirePro, so the behavior is at least the de facto standard among the most popular implementations, but that doesn't necessarily mean that it's correct. I can think of several reasons why returning anything other than 0 from glXGetCurrent*() would be dangerous if an EGL context bound to the desktop OpenGL API is current:

  1. The calling program might assume that it can import fences from the display returned by glXGetCurrentDisplay() and use those for OpenGL synchronization purposes (via GL_EXT_x11_sync_object).
  2. The calling program might assume that, because glXGetCurrentContext() returns a context handle, it can use glXWaitGL() or glXWaitX() (or any number of other GLX functions that don't take a Display argument) with the current context.
  3. The calling program might assume that, because glXGetCurrentDrawable() returns a drawable ID, it can obtain the OpenGL-rendered pixels via XGetImage() or other X11 functions (see the sketch below).

All of those assumptions are patently false, irrespective of whether VGL is used.
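
To take the third assumption as an example, code like the following compiles and runs, but it never reads the pixels rendered through the current (device-based) EGL context, with or without VGL:

// WRONG: assumes the current GLX drawable holds the pixels rendered through
// the EGL context that is actually current in this thread.
Display *dpy = glXGetCurrentDisplay();
GLXDrawable drw = glXGetCurrentDrawable();
if (dpy && drw)
{
    XImage *img = XGetImage(dpy, drw, 0, 0, 1, 1, AllPlanes, ZPixmap);
    // Even if img is non-NULL, it has nothing to do with the OpenGL rendering
    // performed in the EGL Pbuffer; the two are unrelated surfaces.
    if (img) XDestroyImage(img);
}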

With the patch, does the EGL probing process produce the same results as it would on a local machine without VGL?

Yes. Exactly the same.

I understand why you'd rather return an invalid context after GLX and EGL have been mixed in a single program. Feel free not to fix this issue (or to hide the fix behind a CLI flag). Until somebody comes up with a proper example of an app using both GLX and EGL, I think it is hard to say what should happen. Gazebo uses both at the beginning, during the probing process, but then it sticks with one of the APIs. I'll push a fix to Gazebo that makes sure the GLX context is restored after the EGL probing.

commented

I have no problem integrating the fix as long as I can see documentation that at least suggests that I am doing the right thing. Let me think about it and do more googling before you go to the trouble of pushing a fix for Gazebo.

commented

The aforementioned patch has been integrated into the 3.1 (main) branch, and a subset of the patch that was applicable to VGL 3.0 has been integrated into the 3.0.x branch. Please verify that everything still works on your end with the latest Git commits. (If you could test both branches, that would be great.)

Thank you!

I did thorough testing, and everything looks good; no problems noticed. In particular, I tested the Cartesian product of these options:

  • VGL 3.0.x branch | VGL main branch
  • Gazebo with fix gazebosim/gz-rendering#794 | Gazebo without the fix
  • VGL faking via the following devices: egl0, egl1, /dev/dri/card0, :1 | No VGL
  • Gazebo using GLX renderer | Gazebo set to choose EGL via --headless-rendering option
  • Gazebo server with rendering sensors | Gazebo GUI
  • Ubuntu 18.04
  • AMD Ryzen iGPU

In all cases, I ran the server with rendering sensors, the GUI, and watched the output of one of the rendering sensors in the GUI.

Just to explain a bit more why Gazebo does the probing the way it does: it is actually not done by Gazebo directly but by the OGRE 2.2 rendering engine's GlSwitchableSupport class. This class is meant to provide a generic GL interface where the rendering device and driver can be selected via runtime options, so it is technically not specific to Gazebo but applies to any OGRE-based app. However, it is apparent from the commit history that GlSwitchableSupport was added because of Gazebo, and I haven't found any example of it being used elsewhere...