JVM crashes on setting callback for GTK3 signals
praj-foss opened this issue · comments
Hello there!
I'm currently learning JNR by trying out various Linux libraries, most recently GTK3. I used this example as a reference and wrote the new demo that can be found here. But it crashes badly when I try to run it (using ./gradlew gtk3:run). Here's the crash log: hs_err_pid7667.log. I use GraalVM 21.1.0 as my JDK 11, on an x86_64 Linux machine (openSUSE Tumbleweed). My installed GTK version is 3.24.30-2.3.
I can see that it crashes on line 31 of Gtk3App.java, where I call from Java:
lib.g_signal_connect_data(application, "activate", onActivate, null, null, 0);
The onActivate is a lambda looking like this:
LibGtk3.GCallback onActivate = (app, data) -> {
    var window = lib.gtk_application_window_new(app);
    var button = lib.gtk_button_new_with_label("Click me");
    lib.gtk_container_add(window, button);
    lib.gtk_widget_show_all(window);
};
which is supposed to act like a function pointer similar to on_app_activate from my C reference:
// callback function which is called when application is first started
static void on_app_activate(GApplication *app, gpointer data) {
    // create a new application window for the application
    // GtkApplication is sub-class of GApplication
    // downcast GApplication* to GtkApplication* with GTK_APPLICATION() macro
    GtkWidget *window = gtk_application_window_new(GTK_APPLICATION(app));
    // a simple push button
    GtkWidget *btn = gtk_button_new_with_label("Click Me!");
    // connect the event-handler for "clicked" signal of button
    g_signal_connect(btn, "clicked", G_CALLBACK(on_button_clicked), NULL);
    // add the button to the window
    gtk_container_add(GTK_CONTAINER(window), btn);
    // display the window
    gtk_widget_show_all(GTK_WIDGET(window));
}
I also had a look at #231 and read the suggestions there to define onActivate as a public static final variable, but that still didn't stop the crash. I don't have much idea about why it's crashing; my previous example seemed to work fine with callbacks. It might be an issue specific to GTK3 and its thread management, or to using GraalVM as the JDK, but again I have zero ideas. Please try running the example if you're on a Linux machine and let me know where the problem is.
I suspect there's an alignment or width issue with the arguments, but we need to dig deeper to know for sure.
Can you provide an example, perhaps as a small repository, that I can build and use to reproduce this?
Sure, you can see the repository at https://github.com/praj-foss/jnr-demo. The target code is present under the gtk3 directory, and you can try running it using ./gradlew gtk3:run.
Also, there are some changes: I used jnr.ffi.ObjectReferenceManager to store a pointer to my original callback (suggested by the discussion in #231), and used that to pass the function pointer to native methods. It still crashes but the stack trace is different now. Please have a look at the updated crash log: hs_err_pid21253.log
On a side note, I'm actually writing JNR examples for my blog and I'd be happy to contribute to the official docs/examples. Please let me know if I can be of any help.
I have managed to reproduce on MacOS and @enebo is confirming that it reproduces on Linux.
If you are good with C libraries, getting a debug build of GTK3 and seeing where it segfaults would clearly be a great help.
On a side note, I'm actually writing JNR examples for my blog and I'd be happy to contribute to the official docs/examples. Please let me know if I can be of any help.
That would be fantastic! We do not get a lot of time to document the library, and our uses of JNR are pretty stable and do not require much maintenance so we rarely run into the edge cases users like you will see.
Interestingly, setting the callback to null, so it would be passed in as a null pointer, produces a different result: gtk catches the null handler and asserts:
(process:76291): GLib-GObject-CRITICAL **: 15:31:39.822: g_signal_connect_data: assertion 'c_handler != NULL' failed
Seems to indicate that it is not necessarily the callback getting nulled out, since it should catch that. Bad memory location? Already collected and not honoring our attempts to keep the handler referenced?
This investigation is hampered by the fact that it seems the g_closure_marshal_VOID__VOID function is generated code. Might need to loop in someone more familiar with GTK internals to get a good picture of what is happening here.
We have not had other reports of callbacks leading to SEGV so I am left speculating why this function seems to be getting a bad pointer.
Disabling jnr-ffi's x86_64 ASM generation does not appear to improve the situation, assuming it is being passed through.
However... I looked closer at the error dumps and I'm seeing RAX set to this implausible value:
RAX=0xcafebabe778d1062 is an unknown value
Unknown indeed. The hex cafebabe is used as the first four bytes of the Java .class format, but as far as I know it should not appear in any pointer references in memory. So this seems to be passing along some bogus data.
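For anyone who wants to scan other crash logs for the same pattern, here is a minimal sketch of that check in plain Java. The class and method names are hypothetical and it does not use jnr-ffi at all. It only tests whether the upper 32 bits of a 64-bit register value equal the .class magic number, which is what makes the RAX value above look like a tagged token rather than a real address.

```java
final class CafeBabeCheck {
    // Magic number that opens every Java .class file.
    static final long CLASS_FILE_MAGIC = 0xCAFEBABEL;

    /** Returns true when the upper 32 bits of the value equal 0xCAFEBABE. */
    static boolean hasCafeBabePrefix(long value) {
        return (value >>> 32) == CLASS_FILE_MAGIC;
    }

    /** Parses a hex literal such as "0xcafebabe778d1062" as an unsigned 64-bit value. */
    static long parseRegister(String hex) {
        String digits = (hex.startsWith("0x") || hex.startsWith("0X")) ? hex.substring(2) : hex;
        return Long.parseUnsignedLong(digits, 16);
    }

    public static void main(String[] args) {
        // Register value copied from the crash log quoted above.
        long rax = parseRegister("0xcafebabe778d1062");
        System.out.println(hasCafeBabePrefix(rax)); // prints "true"
    }
}
```

An ordinary user-space address on x86_64, such as the 0x00007fff... values in the stack traces in this thread, fails the check.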
This seems to be the source of the bogus pointer value:
I believe this would indicate that either the DefaultObjectReferenceManager is not working properly, or this code is not using it properly.
@praj-foss Ok, this may be a flaw in how you are using the API, but I do not know enough about GTK to be certain.
I modified your final code to not use the pointer value returned, and it seems to get much further... far enough to trigger a different, probably MacOS-specific error:
diff --git a/gtk3/src/main/java/in/praj/demo/Gtk3App.java b/gtk3/src/main/java/in/praj/demo/Gtk3App.java
index 8195466..eaaf284 100644
--- a/gtk3/src/main/java/in/praj/demo/Gtk3App.java
+++ b/gtk3/src/main/java/in/praj/demo/Gtk3App.java
@@ -23,17 +23,18 @@ public class Gtk3App {
lib.gtk_get_major_version(), lib.gtk_get_minor_version(), lib.gtk_get_micro_version());
var application = lib.gtk_application_new("in.praj.demo.Gtk3App", 0);
- var onActivate = refs.add((LibGtk3.GCallback) (gobject, data) -> {
+ LibGtk3.GCallback callback = (gobject, data) -> {
var window = lib.gtk_application_window_new(gobject);
var button = lib.gtk_button_new_with_label("Click me");
lib.gtk_container_add(window, button);
lib.gtk_widget_show_all(window);
- });
+ };
+ var callbackKey = refs.add(callback);
- lib.g_signal_connect_data(application, "activate", onActivate, null, null, 0);
+ lib.g_signal_connect_data(application, "activate", callback, null, null, 0);
lib.g_application_run(application, 0, null);
- refs.remove(onActivate);
+ refs.remove(callbackKey);
lib.g_object_unref(application);
}
}
diff --git a/gtk3/src/main/java/in/praj/demo/LibGtk3.java b/gtk3/src/main/java/in/praj/demo/LibGtk3.java
index 72e3e3a..1c5f7ab 100644
--- a/gtk3/src/main/java/in/praj/demo/LibGtk3.java
+++ b/gtk3/src/main/java/in/praj/demo/LibGtk3.java
@@ -13,7 +13,7 @@ public interface LibGtk3 {
@u_int64_t long g_signal_connect_data(
Pointer instance,
String detailed_signal,
- Pointer c_handler,
+ GCallback c_handler,
Pointer data,
Pointer destroy_data,
int connect_flags);
> Task :gtk3:run FAILED
GTK version: 3.24.30
2021-11-22 19:46:32.589 java[81207:10304347] WARNING: NSWindow drag regions should only be invalidated on the Main Thread! This will throw an exception in the future. Called from (
0 AppKit 0x00007fff22d96ed1 -[NSWindow(NSWindow_Theme) _postWindowNeedsToResetDragMarginsUnlessPostingDisabled] + 352
1 AppKit 0x00007fff22d81aa2 -[NSWindow _initContent:styleMask:backing:defer:contentView:] + 1296
2 AppKit 0x00007fff22d8158b -[NSWindow initWithContentRect:styleMask:backing:defer:] + 42
3 AppKit 0x00007fff2308b83c -[NSWindow initWithContentRect:styleMask:backing:defer:screen:] + 52
4 libgdk-3.0.dylib 0x00000001026da4bb -[GdkQuartzNSWindow initWithContentRect:styleMask:backing:defer:screen:] + 59
5 libgdk-3.0.dylib 0x00000001026e7479 _gdk_quartz_display_create_window_impl + 1225
6 libgdk-3.0.dylib 0x00000001026c52ef gdk_window_new + 959
7 libgtk-3.0.dylib 0x000000012d202052 gtk_window_realize + 1010
8 libgtk-3.0.dylib 0x000000012cf38ec0 gtk_application_window_real_realize + 96
9 libgobject-2.0.0.dylib 0x000000010276a325 _g_closure_invoke_va + 309
10 libgobject-2.0.0.dylib 0x0000000102781202 g_signal_emit_valist + 1266
11 libgobject-2.0.0.dylib 0x0000000102781d22 g_signal_emit + 130
12 libgtk-3.0.dylib 0x000000012d1de603 gtk_widget_realize + 291
13 libgtk-3.0.dylib 0x000000012d201641 gtk_window_show + 81
14 libgobject-2.0.0.dylib 0x000000010276a096 g_closure_invoke + 278
15 libgobject-2.0.0.dylib 0x0000000102780346 signal_emit_unlocked_R + 1110
16 libgobject-2.0.0.dylib 0x000000010278181e g_signal_emit_valist + 2830
17 libgobject-2.0.0.dylib 0x0000000102781d22 g_signal_emit + 130
18 libgtk-3.0.dylib 0x000000012d1ddd64 gtk_widget_show + 212
19 ??? 0x00000001027fd1e3 0x0 + 4336898531
)
2021-11-22 19:46:32.601 java[81207:10304347] *** Assertion failure in BOOL NSScreenConfigurationInvalidateIfNeededForReason(_NSScreenConfigurationUpdateReason)(), NSScreenConfiguration.m:464
2021-11-22 19:46:32.632 java[81207:10304347] *** Terminating app due to uncaught exception 'NSInternalInconsistencyException', reason: 'NSScreen reconfig must only happen on the main thread.'
*** First throw call stack:
(
0 CoreFoundation 0x00007fff205df1db __exceptionPreprocess + 242
1 libobjc.A.dylib 0x00007fff20318d92 objc_exception_throw + 48
2 CoreFoundation 0x00007fff20608352 +[NSException raise:format:arguments:] + 88
3 Foundation 0x00007fff214042ec -[NSAssertionHandler handleFailureInFunction:file:lineNumber:description:] + 166
4 AppKit 0x00007fff22efaae5 +[_NSScreenConfiguration invalidateConfigurationIfNeededForReason:] + 309
5 AppKit 0x00007fff22efa8e9 _NSApplicationInvalidateScreenConfigurationIfNeeded + 173
6 AppKit 0x00007fff22efa7f6 -[NSApplication(ScreenHandling) _reactToDockChanged] + 130
7 AppKit 0x00007fff22efa05b _NSCGSDockMessageReceive + 268
8 HIToolbox 0x00007fff287d1bb6 _ZL12DockCallbackjjPvS_ + 1987
9 HIServices 0x00007fff257fa1ee dockClientNotificationProc + 217
10 SkyLight 0x00007fff24d14e15 _ZN12_GLOBAL__N_123notify_datagram_handlerEj15CGSDatagramTypePvmS1_ + 1071
11 SkyLight 0x00007fff24d13018 CGSSnarfAndDispatchDatagrams + 716
12 SkyLight 0x00007fff24fb2e46 SLSGetNextEventRecordInternal + 278
13 SkyLight 0x00007fff24e08cf5 SLEventCreateNextEvent + 9
14 HIToolbox 0x00007fff287b7a4f _ZL38PullEventsFromWindowServerOnConnectionjhP17__CFMachPortBoost + 45
15 HIToolbox 0x00007fff287c3faf FlushSpecificEventsFromQueue + 52
16 AppKit 0x00007fff22d6b6e4 +[NSEvent _discardTrackingAndCursorEventsIfNeeded] + 459
17 AppKit 0x00007fff22d6a442 -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 81
18 libgdk-3.0.dylib 0x00000001026e23ea poll_func + 186
19 libglib-2.0.0.dylib 0x000000012d6ca361 g_main_context_iterate + 433
20 libglib-2.0.0.dylib 0x000000012d6ca466 g_main_context_iteration + 102
21 libgio-2.0.0.dylib 0x000000012d85ef5d g_application_run + 541
22 ??? 0x00000001027fd0d9 0x0 + 4336898265
23 ??? 0x000000011576c6c0 0x0 + 4655072960
24 ??? 0x000000011576c705 0x0 + 4655073029
25 ??? 0x0000000115763849 0x0 + 4655036489
26 libjvm.dylib 0x0000000106bb22fb _ZN9JavaCalls11call_helperEP9JavaValueRK12methodHandleP17JavaCallArgumentsP6Thread + 637
27 libjvm.dylib 0x0000000106bf4335 _ZL17jni_invoke_staticP7JNIEnv_P9JavaValueP8_jobject11JNICallTypeP10_jmethodIDP18JNI_ArgumentPusherP6Thread + 290
28 libjvm.dylib 0x0000000106bf710e jni_CallStaticVoidMethod + 383
29 java 0x00000001022a5bac JavaMain + 2732
30 libsystem_pthread.dylib 0x00007fff2046d8fc _pthread_start + 224
31 libsystem_pthread.dylib 0x00007fff20469443 thread_start + 15
)
libc++abi: terminating with uncaught exception of type NSException
From the very little I know about GUI development on MacOS, this appears to be a problem further down the pipeline when it attempts to actually display something.
Perhaps you can try my diff on Linux and see if it works better?
I believe the value returned by the DefaultObjectReferenceManager is intended to just be an opaque reference to the object value, not a new or better pointer to the object in question. In this case, the resulting value is a bogus pointer starting with "0xCAFEBABE" bytes, leading to the peculiar RAX I mentioned above.
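To illustrate the distinction, here is a toy handle table in plain Java. It is a hypothetical model written for this comment, not jnr-ffi's DefaultObjectReferenceManager, and the tagging scheme is an assumption chosen only to mirror the 0xCAFEBABE prefix seen in the crash log. The point is that the key returned by add() is a lookup token that keeps the object reachable. It is not an address that native code can jump to.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of an opaque reference manager (hypothetical, not jnr-ffi code). */
final class OpaqueHandleTable {
    // Assumed tag in the upper 32 bits so a handle never resembles a real address.
    private static final long TAG = 0xCAFEBABEL << 32;

    private final Map<Long, Object> live = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    /** Stores a strong reference to obj and returns an opaque key for it. */
    long add(Object obj) {
        long handle = TAG | (nextId.getAndIncrement() & 0xFFFFFFFFL);
        live.put(handle, obj);
        return handle;
    }

    /** Looks the object up again; returns null for unknown keys. */
    Object get(long handle) {
        return live.get(handle);
    }

    /** Drops the strong reference; returns true if the key was present. */
    boolean remove(long handle) {
        return live.remove(handle) != null;
    }

    public static void main(String[] args) {
        OpaqueHandleTable refs = new OpaqueHandleTable();
        Runnable callback = () -> System.out.println("activated");

        long key = refs.add(callback);
        ((Runnable) refs.get(key)).run();           // prints "activated"
        System.out.println(Long.toHexString(key));  // prints "cafebabe00000001"
        refs.remove(key);
    }
}
```

In this model the callback object itself is what would be handed to the native binding, while the key is kept only so the entry can be removed later, which matches the shape of the diff above.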
So I tried the diff here on Linux and it does crash differently now: hs_err_pid5098.log. Unfortunately, I'm still pretty inexperienced in both GTK and C/C++, so I couldn't figure out much from the logs. I do believe it has something to do with how GTK and GObject-system work internally since the normal way of creating JNR callbacks works fine in simpler use-cases.
I went through the official hello-world example of gtk3 and found that I missed implementing the G_APPLICATION macro, which is possibly affecting some runtime behaviour that might cause the issue:
app = gtk_application_new ("org.gtk.example", G_APPLICATION_FLAGS_NONE);
g_signal_connect (app, "activate", G_CALLBACK (activate), NULL);
status = g_application_run (G_APPLICATION (app), argc, argv);
g_object_unref (app);
From the docs:
PREFIX_OBJECT (obj), which returns a pointer of type PrefixObject. This macro is used
to enforce static type safety by doing explicit casts wherever needed. It also enforces
dynamic type safety by doing runtime checks.
I'll look into that soon and post an update.
I used the preprocessor output from gcc and added the necessary functions in LibGtk3. This apparently changes nothing, and the program crashes just like before. I've pushed the latest changes to the demo repo.
// Before preprocessing
int status = g_application_run(G_APPLICATION(app), argc, argv);
// After preprocessing
int status = g_application_run(((((GApplication*) g_type_check_instance_cast ((GTypeInstance*) ((app)), ((g_application_get_type ())) )))), argc, argv);
public interface LibGtk3 {
    // ...
    @u_int64_t long g_application_get_type();
    Pointer g_type_check_instance_cast(Pointer inst, @u_int64_t long type);
}
// Inside main method
lib.g_application_run(
        lib.g_type_check_instance_cast(application, lib.g_application_get_type()), 0, null);
Now I'm pretty much clueless. The only thing I've not implemented is the pointer type-casting done by the macros, as I'm using the normal Pointer class as input/return type in my interface. But I'm not sure if that's supposed to make any difference since these are mostly opaque pointers.
I hate to chime in with this, but WFM. If I apply @headius's diff, gtk3:run works for me on:
openjdk version "16.0.2" 2021-07-20
OpenJDK Runtime Environment Temurin-16.0.2+7 (build 16.0.2+7)
OpenJDK 64-Bit Server VM Temurin-16.0.2+7 (build 16.0.2+7, mixed mode, sharing)
I get a Click me button in a frame popping up on my screen.
I also got this to work with GraalVM CE 21.2 (openjdk version "11.0.12" 2021-07-20). I am on Fedora Core 34.
@praj-foss Can you do two things: 1) update to latest version of graalvm. Let's just hope there is a bug in graal that was fixed. 2) Install openjdk and verify it fails on that VM.
@praj-foss Since I had not seen that 21.3 is out, I will get that and see if it also works.
I have pushed a branch with my change, which has been confirmed on @enebo's Fedora system and my MacOS system (the latter works after passing -XstartOnFirstThread).
https://github.com/headius/jnr-demo/tree/patched
At this point I don't see any bug on the jnr-ffi side. @praj-foss let us know if you are still unable to run this and we'll have a look at your latest error.
I also downloaded graal ce 21.1.0 and it works with @headius patch.
@headius @enebo I downloaded GraalVM 21.3 (JDK 11) and Temurin JDK 16.0.2 and tried to run the patched repo, but it's still crashing the same way: hs_err_pid6764.log. I even tried the -XstartOnFirstThread arg. It's pretty clear now that something's wrong with my setup, but I don't have a clue where the problem might be. So I guess it's okay to close the issue now. I'll try a system upgrade, and maybe run it on gtk4 and let you know how that goes. Can you suggest what else might fix this?
I tried running the app on two different machines: one with ubuntu 21.10 with openjdk 17, where it crashed similarly, and another with opensuse leap 15.2 with openjdk 11 and a slightly older gtk3 release, where it ran perfectly. I'm assuming something breaks on the new gtk3 release. So I'll close this issue for now. Thanks, everyone!
@praj-foss Thanks for following up and figuring this out! Please let us know if you file an issue with the GTK folks because I'd like to know that we're not doing anything wrong. I assume they will have better luck investigating why it crashes at that particular point.
@headius Sure! I'd like to do some more research on it, though I'm not a C/C++ dev at all. Can you tell me how to debug the JNR/native calls? I came across this article which described how to use gdb to debug JNI calls. But when I try to use it with my demo I only get warnings like this:
warning: Could not load shared library symbols for /tmp/jffi8423976058172872553.so
So what's the proper way to debug JNR here?
For that we would need to build a jffi binary with debug symbols. I'm not sure if the build is set up for that but can look into it this week.
I will say that your crasher that fails inside jffi should probably still be treated as a bug. May be something about your platforms that jffi is not handling correctly.
I believe this diff followed by running ant should get you a jffi binary that has debug symbols:
diff --git a/jni/GNUmakefile b/jni/GNUmakefile
index cfe570a..4a8a061 100755
--- a/jni/GNUmakefile
+++ b/jni/GNUmakefile
@@ -61,7 +61,7 @@ LIBNAME = jffi
# Compiler/linker flags from:
# http://weblogs.java.net/blog/kellyohair/archive/2006/01/compilation_of_1.html
JFLAGS = -fno-omit-frame-pointer -fno-strict-aliasing -DNDEBUG
-OFLAGS = -O2 $(JFLAGS)
+OFLAGS = -Og -g $(JFLAGS)
# MacOS headers aren't completely warning free, so turn them off
WERROR = -Werror
Could you open a new issue for the crash within JFFI itself? I believe this issue has been resolved by fixing the client code, but this other crasher is a new mystery.
@headius I've reopened this issue in JFFI. Check out: jnr/jffi#118