elastic / apm-agent-java

Home Page: https://www.elastic.co/guide/en/apm/agent/java/current/index.html

Workaround for JVM bug causing crashes on exception access

jackshirazi opened this issue

Describe the bug

We have a number of different reports where the agent accessing exceptions has caused JVM crashes on JVM 17+ (but only after many hours of JVM load). This has happened in scenarios where the agent is not loading any native code (i.e. inferred spans is left disabled, which is the default), including simple situations where the application raises the exception and the advice synchronously tries to read it (the first touch by the agent), at which point the JVM crashes with SIGSEGV because the native side of the exception has been nulled.

Given that the agent doesn't have any native code loaded, and the bytecode transformations are standard ones that many agents do using Byte Buddy, we've looked for what our agent does differently from other agents (since we haven't heard of similar crashes from other agents). There is one significant difference: while most agents inline their advice code (effectively transforming a method to include the advice code), the Elastic agent uses Byte Buddy's non-inlined, invokedynamic-based advice, which inserts an invokedynamic bytecode that dynamically dispatches a call out to the advice code.
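
To make the distinction concrete, here is a minimal sketch that uses only the JDK's java.lang.invoke API. It is not Byte Buddy and not the agent's actual code: the class AdviceDispatchSketch and the method onThrow are hypothetical names, and the second path only approximates what an invokedynamic call site does once it has been linked.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class AdviceDispatchSketch {

    // Hypothetical advice method: reads the exception, as error capture would.
    public static String onThrow(Throwable t) {
        return t.getClass().getSimpleName() + ": " + t.getMessage();
    }

    // "Inlined" style: the advice body is copied into the instrumented method,
    // so the exception is read directly in that method's own bytecode.
    static String inlined(Throwable t) {
        return t.getClass().getSimpleName() + ": " + t.getMessage();
    }

    // "Non-inlined" style: the instrumented method only holds a call site,
    // and the exception is handed to advice code that lives elsewhere.
    static String dispatched(Throwable t) throws Throwable {
        MethodHandle target = MethodHandles.lookup().findStatic(
                AdviceDispatchSketch.class, "onThrow",
                MethodType.methodType(String.class, Throwable.class));
        CallSite site = new ConstantCallSite(target);
        return (String) site.dynamicInvoker().invokeExact(t);
    }

    public static void main(String[] args) throws Throwable {
        Throwable t = new IllegalStateException("boom");
        System.out.println(inlined(t));    // IllegalStateException: boom
        System.out.println(dispatched(t)); // IllegalStateException: boom
    }
}
```

Both paths return the same result; the only difference is whether the read of the exception happens in the method's own bytecode or behind a call site.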

We hypothesize that between JVM 11 and 17, G1 processing changed to be more aggressive about nulling the native-side data of exception objects, presumably based on escape analysis (or similar) determining that the exception has gone out of scope of the application. In the case of inlined code that accesses the exception, the escape analysis would determine that the exception is still in scope. We hypothesize, however, that when the bytecode has been retransformed to add an invokedynamic bytecode that does a call site lookup, the escape analysis fails to identify that the exception object now has a longer lifetime, and continues to inform the GC that the exception can be nulled. In that scenario there is a race condition between the GC and the agent. In most cases the agent quickly accesses the exception to get the information for error reporting and adds that information to traces, after which the exception is indeed out of scope of the application (and the agent). But every once in a while a GC is triggered just before the agent accesses the exception: the GC erroneously thinks the exception is out of scope and nulls it, then the agent accesses the still-live exception and the JVM crashes with SIGSEGV.

If this hypothesis is correct, we could work around the JVM bug by inlining the exception processing.

Steps to reproduce

Not reproducible in test scenarios; all crash reports have come after multiple hours (often days) of load in production systems.

Expected behavior

JVM doesn't crash

Hi @jackshirazi

We are currently also struggling with JVM crashes related to JDK 17 and Elastic APM.

We have now switched to Eclipse Temurin JDK 17.0.10+7 and are using the Elastic APM Java agent 1.45. The applications run in Docker containers orchestrated with Docker Compose.

In #3257 you already recommended switching to the latest JDK 17. When we rolled out a version with the new JDK today, there was an instance with a JVM crash.

We noticed that this problem mostly occurs in a scheduled task.

Here is an excerpt from today's crash report:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000000, pid=7, tid=151
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.10+7 (17.0.10+7) (build 17.0.10+7)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.10+7 (17.0.10+7, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xbafbc5]  Method::checked_resolve_jmethod_id(_jmethodID*)+0x45
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /usr/local/tomcat/core.7)
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

---------------  T H R E A D  ---------------

Current thread (0x00007ff654ae27e0):  JavaThread "elastic-apm-sampling-profiler" daemon [_thread_in_vm, id=151, stack(0x00007ff5eaad6000,0x00007ff5eab36000)]

Stack: [0x00007ff5eaad6000,0x00007ff5eab36000],  sp=0x00007ff5eab33b00,  free space=374k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xbafbc5]  Method::checked_resolve_jmethod_id(_jmethodID*)+0x45
V  [libjvm.so+0x9c77d6]  jvmti_GetMethodDeclaringClass+0xd6
C  [libasyncProfiler-linux-x64-4813494d137e1631bba301d5acab6e7b-c6d96bf140e0562f0ce2765a99ced1c2.so+0xebac]  Recording::writeStackTraces(Buffer*)+0x29c
C  [libasyncProfiler-linux-x64-4813494d137e1631bba301d5acab6e7b-c6d96bf140e0562f0ce2765a99ced1c2.so+0x10e57]  Recording::writeCheckpoint(Buffer*)+0x287
C  [libasyncProfiler-linux-x64-4813494d137e1631bba301d5acab6e7b-c6d96bf140e0562f0ce2765a99ced1c2.so+0x11d7f]  Recording::~Recording()+0x60f
C  [libasyncProfiler-linux-x64-4813494d137e1631bba301d5acab6e7b-c6d96bf140e0562f0ce2765a99ced1c2.so+0xb273]  FlightRecorder::stop()+0x23
C  [libasyncProfiler-linux-x64-4813494d137e1631bba301d5acab6e7b-c6d96bf140e0562f0ce2765a99ced1c2.so+0x20856]  Profiler::stop()+0x116
C  [libasyncProfiler-linux-x64-4813494d137e1631bba301d5acab6e7b-c6d96bf140e0562f0ce2765a99ced1c2.so+0x22135]  Profiler::runInternal(Arguments&, std::ostream&)+0x1a5
C  [libasyncProfiler-linux-x64-4813494d137e1631bba301d5acab6e7b-c6d96bf140e0562f0ce2765a99ced1c2.so+0x18fc9]  Java_one_profiler_AsyncProfiler_execute0+0x559
j  co.elastic.apm.agent.profiler.asyncprofiler.AsyncProfiler.execute0(Ljava/lang/String;)Ljava/lang/String;+0
j  co.elastic.apm.agent.profiler.asyncprofiler.AsyncProfiler.execute(Ljava/lang/String;)Ljava/lang/String;+2
j  co.elastic.apm.agent.profiler.SamplingProfiler.profile(Lco/elastic/apm/agent/tracer/configuration/TimeDuration;)V+74
j  co.elastic.apm.agent.profiler.SamplingProfiler.run()V+201
j  java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;+4 java.base@17.0.10
J 25854 c1 java.util.concurrent.FutureTask.run()V java.base@17.0.10 (176 bytes) @ 0x00007ff7276ddb5c [0x00007ff7276dd020+0x0000000000000b3c]
j  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V+28 java.base@17.0.10
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+92 java.base@17.0.10
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 java.base@17.0.10
j  co.elastic.apm.agent.util.ExecutorUtils$2.run()V+41
j  java.lang.Thread.run()V+11 java.base@17.0.10
v  ~StubRoutines::call_stub
V  [libjvm.so+0x829895]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x315
V  [libjvm.so+0x82b08b]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x1cb
V  [libjvm.so+0x8f78e3]  thread_entry(JavaThread*, JavaThread*)+0xa3
V  [libjvm.so+0xe5fa16]  JavaThread::thread_main_inner()+0x196
V  [libjvm.so+0xe633b8]  Thread::call_run()+0xa8
V  [libjvm.so+0xc24791]  thread_native_entry(Thread*)+0xe1

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  co.elastic.apm.agent.profiler.asyncprofiler.AsyncProfiler.execute0(Ljava/lang/String;)Ljava/lang/String;+0
j  co.elastic.apm.agent.profiler.asyncprofiler.AsyncProfiler.execute(Ljava/lang/String;)Ljava/lang/String;+2
j  co.elastic.apm.agent.profiler.SamplingProfiler.profile(Lco/elastic/apm/agent/tracer/configuration/TimeDuration;)V+74
j  co.elastic.apm.agent.profiler.SamplingProfiler.run()V+201
j  java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;+4 java.base@17.0.10
J 25854 c1 java.util.concurrent.FutureTask.run()V java.base@17.0.10 (176 bytes) @ 0x00007ff7276ddb5c [0x00007ff7276dd020+0x0000000000000b3c]
j  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V+28 java.base@17.0.10
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+92 java.base@17.0.10
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 java.base@17.0.10
j  co.elastic.apm.agent.util.ExecutorUtils$2.run()V+41
j  java.lang.Thread.run()V+11 java.base@17.0.10
v  ~StubRoutines::call_stub

If you need the full report or additional excerpts, please let me know.

Thanks, but that's a different issue: you need to stop using the inferred spans option.

Oracle have considered this hypothesis and can't see how it would occur, so I'm closing this specific issue now. If anyone has crashes NOT related to the asyncprofiler, please open a discussion thread on our discussion forum. Any crashes related to the asyncprofiler should be resolved by going back to the default value of false for profiling_inferred_spans_enabled. This option is experimental, and we have no immediate plan to upgrade the asyncprofiler to the latest version, which is the likely solution for crashes related to the asyncprofiler.
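
As a rough aid for telling the two kinds of crash apart, the sketch below scans the native frames of an hs_err excerpt for the async-profiler library. It assumes that a "C" frame mentioning libasyncProfiler, as in the report above, is a good enough signal. The class CrashReportTriage is a hypothetical helper, not part of the agent, and the library name in the sample frames is shortened for readability.

```java
import java.util.List;

class CrashReportTriage {

    // Returns true if any native (C) frame in the hs_err excerpt
    // comes from the async-profiler shared library.
    static boolean involvesAsyncProfiler(List<String> hsErrLines) {
        for (String line : hsErrLines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("C ") && trimmed.contains("libasyncProfiler")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample frames modeled on the report above (library name shortened).
        List<String> profilerCrash = List.of(
            "V  [libjvm.so+0xbafbc5]  Method::checked_resolve_jmethod_id(_jmethodID*)+0x45",
            "C  [libasyncProfiler.so+0xebac]  Recording::writeStackTraces(Buffer*)+0x29c");
        List<String> otherCrash = List.of(
            "V  [libjvm.so+0xbafbc5]  Method::checked_resolve_jmethod_id(_jmethodID*)+0x45");

        System.out.println(involvesAsyncProfiler(profilerCrash)); // true
        System.out.println(involvesAsyncProfiler(otherCrash));    // false
    }
}
```

A true result points at the inferred spans option discussed here; a false result does not prove the profiler is uninvolved, so the full report is still worth reading.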

@jackshirazi thx for quick reply! Sorry that I commented on the wrong issue