elastic / apm-agent-java

Home Page: https://www.elastic.co/guide/en/apm/agent/java/current/index.html

SIGSEGV caused by ElasticApmTracer.captureException

kelunik opened this issue · comments

SIGSEGV caused by ElasticApmTracer.captureException:

V  [libjvm.so+0xa9808d]  LinkResolver::resolve_invoke(CallInfo&, Handle, constantPoolHandle const&, int, Bytecodes::Code, JavaThread*)+0x20d
V  [libjvm.so+0x82436f]  InterpreterRuntime::resolve_invoke(JavaThread*, Bytecodes::Code)+0x15f
V  [libjvm.so+0x8248a7]  InterpreterRuntime::resolve_from_cache(JavaThread*, Bytecodes::Code)+0x37
j  co.elastic.apm.agent.impl.ElasticApmTracer.captureException(JLjava/lang/Throwable;Lco/elastic/apm/agent/impl/transaction/ElasticContext;Ljava/lang/ClassLoader;)Lco/elastic/apm/agent/impl/error/ErrorCapture;+43
j  co.elastic.apm.agent.impl.ElasticApmTracer.captureAndReportException(JLjava/lang/Throwable;Lco/elastic/apm/agent/impl/transaction/ElasticContext;)Ljava/lang/String;+9
j  co.elastic.apm.agent.impl.transaction.AbstractSpan.captureExceptionAndGetErrorId(JLjava/lang/Throwable;)Ljava/lang/String;+16
j  co.elastic.apm.agent.impl.transaction.AbstractSpan.captureException(Ljava/lang/Throwable;)Lco/elastic/apm/agent/impl/transaction/AbstractSpan;+16
j  co.elastic.apm.agent.impl.transaction.AbstractSpan.captureException(Ljava/lang/Throwable;)Lco/elastic/apm/agent/tracer/AbstractSpan;+2
j  co.elastic.apm.agent.okhttp.OkHttp3ClientInstrumentation$OkHttpClient3ExecuteAdvice.onAfterExecute(Lokhttp3/Response;Ljava/lang/Throwable;[Ljava/lang/Object;)V+70
J 32507 c2 okhttp3.internal.connection.RealCall.execute()Lokhttp3/Response; (161 bytes) @ 0x00007f565abda8d8 [0x00007f565abd8c20+0x0000000000001cb8]

Steps to reproduce

Exact steps are unknown so far; the crashes happen sporadically. They always seem to be related to OkHttpClient3ExecuteAdvice.

Expected behavior

No crash.

Are you running on Java 17+?

There have been other similar crashes, such as #3521. We have been in contact with Oracle and managed to reproduce the issue with them; it definitely seems to be a JVM bug: https://bugs.openjdk.org/browse/JDK-8322726

In version 1.48.1 we added an undocumented configuration option, -Delastic.apm.safe_exceptions=3, to work around this issue at the cost of losing some observability. With this option set, the agent avoids touching application exceptions and uses placeholder exceptions instead, so the exception counts are still correct but the exception details are lost.

Yes, running Java 17.0.10. I've seen the undocumented option, but wasn't sure what impact it really has. Can this be documented, especially as =3 indicates there are multiple different configuration values?

Thanks for linking the upstream bug!

We are hoping for the JVM bug to be fixed soon. When that happens we'll remove the option from the agent again, which is why we are not planning to officially document or support it. I can give a quick explanation here, though:

The new option safe_exceptions is a bit field for enabling or disabling individual workarounds:

  • “Redacted Exceptions”: Wherever we would have recorded the application exception, we instead create and record a “surrogate” exception. This preserves the error count and at least a similar stacktrace, but the original exception type and message are not recorded.

  • “Map-less propagation”: This is a workaround for crashes that seemed to happen only with exceptions captured in Spring exception handlers. We would put those exceptions into the servlet request attributes and extract them later, which sometimes caused a corrupted heap due to a bad exception pointer. With this workaround, the exception is added to the Transaction immediately instead of being put into the servlet request attributes.

So the configuration option can be used as follows:

  • -Delastic.apm.safe_exceptions=3: “Redacted Exceptions” and “Map-less propagation” are both enabled

  • -Delastic.apm.safe_exceptions=2: Only “Map-less propagation” is enabled

  • -Delastic.apm.safe_exceptions=1: Only “Redacted Exceptions” is enabled

  • -Delastic.apm.safe_exceptions=0: None of the workarounds are enabled (default)
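To make the bit-flag mapping above concrete, here is a minimal Java sketch of how such a value could be decoded. It is only an illustration, not the agent's actual implementation: the class name SafeExceptionsFlags, the constant names, and the helper methods are hypothetical. Only the property name and the meaning of the values 0 to 3 come from the explanation above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative decoder only; this is NOT the agent's actual implementation.
class SafeExceptionsFlags {
    // Hypothetical constant names. The bit values follow the list above:
    // 1 = "Redacted Exceptions", 2 = "Map-less propagation".
    static final int REDACTED_EXCEPTIONS = 1; // bit 0
    static final int MAPLESS_PROPAGATION = 2; // bit 1

    // True if the given workaround bit is set in the configured value.
    static boolean isEnabled(int configValue, int flag) {
        return (configValue & flag) != 0;
    }

    // Lists the workarounds that a given value switches on.
    static List<String> enabledWorkarounds(int configValue) {
        List<String> enabled = new ArrayList<>();
        if (isEnabled(configValue, REDACTED_EXCEPTIONS)) {
            enabled.add("Redacted Exceptions");
        }
        if (isEnabled(configValue, MAPLESS_PROPAGATION)) {
            enabled.add("Map-less propagation");
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Integer.getInteger reads a JVM system property (set with -D) and
        // falls back to the default (0) when it is absent or not a number.
        int configured = Integer.getInteger("elastic.apm.safe_exceptions", 0);
        System.out.println("configured value " + configured
                + " -> " + enabledWorkarounds(configured));

        // Walk through all four values from the list above.
        for (int value = 0; value <= 3; value++) {
            System.out.println(value + " -> " + enabledWorkarounds(value));
        }
    }
}
```

Run with, for example, -Delastic.apm.safe_exceptions=2 to see a single workaround reported. Because each workaround occupies its own bit, the value 3 is simply 1 + 2, which is why it enables both.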

To be closed after a JVM release that includes the fix for https://bugs.openjdk.org/browse/JDK-8322726 (we also expect backports to 17 and 21, so for tracking purposes wait for those before closing).

I just discovered a few hs_err files on a dev server, which appeared after migrating from Java 11 to 21.
Could this also be the cause of the following SIGSEGV observed on "OpenJDK 64-Bit Server VM Temurin-21+35 (21+35-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, windows-amd64)"?

The last pc belongs to invokevirtual (printed below).
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j co.elastic.apm.agent.impl.error.ErrorCapture.computeCulprit(Ljava/lang/Throwable;Ljava/util/Collection;)V+1
j co.elastic.apm.agent.impl.error.ErrorCapture.getCulprit()Ljava/lang/StringBuilder;+48
j co.elastic.apm.agent.report.serialize.DslJsonSerializer$Writer.serializeError(Lco/elastic/apm/agent/impl/error/ErrorCapture;)V+61
j co.elastic.apm.agent.report.serialize.DslJsonSerializer$Writer.serializeErrorNdJson(Lco/elastic/apm/agent/impl/error/ErrorCapture;)V+17
J 14222 c1 co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.writeEvent(Lco/elastic/apm/agent/report/ReportingEvent;)V (148 bytes) @ 0x000000000cfaf5f4 [0x000000000cfae1a0+0x0000000000001454]
J 14133 c1 co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.handleIntakeEvent(Lco/elastic/apm/agent/report/ReportingEvent;JZ)V (140 bytes) @ 0x000000000cf82864 [0x000000000cf82540+0x0000000000000324]
j co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.dispatchEvent(Lco/elastic/apm/agent/report/ReportingEvent;JZ)V+100
j co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(Lco/elastic/apm/agent/report/ReportingEvent;JZ)V+110
j co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(Ljava/lang/Object;JZ)V+8
J 20007% c2 com.lmax.disruptor.BatchEventProcessor.processEvents()V (167 bytes) @ 0x00000000145365b0 [0x0000000014536480+0x0000000000000130]
j com.lmax.disruptor.BatchEventProcessor.run()V+37
j co.elastic.apm.agent.util.ExecutorUtils$2.run()V+41
j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@21
j java.lang.Thread.run()V+19 java.base@21
v ~StubRoutines::call_stub 0x0000000012711015

Seems likely. Most of the crashes we have seen were caused by corrupted pointers that should point to exceptions, and your stacktrace indicates that this one is similar. The bug has also been confirmed to be present in all 17+ versions.

The fix is already merged in the OpenJDK main branch and should therefore be part of the upcoming releases.