netty / netty

Netty project - an event-driven asynchronous network application framework

Home Page: http://netty.io

Thread blocked on InetAddress.getByName because of Netty

gaeljw opened this issue · comments

Expected behavior

Application starting fine.

Actual behavior

The application hangs during startup because of a deadlock between:

  • a thread that tries to start an HTTP server
  • a thread that prepares to make HTTP calls as a client (calling other services, not acting as a server) using Netty

The first thread is calling

new InetSocketAddress("0.0.0.0", somePort)

The second, as far as I understand, is running:

static final InetAddress INET6_ANY = InetAddress.getByName("::")
static final InetAddress INET_ANY = InetAddress.getByName("0.0.0.0")

Here are the two threads dumps:

The first (HTTP server) blocked:

"ZScheduler-Worker-6" #30 daemon prio=5 os_prio=0 cpu=276.23ms elapsed=7541.20s tid=0x00007f7bd54d4880 nid=0x55 waiting for monitor entry  [0x00007f7b663f4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at jdk.internal.loader.NativeLibraries.loadLibrary(java.base@17.0.10/Unknown Source)
    - waiting to lock <0x00000000a02e9a90> (a java.util.HashSet)
    at jdk.internal.loader.NativeLibraries.loadLibrary(java.base@17.0.10/Unknown Source)
    at jdk.internal.loader.NativeLibraries.findFromPaths(java.base@17.0.10/Unknown Source)
    at jdk.internal.loader.NativeLibraries.loadLibrary(java.base@17.0.10/Unknown Source)
    at jdk.internal.loader.NativeLibraries.loadLibrary(java.base@17.0.10/Unknown Source)
    at jdk.internal.loader.BootLoader.loadLibrary(java.base@17.0.10/Unknown Source)
    at java.net.InetAddress.<clinit>(java.base@17.0.10/Unknown Source)
    at java.net.InetSocketAddress.<init>(java.base@17.0.10/Unknown Source)
    at io.prometheus.metrics.exporter.httpserver.HTTPServer$Builder.makeInetSocketAddress(HTTPServer.java:209)
    at io.prometheus.metrics.exporter.httpserver.HTTPServer$Builder.buildAndStart(HTTPServer.java:197)
    at io.opentelemetry.exporter.prometheus.PrometheusHttpServer.<init>(PrometheusHttpServer.java:71)
    at io.opentelemetry.exporter.prometheus.PrometheusHttpServerBuilder.build(PrometheusHttpServerBuilder.java:68)
    at com.myapp.metrics.sdk.PrometheusMetricReader$.$anonfun$startReader$2(PrometheusMetricReader.scala:21)
    at com.myapp.metrics.sdk.PrometheusMetricReader$$$Lambda$1109/0x00007f7b7840e078.apply(Unknown Source)
    at zio.ZIOCompanionVersionSpecific.$anonfun$attempt$1(ZIOCompanionVersionSpecific.scala:100)
    at zio.ZIOCompanionVersionSpecific$$Lambda$430/0x00007f7b782ba000.apply(Unknown Source)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:904)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.evaluateEffect(FiberRuntime.scala:381)
    at zio.internal.FiberRuntime.evaluateMessageWhileSuspended(FiberRuntime.scala:504)
    at zio.internal.FiberRuntime.drainQueueOnCurrentThread(FiberRuntime.scala:220)
    at zio.internal.FiberRuntime.run(FiberRuntime.scala:139)
    at zio.internal.ZScheduler$$anon$4.run(ZScheduler.scala:478)

The second (Netty client), which holds the lock the first thread is waiting for:

"ZScheduler-Worker-20" #44 daemon prio=5 os_prio=0 cpu=191.27ms elapsed=7541.20s tid=0x00007f7bd54e2e70 nid=0x63 in Object.wait()  [0x00007f7b655e3000]
   java.lang.Thread.State: RUNNABLE
    at io.netty.channel.epoll.LinuxSocket.unsafeInetAddrByName(LinuxSocket.java:364)
    - waiting on the Class initialization monitor for java.net.InetAddress
    at io.netty.channel.epoll.LinuxSocket.<clinit>(LinuxSocket.java:42)
    at jdk.internal.loader.NativeLibraries.load(java.base@17.0.10/Native Method)
    at jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(java.base@17.0.10/Unknown Source)
    at jdk.internal.loader.NativeLibraries.loadLibrary(java.base@17.0.10/Unknown Source)
    - locked <0x00000000a02e9a90> (a java.util.HashSet)
    at jdk.internal.loader.NativeLibraries.loadLibrary(java.base@17.0.10/Unknown Source)
    at java.lang.ClassLoader.loadLibrary(java.base@17.0.10/Unknown Source)
    at java.lang.Runtime.load0(java.base@17.0.10/Unknown Source)
    at java.lang.System.load(java.base@17.0.10/Unknown Source)
    at io.netty.util.internal.NativeLibraryUtil.loadLibrary(NativeLibraryUtil.java:36)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@17.0.10/Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@17.0.10/Unknown Source)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@17.0.10/Unknown Source)
    at java.lang.reflect.Method.invoke(java.base@17.0.10/Unknown Source)
    at io.netty.util.internal.NativeLibraryLoader$1.run(NativeLibraryLoader.java:430)
    at java.security.AccessController.executePrivileged(java.base@17.0.10/Unknown Source)
    at java.security.AccessController.doPrivileged(java.base@17.0.10/Unknown Source)
    at io.netty.util.internal.NativeLibraryLoader.loadLibraryByHelper(NativeLibraryLoader.java:422)
    at io.netty.util.internal.NativeLibraryLoader.loadLibrary(NativeLibraryLoader.java:388)
    at io.netty.util.internal.NativeLibraryLoader.load(NativeLibraryLoader.java:218)
    at io.netty.channel.epoll.Native.loadNativeLibrary(Native.java:323)
    at io.netty.channel.epoll.Native.<clinit>(Native.java:85)
    at io.netty.channel.epoll.Epoll.<clinit>(Epoll.java:40)
    at zio.http.netty.ChannelFactories$Client$.$anonfun$fromConfig$4(ChannelFactories.scala:83)
    at zio.http.netty.ChannelFactories$Client$$$Lambda$966/0x00007f7b783d3e60.apply(Unknown Source)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
    at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
    at zio.internal.FiberRuntime.evaluateEffect(FiberRuntime.scala:381)
    at zio.internal.FiberRuntime.evaluateMessageWhileSuspended(FiberRuntime.scala:504)
    at zio.internal.FiberRuntime.drainQueueOnCurrentThread(FiberRuntime.scala:220)
    at zio.internal.FiberRuntime.run(FiberRuntime.scala:139)
    at zio.internal.ZScheduler$$anon$4.run(ZScheduler.scala:478)

Complete thread dump available at https://gist.github.com/gaeljw/72c072262952c9c55a355ad689d3f3b5#file-thread_dump_netty_13871
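
Reading the two dumps together, each thread appears to hold the lock the other one needs: the first is inside `InetAddress.<clinit>` (so it owns the class initialization lock) and waits for the `HashSet` monitor in `NativeLibraries`, while the second owns that `HashSet` monitor and waits for the `InetAddress` class initialization. Below is a minimal sketch of that lock-ordering pattern. It does not use the real JDK or Netty locks; `classInitLock`, `nativeLibLock`, and `LockOrderDemo` are hypothetical stand-ins, and `tryLock` with a timeout replaces the blocking acquisition so the demo terminates instead of hanging.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDemo {

    // Hypothetical stand-ins for the two monitors seen in the dumps:
    // the InetAddress class initialization lock and the NativeLibraries HashSet.
    static final ReentrantLock classInitLock = new ReentrantLock();
    static final ReentrantLock nativeLibLock = new ReentrantLock();

    /** Returns how many of the two threads failed to obtain their second lock. */
    static int run() throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(2);
        AtomicInteger stuck = new AtomicInteger();

        // "server" mimics the InetSocketAddress path: class init first, then native library lock.
        Thread server = new Thread(
                () -> acquireInOrder(classInitLock, nativeLibLock, barrier, stuck), "server");
        // "client" mimics the Epoll path: native library lock first, then class init.
        Thread client = new Thread(
                () -> acquireInOrder(nativeLibLock, classInitLock, barrier, stuck), "client");

        server.start();
        client.start();
        server.join();
        client.join();
        return stuck.get();
    }

    static void acquireInOrder(ReentrantLock first, ReentrantLock second,
                               CyclicBarrier barrier, AtomicInteger stuck) {
        first.lock();
        try {
            barrier.await(); // both threads now hold their first lock
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                stuck.incrementAndGet(); // with a plain lock() this thread would block forever
            }
            barrier.await(); // keep holding the first lock until both attempts are finished
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("threads stuck: " + run());
    }
}
```

With the real locks there is no timeout, which would explain why the application never gets past startup once the two threads interleave this way.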

Why is the Netty code locking forever?

Steps to reproduce

"Just" start the application (no specific action is required; it blocks at startup).

Minimal yet complete reproducer code (or URL to code)

The tricky part is that, since it depends on thread timing, it's not always reproducible.

Here's the minimized source of our app where we observe the behavior: https://github.com/gaeljw/netty13871.

# Build the app
podman build . -t netty13871
# Run it
podman run --rm netty13871

Netty version

4.1.89.Final + Incubator 0.0.19.Final

JVM version (e.g. java -version)

17.0.10 (Eclipse temurin container image)

OS version (e.g. uname -a)

Linux my-app 5.14.0-200.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 21 16:13:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/os-release 
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Added a minimized reproducer to hopefully help: https://github.com/gaeljw/netty13871

Will check... That said, this looks more like a "JDK bug", as we are just calling a JDK method.

Thanks @normanmaurer . Now that you say it, I'm not sure why I thought it could be caused by Netty 🤐

If you somehow have more knowledge than me on how native libraries work and/or the internal library being loaded, I'd appreciate any info you can give me :)

@gaeljw actually .... maybe it is an issue in how we structured the code... let me think about it

@gaeljw I think this might explain it #13879

Good to know! Thanks a lot @normanmaurer 👏
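
For anyone hitting this before picking up a fixed Netty version, one possible application-side workaround is to force `java.net.InetAddress` to finish its static initialization on a single thread before any concurrent startup work begins, so that no later thread can be caught halfway through `InetAddress.<clinit>`. This is only a sketch based on the thread dumps above, not what #13879 does; `EagerInetAddressInit` and `warmUp` are hypothetical names.

```java
import java.net.InetAddress;

public class EagerInetAddressInit {

    /**
     * Calls a static method on InetAddress so that its class initialization
     * (including the native library load seen in the first thread dump)
     * completes on the calling thread.
     */
    static InetAddress warmUp() {
        return InetAddress.getLoopbackAddress();
    }

    public static void main(String[] args) {
        // Run this first, while the application is still single-threaded.
        InetAddress loopback = warmUp();
        System.out.println("InetAddress initialized, loopback = " + loopback);

        // Assumption: only after this point would the application start the
        // threads/fibers that build the HTTP server and the Netty client.
    }
}
```

The call has to happen before the first thread touches Netty's `Epoll` class; otherwise the same interleaving is still possible.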