netty / netty-tcnative

A fork of Apache Tomcat Native, based on finagle-native


Memory leak on SslProvider instantiation with netty-tcnative-boringssl-static

notdryft opened this issue · comments

We observed a memory leak that appears when switching netty-tcnative-boringssl-static from version 2.0.56.Final to 2.0.57.Final (or later; we tested up to 2.0.60.Final).

The scenario is as follows:

  • Create an SslContextBuilder once:
SslContextBuilder sslContextBuilder = SslContextBuilder.forClient().sslProvider(SslProvider.OPENSSL_REFCNT);
sslContextBuilder.ciphers(null, IdentityCipherSuiteFilter.INSTANCE_DEFAULTING_TO_SUPPORTED_CIPHERS);

For us the issue happened with both SslProvider.OPENSSL_REFCNT and SslProvider.OPENSSL.

Then, inside a loop:

  • SslContext sslContext = sslContextBuilder.build()
  • ReferenceCountUtil.release(sslContext)

Do that millions of times and you can see the following happen; the left part of the graph is 2.0.56.Final and, after the memory drop, 2.0.57.Final:

[Screenshot 2023-05-04 at 15:12:10: memory over time, 2.0.56.Final vs 2.0.57.Final]

In our test case, -Xms and -Xmx were both set to 1G, so memory shouldn't be expected to grow much beyond that, as everything is recycled right after instantiation.

A sample project that reproduces the issue is attached: test-netty-tcnative-leak.zip

Command to run the project: mvn compile exec:java -Dexec.mainClass=Main
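
For reference, here is a minimal sketch of what the reproduction boils down to (the iteration count is arbitrary and the class name is just the one passed to exec:java; the attached project is the authoritative reproducer):

```java
import io.netty.handler.ssl.IdentityCipherSuiteFilter;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;
import io.netty.util.ReferenceCountUtil;

public class Main {

    public static void main(String[] args) throws Exception {
        // Build the SslContextBuilder once, as described above.
        SslContextBuilder sslContextBuilder = SslContextBuilder.forClient()
            .sslProvider(SslProvider.OPENSSL_REFCNT)
            .ciphers(null, IdentityCipherSuiteFilter.INSTANCE_DEFAULTING_TO_SUPPORTED_CIPHERS);

        // Then build and immediately release SslContexts in a tight loop.
        for (int i = 0; i < 10_000_000; i++) {
            SslContext sslContext = sslContextBuilder.build();
            ReferenceCountUtil.release(sslContext);
        }
    }
}
```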

Forgot to mention something. When trying to pin down the issue, we used -Dio.netty.leakDetection.level=paranoid but it printed nothing!
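
In hindsight that is probably expected: as far as I understand, the leak detector only reports ReferenceCounted objects that get garbage-collected without having been released, and here every context is released explicitly while the leaked memory is native. For completeness, a sketch of the programmatic equivalent of that flag, assuming I'm reading the ResourceLeakDetector API right:

```java
import io.netty.util.ResourceLeakDetector;

public class LeakDetectionConfig {
    public static void main(String[] args) {
        // Same effect as -Dio.netty.leakDetection.level=paranoid: track every
        // ReferenceCounted object and report it if it is GC'd without release().
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    }
}
```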

Caused by #759?

@normanmaurer

or later, we tested up to 2.0.60.Final

Included. 2.0.60 is the version we currently have in production and where we originally detected the issue. We were able to figure out the regression was introduced in 2.0.57.

Context: for Gatling's most common use case, we want distinct SSLContexts per virtual user so that we generate the proper number of handshakes and SSLSessions. The leak probably goes unnoticed in the more standard case of one static SSLContext per client application.
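
To illustrate the difference, a rough sketch (not Gatling's actual code; numUsers and the per-user logic are made up):

```java
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;
import io.netty.util.ReferenceCountUtil;

public class PerUserSslContexts {

    public static void main(String[] args) throws Exception {
        SslContextBuilder builder = SslContextBuilder.forClient()
            .sslProvider(SslProvider.OPENSSL_REFCNT);

        int numUsers = 1_000; // hypothetical number of virtual users

        for (int user = 0; user < numUsers; user++) {
            // One SslContext per virtual user, so each user performs its own
            // handshake and gets its own SSLSession instead of reusing a shared one.
            SslContext perUserContext = builder.build();
            try {
                // ... run this virtual user's scenario with perUserContext ...
            } finally {
                // Release when the user is done; with OPENSSL_REFCNT this is what
                // should free the native resources backing the context.
                ReferenceCountUtil.release(perUserContext);
            }
        }
    }
}
```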

Got it... will have a look

Thanks a lot! 🙇

Should be fixed by #790 🤦

Can still observe a leak after the fix:

[Screenshot 2023-05-05 at 12:44:52: memory still growing after the fix]

Not sure if it's because I compiled the fix branch badly or because there is more to the actual fix... will continue digging.

Actually, I had clashing library loading paths. Compiling again and will confirm soon!

Confirmed it was a path clash; everything looks good over 10M iterations:

[Screenshot 2023-05-05 at 13:19:06: memory stable over 10M iterations]

calloc allocations also showed up in async-profiler's nativemem mode (built from the malloc branch in their repo) before the fix. They're all gone now 🎉

Many thanks for double checking @notdryft