strimzi / strimzi-kafka-oauth

OAuth2 support for Apache Kafka® to work with many OAuth2 authorization servers

Kafka consumer apps using OAuth frequently experience startup connection problems

jrivers96 opened this issue · comments

Summary
Fifty apps consume from 50 different Kafka topics, and we noticed that about one of the 50 apps gets a 504 Gateway Timeout on its first connection. The app retries and then connects successfully. It seems odd that this happens so frequently and so regularly. At the end of this ticket I've also added a log for another, SSL-related error that has been seen on app startup.

Keycloak is connected to the Kafka consumer apps through an external load balancer in AWS. I scrutinized the load balancer and the load on Keycloak, and both seem healthy.

We are on Strimzi 0.21 and using the Strimzi OAuth adapter with Keycloak 11.

One potential improvement?
The client sends a request. The server has a problem. The client doesn't time out until the gateway does. I'm not sure why the server has a problem. Should the OAuth client have a timeout set?

I'm reading through the oauth client code.
https://github.com/strimzi/strimzi-kafka-oauth/blob/main/oauth-common/src/main/java/io/strimzi/kafka/oauth/common/OAuthAuthenticator.java#L92

It doesn't seem like timeouts are set on the http client side.
https://bluxte.net/musings/2008/08/25/dont-forget-set-javaneturl-default-timeouts/

Any thoughts?

Error Logs

org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
       at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:823)
       at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:667)
       at com.mycompany.mypackageclient.internal.BasicKafkaConsumer.<init>(BasicKafkaConsumer.java:19)
       at MYAPPcom.mycompany.mypackageclient.internal.KafkaAvroConsumer.<init>(KafkaAvroConsumer.java:15)
       at com.mycompany.mypackageclient.consumer.mycompanyKafkaAvroConsumer.getConsumer(mycompanyKafkaAvroConsumer.java:158)
       at com.mycompany.mypackageclient.consumer.mycompanyKafkaAvroConsumer.getKafkaConsumer(mycompanyKafkaAvroConsumer.java:111)
       at com.mycompany.mypackageclient.MYAPPClient.MYAPPKafkaConsumer(MYAPPClient.java:107)
       at com.mycompany.foo.mypackage.myappConsumer.run(myappConsumer.java:109)
       at com.mycompany.foo.mypackage.myappConsumer$$Lambda$603/0x000000000103e660.run(Unknown Source)
       at java.base/java.lang.Thread.run(Thread.java:836)
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: io.strimzi.kafka.oauth.common.HttpException: POST request to https://auth.mycompany.com/auth/realms/pro-realm/protocol/openid-connect/token failed with status 504: GATEWAY_TIMEOUT
       at io.strimzi.kafka.oauth.common.HttpUtil.handleResponse(HttpUtil.java:148)
       at io.strimzi.kafka.oauth.common.HttpUtil.request(HttpUtil.java:129)
       at io.strimzi.kafka.oauth.common.HttpUtil.post(HttpUtil.java:62)
       at io.strimzi.kafka.oauth.common.OAuthAuthenticator.post(OAuthAuthenticator.java:92)
       at io.strimzi.kafka.oauth.common.OAuthAuthenticator.loginWithClientSecret(OAuthAuthenticator.java:60)
       at io.strimzi.kafka.oauth.client.JaasClientOauthLoginCallbackHandler.handleCallback(JaasClientOauthLoginCallbackHandler.java:158)
       at io.strimzi.kafka.oauth.client.JaasClientOauthLoginCallbackHandler.handle(JaasClientOauthLoginCallbackHandler.java:138)
       at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.identifyToken(OAuthBearerLoginModule.java:316)
       at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.login(OAuthBearerLoginModule.java:301)
       at java.base/javax.security.auth.login.LoginContext.invoke(LoginContext.java:726)
       at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:665)
       at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:663)
       at java.base/java.security.AccessController.doPrivileged(AccessController.java:770)
       at java.base/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:663)
       at java.base/javax.security.auth.login.LoginContext.login(LoginContext.java:574)
       at org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin.login(ExpiringCredentialRefreshingLogin.java:204)
       at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerRefreshingLogin.login(OAuthBearerRefreshingLogin.java:150)
       at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:62)
       at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:105)
       at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:158)
       at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:157)
       at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:73)
       at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:105)
       at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:740)
       at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:667)
       at com.mycompany.mypackageclient.internal.BasicKafkaConsumer.<init>(BasicKafkaConsumer.java:19)
       at com.mycompany.mypackageclient.internal.KafkaAvroConsumer.<init>(KafkaAvroConsumer.java:15)
       at com.mycompany.mypackageclient.consumer.mycompanyKafkaAvroConsumer.getConsumer(mycompanyKafkaAvroConsumer.java:158)
       at com.mycompany.mypackageclient.consumer.mycompanyKafkaAvroConsumer.getKafkaConsumer(mycompanyKafkaAvroConsumer.java:111)
       at com.mycompany.mypackageclient.MYAPPClient.MYAPPKafkaConsumer(MYAPPClient.java:107)
       at com.mycompany.foo.mypackage.myappConsumer.run(myappConsumer.java:109)
       at com.mycompany.foo.mypackage.myappConsumer$$Lambda$603/0x000000000103e660.run(Unknown Source)
       at java.base/java.lang.Thread.run(Thread.java:836)

Second log from another app connection failure. These are likely not related, but recording it here because it's an app startup connection problem.

2021-05-12 14:41:57.131 INFO [kafka-producer-network-thread | producer-1] [o.a.k.c.Metadata] - [Producer clientId=producer-1] Cluster ID: foo
2021-05-12 14:41:57.526 ERROR [myapp] [o.a.k.c.s.o.OAuthBearerLoginModule] - Couldn't kickstart handshaking
javax.net.ssl.SSLException: Couldn't kickstart handshaking
      at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:127)
      at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:350)
      at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:293)
      at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:450)
      at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:411)
      at java.base/sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:567)
      at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
      at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:168)
      at io.strimzi.kafka.oauth.common.HttpUtil.request(HttpUtil.java:118)
      at io.strimzi.kafka.oauth.common.HttpUtil.post(HttpUtil.java:62)
      at io.strimzi.kafka.oauth.common.OAuthAuthenticator.post(OAuthAuthenticator.java:92)
      at io.strimzi.kafka.oauth.common.OAuthAuthenticator.loginWithClientSecret(OAuthAuthenticator.java:60)
      at io.strimzi.kafka.oauth.client.JaasClientOauthLoginCallbackHandler.handleCallback(JaasClientOauthLoginCallbackHandler.java:158)
      at io.strimzi.kafka.oauth.client.JaasClientOauthLoginCallbackHandler.handle(JaasClientOauthLoginCallbackHandler.java:138)
      at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.identifyToken(OAuthBearerLoginModule.java:316)
      at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.login(OAuthBearerLoginModule.java:301)
      at java.base/javax.security.auth.login.LoginContext.invoke(LoginContext.java:726)
      at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:665)
      at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:663)
      at java.base/java.security.AccessController.doPrivileged(AccessController.java:770)
      at java.base/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:663)
      at java.base/javax.security.auth.login.LoginContext.login(LoginContext.java:574)
      at org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin.login(ExpiringCredentialRefreshingLogin.java:204)
      at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerRefreshingLogin.login(OAuthBearerRefreshingLogin.java:150)
      at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:62)
      at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:105)
      at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:158)
      at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:157)
      at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:73)
      at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:105)
      at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:740)
      at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:667)
      at com.mycompany.myappclient.internal.BasicKafkaConsumer.<init>(BasicKafkaConsumer.java:19)
      at com.mycompany.myappclient.internal.KafkaAvroConsumer.<init>(KafkaAvroConsumer.java:15)
      at com.mycompany.myappclient.internal.ReadSchemaTopic.getConsumer(ReadSchemaTopic.java:149)
      at com.mycompany.myappclient.internal.ReadSchemaTopic.readSchema(ReadSchemaTopic.java:36)
      at com.mycompany.myappclient.consumer.mycompanyKafkaAvroConsumer.getKafkaConsumer(mycompanyKafkaAvroConsumer.java:104)
      at com.mycompany.myappclient.myappClient.myappKafkaConsumer(myappClient.java:107)
      at com.mycompany.foo.myapp.myappConsumer.connect(myappConsumer.java:101)
      at com.mycompany.foo.myapp.myappConsumer.connectLoop(myappConsumer.java:123)
      at com.mycompany.foo.myapp.myappConsumer.run(myappConsumer.java:145)
      at com.mycompany.foo.myapp.myappConsumer$$Lambda$604/0x00000000e54d5920.run(Unknown Source)
      at java.base/java.lang.Thread.run(Thread.java:836)
      Suppressed: java.net.SocketException: Broken pipe (Write failed)
            at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
            at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
            at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
            at java.base/sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
            at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:381)
            ... 41 common frames omitted
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
      at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
      at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
      at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
      at java.base/sun.security.ssl.SSLSocketOutputRecord.flush(SSLSocketOutputRecord.java:251)
      at java.base/sun.security.ssl.HandshakeOutStream.flush(HandshakeOutStream.java:89)
      at java.base/sun.security.ssl.ClientHello$ClientHelloKickstartProducer.produce(ClientHello.java:658)
      at java.base/sun.security.ssl.SSLHandshake.kickstart(SSLHandshake.java:525)
      at java.base/sun.security.ssl.ClientHandshakeContext.kickstart(ClientHandshakeContext.java:107)
      at java.base/sun.security.ssl.TransportContext.kickstart(TransportContext.java:233)
      at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:433)
      ... 39 common frames omitted
2021-05-12 14:41:57.528 ERROR [myapp] [c.n.n.m.n.myappConsumer] - TOPIC-10: Error setting up myapp: KafkaException: Failed to construct kafka consumer
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
      at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:823)
      at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:667)
      at com.mycompany.myappclient.internal.BasicKafkaConsumer.<init>(BasicKafkaConsumer.java:19)
      at com.mycompany.myappclient.internal.KafkaAvroConsumer.<init>(KafkaAvroConsumer.java:15)
      at com.mycompany.myappclient.internal.ReadSchemaTopic.getConsumer(ReadSchemaTopic.java:149)
      at com.mycompany.myappclient.internal.ReadSchemaTopic.readSchema(ReadSchemaTopic.java:36)
      at com.mycompany.myappclient.consumer.mycompanyKafkaAvroConsumer.getKafkaConsumer(mycompanyKafkaAvroConsumer.java:104)
      at com.mycompany.myappclient.myappClient.myappKafkaConsumer(myappClient.java:107)
      at com.mycompany.foo.myapp.myappConsumer.connect(myappConsumer.java:101)
      at com.mycompany.foo.myapp.myappConsumer.connectLoop(myappConsumer.java:123)
      at com.mycompany.foo.myapp.myappConsumer.run(myappConsumer.java:145)
      at com.mycompany.foo.myapp.myappConsumer$$Lambda$604/0x00000000e54d5920.run(Unknown Source)
      at java.base/java.lang.Thread.run(Thread.java:836)
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: An internal error occurred while retrieving token from callback handler
      at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:172)
      at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:157)
      at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:73)
      at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:105)
      at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:740)
      ... 12 common frames omitted
Caused by: javax.security.auth.login.LoginException: An internal error occurred while retrieving token from callback handler
      at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.identifyToken(OAuthBearerLoginModule.java:319)
      at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.login(OAuthBearerLoginModule.java:301)
      at java.base/javax.security.auth.login.LoginContext.invoke(LoginContext.java:726)
      at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:665)
      at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:663)
      at java.base/java.security.AccessController.doPrivileged(AccessController.java:770)
      at java.base/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:663)
      at java.base/javax.security.auth.login.LoginContext.login(LoginContext.java:574)
      at org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin.login(ExpiringCredentialRefreshingLogin.java:204)
      at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerRefreshingLogin.login(OAuthBearerRefreshingLogin.java:150)
      at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:62)
      at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:105)
      at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:158)
      ... 16 common frames omitted
2021-05-12 14:41:58.023 INFO [main] [o.s.b.a.s.s.UserDetailsServiceAutoConfiguration] -

@mstruk Didn't you see some similar issue?

These exceptions all seem to be network related in one way or another.

In the first one (504 Gateway Timeout) the HTTP request itself is performed properly, but a reverse proxy in front of Keycloak seems unable to forward the request to Keycloak. No socket connect timeout setting on the client can help here if the delay occurs after the client has successfully connected to the reverse proxy / gateway in the first place.

The second one (connection reset by peer) means that the reverse proxy / gateway dropped the connection in the middle of SSL negotiation. No client network setting can help with that.

I would look for additional clues on the server side: on the gateway / reverse proxy (why it periodically fails to connect to Keycloak), on Keycloak itself (for example, whether Keycloak fails to connect to its database), and also in the routing in front of the reverse proxy / gateway. More logs may give you more clues, but since it's a networking error, the only thing you can do on the client is to retry the connection.

To retry, you can use the technique described here.

@mstruk Didn't you see some similar issue?

@scholzj I don't recall investigating this issue before.

One other possibility would be to introduce a retry mechanism into the strimzi-kafka-oauth token retrieval logic itself. That could certainly be done and would improve robustness in the face of a fragile network. But then, a fragile network may cause other issues anyway.
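
To make the retry idea concrete, here is a minimal sketch of a bounded retry helper. It is not code from strimzi-kafka-oauth; the class name `Retry`, the method `withRetries` and its parameters are hypothetical, and a real implementation would probably retry only on errors known to be transient.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of strimzi-kafka-oauth.
public class Retry {

    /**
     * Runs the action up to maxAttempts times, pausing pauseMillis between
     * failed attempts, and rethrows the last failure if every attempt fails.
     */
    public static <T> T withRetries(Supplier<T> action, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(pauseMillis);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while waiting to retry", ie);
                    }
                }
            }
        }
        throw last;
    }
}
```

On the application side the same helper could wrap the `KafkaConsumer` constructor, since the `KafkaException` seen in the logs above is an unchecked exception.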

The enhancement you suggest might help as well.

The default URLConnection implementation uses the following two system properties to set the default for connectTimeout and readTimeout:

  • sun.net.client.defaultConnectTimeout
  • sun.net.client.defaultReadTimeout

As documented here.

But that sets the defaults for all uses of java.net.URLConnection in the JVM.
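
As a rough sketch, the two properties can be set programmatically like this. The class name `DefaultTimeouts` and the timeout values are only illustrative. As far as I know the JDK reads these properties once, when its HTTP client classes are first loaded, so they need to be set before the first connection is opened; passing them as `-D` options on the command line is the safer route.

```java
// Illustrative only: the class name and the values used are assumptions.
public class DefaultTimeouts {

    /** Sets JVM-wide default timeouts (in milliseconds) for the built-in URLConnection handlers. */
    public static void apply(int connectTimeoutMillis, int readTimeoutMillis) {
        System.setProperty("sun.net.client.defaultConnectTimeout",
                Integer.toString(connectTimeoutMillis));
        System.setProperty("sun.net.client.defaultReadTimeout",
                Integer.toString(readTimeoutMillis));
    }

    public static void main(String[] args) {
        // Call this early, before any HTTP(S) connection is made in the JVM.
        apply(5_000, 10_000);
    }
}
```

The command-line equivalent would be `-Dsun.net.client.defaultConnectTimeout=5000 -Dsun.net.client.defaultReadTimeout=10000`.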

We could add the ability to explicitly configure the connectTimeout and readTimeout for the token acquisition connection through strimzi-kafka-oauth configuration, and use those values to set the timeouts on that individual connection.
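
A minimal sketch of what per-connection timeouts could look like with the standard `java.net.HttpURLConnection` API follows. The class `TokenEndpointConnection` and its `open` method are hypothetical, not the actual strimzi-kafka-oauth code, and how the values would be read from configuration is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical sketch, not the actual HttpUtil implementation.
public class TokenEndpointConnection {

    /**
     * Prepares (but does not yet open) a POST connection to the token endpoint
     * with explicit timeouts, both given in milliseconds.
     */
    public static HttpURLConnection open(URI tokenEndpoint,
                                         int connectTimeoutMillis,
                                         int readTimeoutMillis) {
        try {
            HttpURLConnection con = (HttpURLConnection) tokenEndpoint.toURL().openConnection();
            con.setConnectTimeout(connectTimeoutMillis); // limit on establishing the socket
            con.setReadTimeout(readTimeoutMillis);       // limit on waiting for response data
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a read timeout set, a request that hangs behind the gateway would fail on the client with a `java.net.SocketTimeoutException` after `readTimeoutMillis`, rather than waiting for the gateway to give up and return a 504.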

One interesting data point is that I can't reproduce the 504 errors running hundreds of curl post jobs in parallel.

A retry mechanism in strimzi-kafka-oauth might be useful, because the exception is currently propagated upwards and causes the app to be torn down. That seems a bit heavy given the transient nature of the error.

@jrivers96 The recently released 0.8.0 client does not use Keycloak anymore. But I'm not sure if @mstruk used Keycloak for anything else than parsing the tokens.

We never used the Keycloak adapter, so I think this refers more to the fact that we rely on the default behaviour of javax.net.ssl.HttpsURLConnection, which reuses keep-alive connections (it pools them for a while). I'd say the problem here is that the reverse proxy or gateway in front of the SSO either returns a valid HTTP response (e.g. 504 Gateway Timeout) or drops the connection when the client tries to reuse it.

You can try to disable the keep-alive behavior by setting the system property http.keepAlive to false as described here.
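
For illustration, this is how the property could be set from code. The class name `DisableKeepAlive` is made up for this sketch. My understanding is that the JDK reads `http.keepAlive` when its HTTP client classes are first loaded, so it has to be set before the first HTTP(S) request, or passed as `-Dhttp.keepAlive=false` at startup.

```java
// Illustrative only: the class name is an assumption.
public class DisableKeepAlive {

    /** Asks the built-in HTTP client not to reuse (pool) connections. */
    public static void apply() {
        System.setProperty("http.keepAlive", "false");
    }

    public static void main(String[] args) {
        // Call this before the first HTTP(S) request is made in the JVM.
        apply();
    }
}
```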

I think this problem is related to a Keycloak memory leak: keycloak/keycloak#9647. It could also be related to timeout-related inefficient usage of the Keycloak resource pool by Strimzi.

I'm going to close this issue for now as it doesn't seem to be related to strimzi. I'll open it back up when I have more evidence about this.