kafka consumer apps using oauth frequently experience startup connection problems
jrivers96 opened this issue · comments
Summary
50 apps consume from 50 different kafka topics, and we noticed that a 504 gateway timeout happens for about one of the 50 apps on its first connection. The app retries and then connects successfully. It seems odd that this happens with such frequency and regularity. At the end of this ticket I've also added a log for another, SSL-related error that has been seen on app startup.
Keycloak is connected through an external load balancer in AWS to the kafka consumer apps. I scrutinized the load balancer and the load on keycloak and these seem healthy.
We are on strimzi 0.21 and using the strimzi oauth adapter with keycloak 11.
One potential improvement?
The client sends a request. The server has a problem. The client doesn't time out until the gateway does. I'm not sure why the server has a problem. Should the oauth client have a timeout set?
I'm reading through the oauth client code.
https://github.com/strimzi/strimzi-kafka-oauth/blob/main/oauth-common/src/main/java/io/strimzi/kafka/oauth/common/OAuthAuthenticator.java#L92
It doesn't seem like timeouts are set on the http client side.
https://bluxte.net/musings/2008/08/25/dont-forget-set-javaneturl-default-timeouts/
Any thoughts?
Error Logs
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:823)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:667)
at com.mycompany.mypackageclient.internal.BasicKafkaConsumer.<init>(BasicKafkaConsumer.java:19)
at MYAPPcom.mycompany.mypackageclient.internal.KafkaAvroConsumer.<init>(KafkaAvroConsumer.java:15)
at com.mycompany.mypackageclient.consumer.mycompanyKafkaAvroConsumer.getConsumer(mycompanyKafkaAvroConsumer.java:158)
at com.mycompany.mypackageclient.consumer.mycompanyKafkaAvroConsumer.getKafkaConsumer(mycompanyKafkaAvroConsumer.java:111)
at com.mycompany.mypackageclient.MYAPPClient.MYAPPKafkaConsumer(MYAPPClient.java:107)
at com.mycompany.foo.mypackage.myappConsumer.run(myappConsumer.java:109)
at com.mycompany.foo.mypackage.myappConsumer$$Lambda$603/0x000000000103e660.run(Unknown Source)
at java.base/java.lang.Thread.run(Thread.java:836)
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: io.strimzi.kafka.oauth.common.HttpException: POST request to https://auth.mycompany.com/auth/realms/pro-realm/protocol/openid-connect/token failed with status 504: GATEWAY_TIMEOUT
at io.strimzi.kafka.oauth.common.HttpUtil.handleResponse(HttpUtil.java:148)
at io.strimzi.kafka.oauth.common.HttpUtil.request(HttpUtil.java:129)
at io.strimzi.kafka.oauth.common.HttpUtil.post(HttpUtil.java:62)
at io.strimzi.kafka.oauth.common.OAuthAuthenticator.post(OAuthAuthenticator.java:92)
at io.strimzi.kafka.oauth.common.OAuthAuthenticator.loginWithClientSecret(OAuthAuthenticator.java:60)
at io.strimzi.kafka.oauth.client.JaasClientOauthLoginCallbackHandler.handleCallback(JaasClientOauthLoginCallbackHandler.java:158)
at io.strimzi.kafka.oauth.client.JaasClientOauthLoginCallbackHandler.handle(JaasClientOauthLoginCallbackHandler.java:138)
at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.identifyToken(OAuthBearerLoginModule.java:316)
at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.login(OAuthBearerLoginModule.java:301)
at java.base/javax.security.auth.login.LoginContext.invoke(LoginContext.java:726)
at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:665)
at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:663)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:770)
at java.base/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:663)
at java.base/javax.security.auth.login.LoginContext.login(LoginContext.java:574)
at org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin.login(ExpiringCredentialRefreshingLogin.java:204)
at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerRefreshingLogin.login(OAuthBearerRefreshingLogin.java:150)
at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:62)
at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:105)
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:158)
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:157)
at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:73)
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:105)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:740)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:667)
at com.mycompany.mypackageclient.internal.BasicKafkaConsumer.<init>(BasicKafkaConsumer.java:19)
at com.mycompany.mypackageclient.internal.KafkaAvroConsumer.<init>(KafkaAvroConsumer.java:15)
at com.mycompany.mypackageclient.consumer.mycompanyKafkaAvroConsumer.getConsumer(mycompanyKafkaAvroConsumer.java:158)
at com.mycompany.mypackageclient.consumer.mycompanyKafkaAvroConsumer.getKafkaConsumer(mycompanyKafkaAvroConsumer.java:111)
at com.mycompany.mypackageclient.MYAPPClient.MYAPPKafkaConsumer(MYAPPClient.java:107)
at com.mycompany.foo.mypackage.myappConsumer.run(myappConsumer.java:109)
at com.mycompany.foo.mypackage.myappConsumer$$Lambda$603/0x000000000103e660.run(Unknown Source)
at java.base/java.lang.Thread.run(Thread.java:836)
Second log from another app connection failure. These are likely not related, but recording it here because it's an app startup connection problem.
2021-05-12 14:41:57.131 INFO [kafka-producer-network-thread | producer-1] [o.a.k.c.Metadata] - [Producer clientId=producer-1] Cluster ID: foo
2021-05-12 14:41:57.526 ERROR [myapp] [o.a.k.c.s.o.OAuthBearerLoginModule] - Couldn't kickstart handshaking
javax.net.ssl.SSLException: Couldn't kickstart handshaking
at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:127)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:350)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:293)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:450)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:411)
at java.base/sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:567)
at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:168)
at io.strimzi.kafka.oauth.common.HttpUtil.request(HttpUtil.java:118)
at io.strimzi.kafka.oauth.common.HttpUtil.post(HttpUtil.java:62)
at io.strimzi.kafka.oauth.common.OAuthAuthenticator.post(OAuthAuthenticator.java:92)
at io.strimzi.kafka.oauth.common.OAuthAuthenticator.loginWithClientSecret(OAuthAuthenticator.java:60)
at io.strimzi.kafka.oauth.client.JaasClientOauthLoginCallbackHandler.handleCallback(JaasClientOauthLoginCallbackHandler.java:158)
at io.strimzi.kafka.oauth.client.JaasClientOauthLoginCallbackHandler.handle(JaasClientOauthLoginCallbackHandler.java:138)
at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.identifyToken(OAuthBearerLoginModule.java:316)
at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.login(OAuthBearerLoginModule.java:301)
at java.base/javax.security.auth.login.LoginContext.invoke(LoginContext.java:726)
at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:665)
at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:663)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:770)
at java.base/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:663)
at java.base/javax.security.auth.login.LoginContext.login(LoginContext.java:574)
at org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin.login(ExpiringCredentialRefreshingLogin.java:204)
at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerRefreshingLogin.login(OAuthBearerRefreshingLogin.java:150)
at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:62)
at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:105)
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:158)
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:157)
at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:73)
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:105)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:740)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:667)
at com.mycompany.myappclient.internal.BasicKafkaConsumer.<init>(BasicKafkaConsumer.java:19)
at com.mycompany.myappclient.internal.KafkaAvroConsumer.<init>(KafkaAvroConsumer.java:15)
at com.mycompany.myappclient.internal.ReadSchemaTopic.getConsumer(ReadSchemaTopic.java:149)
at com.mycompany.myappclient.internal.ReadSchemaTopic.readSchema(ReadSchemaTopic.java:36)
at com.mycompany.myappclient.consumer.mycompanyKafkaAvroConsumer.getKafkaConsumer(mycompanyKafkaAvroConsumer.java:104)
at com.mycompany.myappclient.myappClient.myappKafkaConsumer(myappClient.java:107)
at com.mycompany.foo.myapp.myappConsumer.connect(myappConsumer.java:101)
at com.mycompany.foo.myapp.myappConsumer.connectLoop(myappConsumer.java:123)
at com.mycompany.foo.myapp.myappConsumer.run(myappConsumer.java:145)
at com.mycompany.foo.myapp.myappConsumer$$Lambda$604/0x00000000e54d5920.run(Unknown Source)
at java.base/java.lang.Thread.run(Thread.java:836)
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:381)
... 41 common frames omitted
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/sun.security.ssl.SSLSocketOutputRecord.flush(SSLSocketOutputRecord.java:251)
at java.base/sun.security.ssl.HandshakeOutStream.flush(HandshakeOutStream.java:89)
at java.base/sun.security.ssl.ClientHello$ClientHelloKickstartProducer.produce(ClientHello.java:658)
at java.base/sun.security.ssl.SSLHandshake.kickstart(SSLHandshake.java:525)
at java.base/sun.security.ssl.ClientHandshakeContext.kickstart(ClientHandshakeContext.java:107)
at java.base/sun.security.ssl.TransportContext.kickstart(TransportContext.java:233)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:433)
... 39 common frames omitted
2021-05-12 14:41:57.528 ERROR [myapp] [c.n.n.m.n.myappConsumer] - TOPIC-10: Error setting up myapp: KafkaException: Failed to construct kafka consumer
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:823)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:667)
at com.mycompany.myappclient.internal.BasicKafkaConsumer.<init>(BasicKafkaConsumer.java:19)
at com.mycompany.myappclient.internal.KafkaAvroConsumer.<init>(KafkaAvroConsumer.java:15)
at com.mycompany.myappclient.internal.ReadSchemaTopic.getConsumer(ReadSchemaTopic.java:149)
at com.mycompany.myappclient.internal.ReadSchemaTopic.readSchema(ReadSchemaTopic.java:36)
at com.mycompany.myappclient.consumer.mycompanyKafkaAvroConsumer.getKafkaConsumer(mycompanyKafkaAvroConsumer.java:104)
at com.mycompany.myappclient.myappClient.myappKafkaConsumer(myappClient.java:107)
at com.mycompany.foo.myapp.myappConsumer.connect(myappConsumer.java:101)
at com.mycompany.foo.myapp.myappConsumer.connectLoop(myappConsumer.java:123)
at com.mycompany.foo.myapp.myappConsumer.run(myappConsumer.java:145)
at com.mycompany.foo.myapp.myappConsumer$$Lambda$604/0x00000000e54d5920.run(Unknown Source)
at java.base/java.lang.Thread.run(Thread.java:836)
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: An internal error occurred while retrieving token from callback handler
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:172)
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:157)
at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:73)
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:105)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:740)
... 12 common frames omitted
Caused by: javax.security.auth.login.LoginException: An internal error occurred while retrieving token from callback handler
at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.identifyToken(OAuthBearerLoginModule.java:319)
at org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule.login(OAuthBearerLoginModule.java:301)
at java.base/javax.security.auth.login.LoginContext.invoke(LoginContext.java:726)
at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:665)
at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:663)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:770)
at java.base/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:663)
at java.base/javax.security.auth.login.LoginContext.login(LoginContext.java:574)
at org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin.login(ExpiringCredentialRefreshingLogin.java:204)
at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerRefreshingLogin.login(OAuthBearerRefreshingLogin.java:150)
at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:62)
at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:105)
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:158)
... 16 common frames omitted
2021-05-12 14:41:58.023 INFO [main] [o.s.b.a.s.s.UserDetailsServiceAutoConfiguration] -
@mstruk Didn't you see some similar issue?
These exceptions all seem to be network related in one way or another.
In the first one (504 Gateway error), the client performs the HTTP request properly, but a reverse proxy in front of Keycloak seems to be unable to forward the request to Keycloak. No socket connect timeout setting on the client can help here if a delay occurs after the client has successfully connected to the reverse proxy / gateway in the first place.
The second one (connection reset by peer) means that the reverse proxy / gateway dropped the connection in the middle of SSL negotiation. No client network setting can help with that.
I would look for additional clues on the server side: on the gateway / reverse proxy (why does it periodically fail to connect to Keycloak?), on Keycloak itself (for example, whether it fails to connect to its database), and in the routing in front of the reverse proxy / gateway. More logs may give you more clues, but since it's a networking error, the only thing you can do on the client is retry the connection.
To retry, you can use the technique described here.
One other possibility would be to introduce a retry mechanism into the strimzi-kafka-oauth token retrieval logic itself. That could certainly be done and would improve robustness in the face of a fragile network. But then, a fragile network may cause other issues anyway.
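As a rough illustration of retrying on the client side, here is a minimal sketch of a generic retry helper with a doubling pause between attempts. `RetrySketch` and its `retry` method are hypothetical names made up for this example. They are not part of strimzi-kafka-oauth or the Kafka client, and this is not necessarily the technique linked above.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of strimzi-kafka-oauth or Kafka.
final class RetrySketch {

    /**
     * Calls the action up to maxAttempts times. After each failed attempt
     * (except the last) it sleeps, doubling the pause every time.
     * If every attempt fails, the last exception is rethrown.
     */
    static <T> T retry(Callable<T> action, int maxAttempts, long initialPauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long pause = initialPauseMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pause);
                    pause *= 2;
                }
            }
        }
        throw last;
    }
}
```

An application could wrap consumer construction in it, for example `RetrySketch.retry(() -> new KafkaConsumer<String, String>(props), 5, 500L)`, where `props` is assumed to be the application's existing consumer configuration. That way a transient token endpoint failure does not tear the whole app down.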
The enhancement you suggest might help as well.
The default URLConnection implementation uses the following two system properties to set the defaults for connectTimeout and readTimeout:
sun.net.client.defaultConnectTimeout
sun.net.client.defaultReadTimeout
As documented here.
But that sets the default for all uses of java.net.URLConnection.
We could add the ability to explicitly configure the connectTimeout and readTimeout for the token acquisition connection through strimzi-kafka-oauth configuration, using those values to set the timeouts on the individual connection.
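To make the difference concrete, here is a minimal sketch of per-connection timeouts using the standard `java.net` API. `TimeoutSketch` and `open` are hypothetical names for this example, and the timeout values in the usage note are arbitrary. How strimzi-kafka-oauth would expose such settings in its configuration is not shown here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for illustration only; not the strimzi-kafka-oauth implementation.
final class TimeoutSketch {

    /**
     * Opens (but does not yet connect) an HTTP(S) connection with explicit
     * timeouts in milliseconds. A value of 0 means "wait indefinitely".
     * Only this connection is affected, unlike the JVM-wide defaults set with
     * -Dsun.net.client.defaultConnectTimeout / -Dsun.net.client.defaultReadTimeout.
     */
    static HttpURLConnection open(String url, int connectTimeoutMillis, int readTimeoutMillis)
            throws IOException {
        HttpURLConnection con = (HttpURLConnection) URI.create(url).toURL().openConnection();
        con.setConnectTimeout(connectTimeoutMillis); // limit on establishing the TCP connection
        con.setReadTimeout(readTimeoutMillis);       // limit on waiting for response data
        return con;
    }
}
```

For example, `TimeoutSketch.open(tokenEndpointUrl, 5000, 10000)` would give up waiting for a response after 10 seconds on the client side instead of waiting for the gateway to answer with a 504. `tokenEndpointUrl` stands for whatever token endpoint the application uses.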
One interesting data point is that I can't reproduce the 504 errors running hundreds of curl post jobs in parallel.
A retry mechanism in strimzi-kafka-oauth might be useful, because the exception currently propagates upwards and causes the app to be torn down. That seems a bit heavy given the transient nature of the error.
The following stack overflow thread sounds like my problem -
@jrivers96 The recently released 0.8.0 client does not use Keycloak anymore. But I'm not sure if @mstruk used Keycloak for anything else than parsing the tokens.
We never used the Keycloak adapter, so I think this refers more to the fact that we use the default behaviour of java.net.HttpsURLConnection, which reuses keep-alive connections (it pools them for a while). I'd say the problem here is that, when the client tries to reuse a connection, the reverse proxy or gateway in front of the SSO seems to either return a valid HTTP response (e.g. 504 Gateway error) or drop the connection.
You can try and disable the keep-alive behavior by setting System property http.keepAlive to false as described here.
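A minimal sketch of setting that property programmatically follows. `KeepAliveSketch` is a hypothetical name for this example. It is an assumption that the property has to be in place before the JVM opens its first HTTP(S) connection, so the call belongs at the very start of the application. Passing `-Dhttp.keepAlive=false` on the command line avoids the ordering question.

```java
// Hypothetical helper for illustration only.
final class KeepAliveSketch {

    /**
     * Disables connection reuse for java.net HTTP(S) connections in this JVM.
     * Assumption: call this before any HTTP(S) connection is opened,
     * e.g. as the first statement of main().
     */
    static void disableHttpKeepAlive() {
        System.setProperty("http.keepAlive", "false");
    }
}
```

Note that this is a JVM-wide setting, so it affects every user of java.net.URLConnection in the process, not only the token request.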
I think this problem is related to a Keycloak memory leak: keycloak/keycloak#9647. It could also be related to inefficient usage of the Keycloak resource pool by strimzi, connected to timeouts.
I'm going to close this issue for now as it doesn't seem to be related to strimzi. I'll open it back up when I have more evidence about this.