strimzi / strimzi-kafka-bridge

An HTTP bridge for Apache Kafka®

Excessive Connections to Kafka from Bridge

shubhamg7 opened this issue · comments

We are currently running into issues with the bridge making too many connections to Kafka. We use MSK, which is managed by AWS and enforces a strict connection limit that we are hitting, which makes our current configuration unfeasible. We suspect that the bridge is producing on too many threads, which drives up our connection count heavily. With our current configuration we are also required to run the bridge in multiple pods, which exacerbates the problem. We've tried configuring the Kafka producer of the bridge itself, but it had very little impact.

Is there any way to configure the bridge to address this issue? We'd appreciate any additional insight into this matter.

Thank you in advance!

I think you need to provide more details about how you use the bridge, share your configuration, some usage statistics and patterns etc.

Sure let me get that for you.

How we are using the bridge
We inject the bridge container into every single Kubernetes pod that needs to communicate with Kafka. For testing, we have been using 8 pods (each with its own bridge container attached) to publish to Kafka.

Usage Statistics
This table highlights some of the research we have done into this issue. We have been testing different values of linger.ms and how they affect the number of connections being created. We are publishing around 2000 events per minute.

|                        | LingerMS=0 | LingerMS=5 | LingerMS=100 |
|------------------------|------------|------------|--------------|
| Bytes/Sec              | 450        | 435        | 470          |
| ConnectionCount        | 367        | 371        | 365          |
| ConnectionCreationRate | 10.9       | 13         | 10.2         |

We are suspecting that the bridge is producing on too many threads, thereby increasing our connection count heavily.
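
For reference, producer overrides like these are passed to the bridge through its configuration file using the kafka.producer. prefix. A minimal sketch of such a configuration (the bootstrap address is a placeholder, and only the values discussed above are shown, not our full config):

    # Minimal sketch of a bridge application.properties with producer overrides
    # (placeholder bootstrap address; only the values discussed above are shown)

    # Kafka common properties
    kafka.bootstrap.servers=b-1.example.kafka.us-east-1.amazonaws.com:9098

    # HTTP related settings
    http.host=0.0.0.0
    http.port=8080

    # Apache Kafka producer overrides use the kafka.producer. prefix
    kafka.producer.acks=0
    kafka.producer.batch.size=32768
    kafka.producer.linger.ms=100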

Full Producer Config (straight from the bridge logs)

[2022-05-20 18:57:59,544] INFO  <ProducerConfig:372> [oop-thread-1] ProducerConfig values: 
	acks = 0
	batch.size = 32768
	bootstrap.servers = [b-1.affirm-live-chrono-1.va062a.c10.kafka.us-east-1.amazonaws.com:9098, b-2.affirm-live-chrono-1.va062a.c10.kafka.us-east-1.amazonaws.com:9098, b-3.affirm-live-chrono-1.va062a.c10.kafka.us-east-1.amazonaws.com:9098]
	buffer.memory = 33554432
	client.dns.lookup = use_all_dns_ips
	client.id = producer-10
	compression.type = none
	connections.max.idle.ms = 540000
	delivery.timeout.ms = 120000
	enable.idempotence = false
	interceptor.classes = []
	internal.auto.downgrade.txn.commit = false
	key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
	linger.ms = 100
	max.block.ms = 60000
	max.in.flight.requests.per.connection = 5
	max.request.size = 1048576
	metadata.max.age.ms = 300000
	metadata.max.idle.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
	receive.buffer.bytes = 32768
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 30000
	retries = 10
	retry.backoff.ms = 100
	sasl.client.callback.handler.class = class software.amazon.msk.auth.iam.IAMClientCallbackHandler
	sasl.jaas.config = [hidden]
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.login.callback.handler.class = null
	sasl.login.class = null
	sasl.login.refresh.buffer.seconds = 300
	sasl.login.refresh.min.period.seconds = 60
	sasl.login.refresh.window.factor = 0.8
	sasl.login.refresh.window.jitter = 0.05
	sasl.mechanism = AWS_MSK_IAM
	security.protocol = SASL_SSL
	security.providers = null
	send.buffer.bytes = 131072
	socket.connection.setup.timeout.max.ms = 30000
	socket.connection.setup.timeout.ms = 10000
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
	ssl.endpoint.identification.algorithm = https
	ssl.engine.factory.class = null
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.certificate.chain = null
	ssl.keystore.key = null
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLSv1.3
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.certificates = null
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	transaction.timeout.ms = 60000
	transactional.id = null
	value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer

I would not expect linger.ms to change much. But it might be more interesting to understand how you are sending the messages - is it a new connection every time, or does it maintain the connection, etc.? @ppatierno might know how it behaves in the different cases.

PS: Out of curiosity - why use the bridge in every single pod instead of using the Kafka client directly?

Does that not depend on the implementation of the bridge? We currently send a request to the bridge on every event. We haven't configured the bridge beyond the producer/consumer parameters.

Regarding your last question, we use the bridge because most of our applications are Python, and getting MSK <> IAM authentication to work from a Python application proved to be too painful.

> Does that not depend on the implementation of the bridge? We currently send a request to the bridge on every event. We haven't configured the bridge beyond the producer/consumer parameters.

I mean the HTTP side of things - i.e. HTTP connections etc., which is what you control in your application.

> Regarding your last question, we use the bridge because most of our applications are Python, and getting MSK <> IAM authentication to work from a Python application proved to be too painful.

Ok. Fair enough. Regardless of this particular issue, using the Kafka client directly when possible will be more efficient.

> I mean the HTTP side of things - i.e. HTTP connections etc., which is what you control in your application.

Ah, I understand your question now. Sorry, I wasn't entirely sure whether it would affect connections between the bridge and Kafka. All our Python applications send a separate HTTP request to the bridge for each event. We do not do any batching on the client side. Both the application and the bridge run as containers within the same Kubernetes pod, so these HTTP requests are essentially local network calls. In this specific case, the application is a Gunicorn Flask web server configured to run multiple worker processes.

For the HTTP requests made to the bridge, we use a workaround to ensure that our calls are completely async: we forcibly terminate each request 3ms after making it, thus ignoring the responses from the bridge.
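
For illustration, this is roughly what that workaround looks like on the client side - a simplified sketch, with the bridge address and topic as placeholders (assuming the sidecar bridge listens on localhost:8080):

    # Simplified sketch of the fire-and-forget workaround described above.
    # Bridge address and topic are placeholders (bridge sidecar assumed on localhost:8080).
    import json
    import requests

    BRIDGE_URL = "http://localhost:8080/topics/kafka-test-apps"
    HEADERS = {"content-type": "application/vnd.kafka.json.v2+json"}

    def publish(event):
        payload = {"records": [{"value": event}]}
        try:
            # The very short timeout makes the call fire-and-forget: the request is
            # sent, but we never wait for (or read) the bridge's response. Note that
            # every requests.post call here also opens a brand new HTTP connection.
            requests.post(BRIDGE_URL, data=json.dumps(payload),
                          headers=HEADERS, timeout=0.003)
        except requests.exceptions.Timeout:
            pass  # response intentionally ignored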

Maybe you should not be closing the connections, so that the HTTP connection gets reused. You can try out the difference it makes in your environment:

  1. First try to run something like this bash script:

    while true;
    do
      curl -X POST \
        $CONNECT_BRIDGE/topics/kafka-test-apps \
        -H 'content-type: application/vnd.kafka.json.v2+json' \
        -d '{
          "records": [
              {
                  "key": "my-key",
                  "value": "Hello World!"
              }
          ]
      }'
    
      echo ""
      sleep 1
    done

    If you check the log from the Bridge, you will see that it opens and closes the connection for every call.

  2. Now try the example producer written in Java, which uses the same client and, as far as I understand, reuses the HTTP connection (see https://github.com/strimzi/client-examples/tree/main/http/vertx). When you run that and check the log, you will see that the bridge no longer opens a new connection for each HTTP request.

So I guess you need to make your application behave more like the Java example and less like the curl example.
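
If it helps, here is a rough Python sketch of the same idea for a client like yours (just an illustration, not bridge code; the address and topic are placeholders). A single long-lived requests.Session keeps the HTTP connection to the bridge alive, so the bridge in turn keeps reusing one producer connection to Kafka, and the records array lets you batch several events into one request if you want:

    # Rough sketch: reuse one HTTP connection to the bridge instead of opening one per event.
    # Bridge address and topic are placeholders; error handling is kept minimal.
    import json
    import requests

    BRIDGE_URL = "http://localhost:8080/topics/kafka-test-apps"
    HEADERS = {"content-type": "application/vnd.kafka.json.v2+json"}

    # One long-lived session per process: keep-alive means the same TCP connection
    # to the bridge is reused across requests, so the bridge does not keep opening
    # and closing producer connections to Kafka on the other side.
    session = requests.Session()

    def publish(values):
        # The bridge accepts multiple records per request, so events can also be
        # batched client-side to reduce the number of HTTP requests even further.
        payload = {"records": [{"value": v} for v in values]}
        resp = session.post(BRIDGE_URL, data=json.dumps(payload),
                            headers=HEADERS, timeout=5)
        resp.raise_for_status()

    publish(["Hello World!"])
    publish(["one", "two", "three"])  # several records in a single HTTP request

With Gunicorn running multiple worker processes you would still end up with roughly one connection per worker, but that should be far fewer than the counts in your table.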

Thanks for the detail. I will try the two things once I'm back at work on Tuesday, but I trust the results you mentioned. We will try to decrease the number of connections made to the bridge from the client.

As an aside, I'm a little confused as to how optimizing the connections between the client application and the bridge would help reduce the connection count between the bridge and Kafka. From the Kafka docs, send() uses an internal buffer to batch events before sending them, so the number of requests between the client and the bridge shouldn't matter. That is, unless the internal KafkaProducer isn't shared across threads in the bridge, so that each has its own buffer in memory?

It is not shared between different HTTP clients and that is IMHO desired.

Sorry to be late, but Jakub's analysis is right.
For each HTTP connection from the HTTP client, a producer connection is established on the other side.
If you reuse the same HTTP connection to make different requests (to produce messages), the same producer connection will be used by the bridge. If you open and close HTTP connections, then the corresponding producer connections to Kafka will be opened and closed on the other side, and this could hit the AWS limits if done within a very short time frame.

Could you elaborate on why you chose to maintain separate instances of the producer?

From here it seems that sharing a single instance is more performant.

That is a bit simplified. There will certainly be a point where more connections are better. The HTTP bridge does not have the intelligence to understand how many connections you might need. Your app, on the other hand, knows about such things, so it can handle this on the HTTP level by reusing the connection or not.

Also, as I said before, it provides much better separation between the different clients (the general expectation is: different HTTP connection => different client). In the future, if/when we implement, for example, more authentication features on the HTTP side, this will be very important. I think it also corresponds better to the multi-protocol idea: across the protocols, connection-level separation (not necessarily HTTP, but for example also an AMQP connection) maps well to this.

Appreciate the detail, that makes a lot of sense.

For our use case, we will most likely only need one KafkaProducer per bridge that we run. This is because we scale heavily outwards, and I doubt we'd need more than one instance of a KafkaProducer per bridge (as evidence, we have Java applications that use a single instance of KafkaProducer). I'm thinking of forking this project and modifying the bridge to share a single instance of a KafkaProducer. Do you foresee any issues with this? And if you could point me to a good place to start, I'd appreciate it very much.

Well, that is your call. You could also just write your own application (since you seem to need only a very limited set of features, it might be easier than editing a general-purpose codebase), or use something like Apache Camel, I guess.

We can close this by the way. Thank you for all the info!