strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes

Home Page: https://strimzi.io/

[Enhancement]: hot reload Kafka on changes in `brokerCertChainAndKey` instead of a rolling update

vpedosyuk opened this issue

Related problem

From the docs:

When the certificate or key in the brokerCertChainAndKey secret is updated, the operator will automatically detect it in the next reconciliation and trigger a rolling update of the Kafka brokers to reload the certificate.

In an environment where a Kafka broker restart is very undesirable, it becomes hard to keep external TLS certificates short-lived (e.g. 24 hours with a 3rd-party PKI), because each certificate change causes a rolling restart of the brokers and usually some downtime.

In general, it'd be great to have as few reasons for a broker restart as possible.

Suggested solution

When a Kubernetes secret referenced in brokerCertChainAndKey changes, the Strimzi operator should dynamically replace the old certificates with the new ones without restarting the brokers.

Alternatives

A proper HA configuration might reduce the effects of such restarts, but it's not always possible.

Additional context

It seems like Kafka itself supports hot-swapping of certificates.
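
As a rough illustration of what hot-swapping looks like on the Kafka side: Kafka allows the keystore of an existing listener to be updated as a per-broker dynamic config, with the config name prefixed by `listener.name.{listenerName}.`. The sketch below only builds the `kafka-configs.sh` invocation for that. The function name, listener name, broker ID, bootstrap address and keystore path are placeholder assumptions, and this is not how Strimzi does (or would) implement it.

```python
import shlex


def build_keystore_reload_command(bootstrap_server, broker_id, listener_name, keystore_path):
    """Build a kafka-configs.sh command that points one broker's listener
    at a (possibly renewed) keystore file via a per-broker dynamic config.

    All argument values are supplied by the caller; nothing here is
    Strimzi-specific.
    """
    # Kafka's listener config prefix uses the lower-cased listener name.
    prefix = f"listener.name.{listener_name.lower()}."
    return [
        "bin/kafka-configs.sh",
        "--bootstrap-server", bootstrap_server,
        "--entity-type", "brokers",
        "--entity-name", str(broker_id),
        "--alter",
        "--add-config", f"{prefix}ssl.keystore.location={keystore_path}",
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = build_keystore_reload_command(
        "localhost:9092", 0, "EXTERNAL", "/tmp/keystore.p12"
    )
    print(shlex.join(cmd))
```

The command has to be issued per broker, and the new keystore file must already be in place on that broker before the config is altered.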

Isn't this already tracked in some other issue? In any case, it should be kept in mind that:

  • Improved support for reloading certificates without major limitations, such as restrictions on DN changes, was added only in Kafka 3.7.0. So it is not easy to implement this while still supporting Kafka 3.6.x.
  • Updating the configuration is only one part of the problem, as you also need to be able to prepare / load the certificates on the fly.
  • In reality, the benefits will also be limited, as there are parts where TLS certificates might not be reloadable even with Kafka 3.7.0 (for example because they are part of a plugin configuration). So in many setups, rolling updates will still be needed.

I do not want to make it sound like this is not worth the effort. I am just pointing out that this is not as simple as it might sound and has some obstacles. (I actually wrote KIP-978 in Kafka exactly for this purpose; it just takes a long time to bubble through.)

@scholzj yes, I've seen your KIP, thanks. In our case the SAN and DN remain unchanged; the only thing that changes is the expiration time, which I believe is a common case for certificate renewal.
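
For illustration only, the "only the expiration time changes" case could be checked before attempting a reload roughly as follows. The sketch assumes both certificates have already been decoded into the dict shape that Python's `ssl.SSLSocket.getpeercert()` returns. The function name and sample values are made up, and this is not the validation that Kafka or Strimzi actually performs.

```python
import ssl


def is_expiry_only_renewal(old_cert: dict, new_cert: dict) -> bool:
    """Return True if new_cert has the same subject DN and SANs as old_cert
    and expires strictly later.

    Both arguments use the dict layout of ssl.SSLSocket.getpeercert().
    """
    same_subject = old_cert.get("subject") == new_cert.get("subject")
    # SAN order is not significant, so compare as sets.
    same_sans = set(old_cert.get("subjectAltName", ())) == set(
        new_cert.get("subjectAltName", ())
    )
    later_expiry = ssl.cert_time_to_seconds(
        new_cert["notAfter"]
    ) > ssl.cert_time_to_seconds(old_cert["notAfter"])
    return same_subject and same_sans and later_expiry


if __name__ == "__main__":
    # Hypothetical certificates: same identity, 24 hours more validity.
    old = {
        "subject": ((("commonName", "kafka.example.com"),),),
        "subjectAltName": (("DNS", "kafka.example.com"),),
        "notAfter": "Apr 18 00:00:00 2024 GMT",
    }
    new = dict(old, notAfter="Apr 19 00:00:00 2024 GMT")
    print(is_expiry_only_renewal(old, new))
```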

The problem is that unless you can reload certificates dynamically in all cases, it is basically not feasible because of the complexity. That is why that KIP is important: it should make it possible to use dynamic reloading all the time (for the Kafka parts at least).

Understood. Anyways, thank you for your efforts!

P.S. I couldn't find a similar issue reported here, hence I created one.

Discussed on the community call on 18.4.2024: Should be kept and implemented. A proposal will be needed.