## What is it

Kafka Replicator Exactly-Once is a tool that replicates data between different Apache Kafka clusters with an exactly-once guarantee.

See the launch options below for a quick start.
## The problem

### Why replication between clusters is needed
- Making a copy of data for disaster recovery purposes.
- Gathering data from different regions to the central one for aggregation.
- Data sharing between organizations.
- ...
More details in the Confluent docs.
### What tools exist for cross-cluster replication
- MirrorMaker from Apache Kafka.
- Replicator from Confluent.
- A simple self-made "consume-produce in a loop" application.
These tools provide either at-most-once or at-least-once delivery.
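The at-least-once behavior of the naive loop is easy to see in a toy model (plain Java, no Kafka involved; all names here are illustrative, not any tool's real code): if the process crashes after producing a record but before committing its offset, the restarted loop replays that record.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a "consume-produce in a loop" replicator: the offset is
// committed only after a record is produced, so a crash between produce
// and commit makes the restarted loop re-produce that record.
public class AtLeastOnceDemo {

    static List<String> replicateWithCrash() {
        List<String> source = List.of("a", "b", "c", "d");
        List<String> destination = new ArrayList<>();
        long committedOffset = 0;

        // First run: crashes right after producing offset 2, before committing it.
        for (long offset = committedOffset; offset < source.size(); offset++) {
            destination.add(source.get((int) offset)); // produce
            if (offset == 2) {
                break; // simulated crash: record produced, offset not committed
            }
            committedOffset = offset + 1; // commit
        }

        // Restart: resumes from the last committed offset (2), re-producing "c".
        for (long offset = committedOffset; offset < source.size(); offset++) {
            destination.add(source.get((int) offset));
            committedOffset = offset + 1;
        }
        return destination;
    }

    public static void main(String[] args) {
        System.out.println(replicateWithCrash()); // [a, b, c, c, d]
    }
}
```

Committing the offset *before* producing gives the opposite failure mode, at-most-once: a crash between commit and produce loses the record instead of duplicating it.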
### What about exactly-once
Apache Kafka has a transactional API, which can be used for exactly-once delivery. The fundamental idea is to commit the consumer offset and the producer records in a single transaction. Kafka Streams uses it to provide a high-level abstraction and easy access to exactly-once benefits: it can be enabled with a single config option, literally without code changes.

However, this works only within the same Kafka cluster.
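For reference, the single config option in question is the `processing.guarantee` setting of a Kafka Streams application (sketch of an application config; the `exactly_once_v2` value requires brokers 2.5+, while older Streams versions use the value `exactly_once`):

```properties
# Kafka Streams application config: one option switches delivery
# from the default at-least-once to exactly-once.
processing.guarantee=exactly_once_v2
```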
### What's not enough

The replication tools listed above are not compatible with exactly-once delivery: in that case the consumer offsets and the producer records live in different clusters, and Apache Kafka cannot wrap operations on different clusters in one transaction. There is a Kafka Improvement Proposal to make it possible, and good reading about it, but it has existed since 2020 and as of 2023 none of it is implemented.
## The solution

### Theory
- Replicate messages to the destination cluster with an at-least-once guarantee. Wrap the messages with some metadata and apply repartitioning.
- Apply deduplication, unwrap the messages and restore the initial partitioning, using exactly-once delivery within the destination cluster.

As a drawback, this requires roughly twice as much processing as usual at-least-once replication.
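The deduplication step can be illustrated with a minimal sketch (the class and field names are hypothetical, not the project's actual code), assuming each wrapped message carries its source partition and offset: a record is forwarded only if its offset is strictly newer than the last one seen for that partition.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of offset-based deduplication: a record survives only if its
// source offset is strictly newer than the last offset seen for its partition.
public class Deduplicator {
    private final Map<Integer, Long> lastSeenOffset = new HashMap<>();

    /** Returns true if the record should be forwarded, false if it is a duplicate. */
    public boolean accept(int sourcePartition, long sourceOffset) {
        long last = lastSeenOffset.getOrDefault(sourcePartition, -1L);
        if (sourceOffset <= last) {
            return false; // already replicated: drop the duplicate
        }
        lastSeenOffset.put(sourcePartition, sourceOffset);
        return true;
    }
}
```

In the real pipeline this state must be updated in the destination cluster atomically with the produced records, which is exactly what the transactional exactly-once delivery within the destination cluster provides.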
### Design schema

This is a screenshot of design-schema.drawio from the project root.
## Launch options

### Docker run

Set your bootstrap servers and topic names for both clusters and run:
```shell
docker run \
    -e KAFKA_CLUSTERS_SOURCE_BOOTSTRAP_SERVERS=source-kafka-cluster:9092 \
    -e KAFKA_CLUSTERS_SOURCE_TOPIC=source-topic \
    -e KAFKA_CLUSTERS_DESTINATION_BOOTSTRAP_SERVERS=destination-kafka-cluster:9092 \
    -e KAFKA_CLUSTERS_DESTINATION_TOPIC=destination-topic \
    emitskevich/kafka-reo
```
Or build it from sources first:
```shell
./gradlew check installDist
docker build . -t emitskevich/kafka-reo --build-arg MODULE=replicator
docker run \
    -e KAFKA_CLUSTERS_SOURCE_BOOTSTRAP_SERVERS=source-kafka-cluster:9092 \
    -e KAFKA_CLUSTERS_SOURCE_TOPIC=source-topic \
    -e KAFKA_CLUSTERS_DESTINATION_BOOTSTRAP_SERVERS=destination-kafka-cluster:9092 \
    -e KAFKA_CLUSTERS_DESTINATION_TOPIC=destination-topic \
    emitskevich/kafka-reo
```
### Docker compose

Replace the env vars in docker-compose.yml, then run:

```shell
docker-compose up
```
Or build it from sources first:

```shell
./gradlew check installDist
docker-compose build
docker-compose up
```
### Kubernetes

Replace the env vars in k8s-deployment.yml, then run:

```shell
kubectl apply -f k8s-deployment.yml
```
## Best practices

Launch the replicator as close to the destination cluster as possible. This gives a notable performance boost, since the deduplication step uses the transactional API of the destination cluster and is latency-sensitive.