## What is it

Kafka Replicator Exactly-Once is a tool that replicates data between different Apache Kafka clusters with an exactly-once guarantee.

See the launch options below for a quick start.
## The problem

### Why replication between clusters is needed
- Making a copy of data for disaster recovery purposes.
- Gathering data from different regions to the central one for aggregation.
- Data sharing between organizations.
- ...
More details in the Confluent docs.
### What tools exist for cross-cluster replication
- MirrorMaker from Apache Kafka.
- Replicator from Confluent.
- A simple self-made "consume-produce in a loop" application.
These tools provide either at-most-once or at-least-once delivery.
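The at-least-once behavior of the naive loop is easy to see in a toy model (plain Java, no Kafka involved; all names here are illustrative, not any tool's real code): if the process crashes after producing a record but before committing its offset, the restarted loop replays that record.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a "consume-produce in a loop" replicator: the offset is
// committed only after a record is produced, so a crash between produce
// and commit makes the restarted loop re-produce that record.
public class AtLeastOnceDemo {

    static List<String> replicateWithCrash() {
        List<String> source = List.of("a", "b", "c", "d");
        List<String> destination = new ArrayList<>();
        long committedOffset = 0;

        // First run: crashes right after producing offset 2, before committing it.
        for (long offset = committedOffset; offset < source.size(); offset++) {
            destination.add(source.get((int) offset)); // produce
            if (offset == 2) {
                break; // simulated crash: record produced, offset not committed
            }
            committedOffset = offset + 1; // commit
        }

        // Restart: resumes from the last committed offset (2), re-producing "c".
        for (long offset = committedOffset; offset < source.size(); offset++) {
            destination.add(source.get((int) offset));
            committedOffset = offset + 1;
        }
        return destination;
    }

    public static void main(String[] args) {
        System.out.println(replicateWithCrash()); // [a, b, c, c, d]
    }
}
```

Committing the offset *before* producing gives the opposite failure mode, at-most-once: a crash between commit and produce loses the record instead of duplicating it.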
### What about exactly-once
Apache Kafka has a transactional API, which can be used for exactly-once delivery. The fundamental idea is to commit the consumer offset and the producer records in a single transaction. Kafka Streams uses it to provide a high-level abstraction and easy access to exactly-once benefits: it can be enabled with a single config option, literally without code changes.

However, this works only within the same Kafka cluster.
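For reference, the single config option in question is the `processing.guarantee` setting of a Kafka Streams application (sketch of an application config; the `exactly_once_v2` value requires brokers 2.5+, while older Streams versions use the value `exactly_once`):

```properties
# Kafka Streams application config: one option switches delivery
# from the default at-least-once to exactly-once.
processing.guarantee=exactly_once_v2
```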
### What's not enough

The replication tools listed above are not compatible with exactly-once delivery: in that case the consumer offsets and the producer records live in different clusters, and Apache Kafka cannot wrap operations on different clusters in one transaction. There is a Kafka Improvement Proposal to make it possible, and good reading about it, but it has existed since 2020 and as of 2023 none of it is implemented.
## The solution

### Theory
- Replicate messages to the destination cluster with an at-least-once guarantee. Wrap the messages with some metadata and apply repartitioning.
- Apply deduplication, unwrap the messages and restore the initial partitioning, using exactly-once delivery within the destination cluster.

As a drawback, this requires roughly twice as much processing as usual at-least-once replication.
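The deduplication step can be illustrated with a minimal sketch (the class and field names are hypothetical, not the project's actual code), assuming each wrapped message carries its source partition and offset: a record is forwarded only if its offset is strictly newer than the last one seen for that partition.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of offset-based deduplication: a record survives only if its
// source offset is strictly newer than the last offset seen for its partition.
public class Deduplicator {
    private final Map<Integer, Long> lastSeenOffset = new HashMap<>();

    /** Returns true if the record should be forwarded, false if it is a duplicate. */
    public boolean accept(int sourcePartition, long sourceOffset) {
        long last = lastSeenOffset.getOrDefault(sourcePartition, -1L);
        if (sourceOffset <= last) {
            return false; // already replicated: drop the duplicate
        }
        lastSeenOffset.put(sourcePartition, sourceOffset);
        return true;
    }
}
```

In the real pipeline this state must be updated in the destination cluster atomically with the produced records, which is exactly what the transactional exactly-once delivery within the destination cluster provides.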
### Design schema

This is a screenshot of design-schema.drawio from the project root.
## Launch options

### Docker run

Set your bootstrap servers and topic names for both clusters and run:
```shell
docker run \
    -e KAFKA_CLUSTERS_SOURCE_BOOTSTRAP_SERVERS=source-kafka-cluster:9092 \
    -e KAFKA_CLUSTERS_SOURCE_TOPIC=source-topic \
    -e KAFKA_CLUSTERS_DESTINATION_BOOTSTRAP_SERVERS=destination-kafka-cluster:9092 \
    -e KAFKA_CLUSTERS_DESTINATION_TOPIC=destination-topic \
    emitskevich/kafka-reo
```
Or build it from sources first:
```shell
./gradlew check installDist
docker build . -t emitskevich/kafka-reo --build-arg MODULE=replicator
docker run \
    -e KAFKA_CLUSTERS_SOURCE_BOOTSTRAP_SERVERS=source-kafka-cluster:9092 \
    -e KAFKA_CLUSTERS_SOURCE_TOPIC=source-topic \
    -e KAFKA_CLUSTERS_DESTINATION_BOOTSTRAP_SERVERS=destination-kafka-cluster:9092 \
    -e KAFKA_CLUSTERS_DESTINATION_TOPIC=destination-topic \
    emitskevich/kafka-reo
```
### Docker compose

Replace the env vars in docker-compose.yml, then run:

```shell
docker-compose up
```
Or build it from sources first:

```shell
./gradlew check installDist
docker-compose build
docker-compose up
```
### Kubernetes

Replace the env vars in k8s-deployment.yml, then run:

```shell
kubectl apply -f k8s-deployment.yml
```
## Best practices

Launch the replicator as close to the destination cluster as possible. This gives a notable performance boost, since the deduplication step uses the transactional API of the destination cluster and is latency-sensitive.