Kafka Snow White
Mirror, mirror on the wall, who is the fairest of them all?
A Kafka mirroring service based on Akka Streams Kafka. The service can be used to move messages between topics on different Kafka clusters.
Getting Kafka Snow White
Kafka Snow White is available as executable JAR files in multiple variations (see below), these can be downloaded from Maven Central:
Pick the xxx-assembly.jar
files where xxx
stands for the app name and version.
The code is also available as a library and can be used via SBT with:
// If you want to run the service from code
libraryDependencies += "com.supersonic" %% "kafka-snow-white-consul-app" % "1.5.2"
libraryDependencies += "com.supersonic" %% "kafka-snow-white-file-watcher" % "1.5.2"
// If you want only the basic mirroring functionality with no backing mechanism
libraryDependencies += "com.supersonic" %% "kafka-snow-white-core" % "1.5.2"
// If you want to create a new backend for mirroring
libraryDependencies += "com.supersonic" %% "kafka-snow-white-app-common" % "1.5.2"
Motivation
Given two different Kafka clusters (cluster1
and cluster2
) we want to move messages from the topic some-topic
on cluster1
to some-topic
on cluster2
.
The Kafka Snow White service lets one define a mirror, that does exactly this (see setup details below). Given the appropriate mirror, if we start the Kafka Snow White service with it, it will connect to cluster1
consume all messages from some-topic
and then produce the messages to some-topic
on cluster2
.
Setup
The Kafka Snow White service is available in two flavors which differ in the way that they are configured. The first takes configuration from files on the file-system and each mirror corresponds to a single file. The second, takes configuration from Consul keys in Consul's KV-store, and a mirror corresponds to a single key. For setup instructions for each flavor, see the README files under file-watcher-app and consul-app.
Mirror Configuration Format
The configuration schema (using the HOCON format) for both flavors of the Kafka Snow White service is the same. The example mirror that was described in the motivation section above can be set up as follows:
consumer {
kafka-clients {
bootstrap.servers = "cluster1:9092"
group.id = "some-group"
auto.offset.reset = "earliest"
}
}
producer {
kafka-clients = {
bootstrap.servers = "cluster2:9092"
}
}
mirror {
whitelist = ["some-topic"]
commitBatchSize = 1000
commitParallelism = 4
}
A mirror specification has three sections:
consumer
- setting up the consumer that will be used by the mirror (this corresponds to the source cluster)producer
- setting up the producer that will be used by the mirror (this corresponds to the target cluster)mirror
- settings for the mirror itself
Since the mirror is backed by the Akka Streams Kafka library, the consumer
and producer
settings are passed directly to it. The full specification of the available settings can be found in the Akka Streams Kafka documentation for the consumer and the producer.
The mirror settings are defined as follows:
mirror {
# The list of topics the mirror listens to (mandatory).
# Unless overridden in the `topicsToRename` field each topic in this list
# will be mirrored in the target cluster.
whitelist = ['some-topic-1', 'some-topic-2']
# How many messages to batch before committing their offsets to Kafka (optional, default 1000).
commitBatchSize = 1000
# The parallelism level used to commit the offsets (optional, default 4).
commitParallelism = 4
# Settings to enable bucketing of mirrored values (optional).
# The (mirrorBuckets / totalBuckets) ratio is the percentage of traffic to be mirrored.
bucketing {
# The number of buckets that should be mirrored (mandatory, no default).
mirrorBuckets = 4
# The total number of buckets, used to calculate the percentage of traffic
# to mirror (mandatory, no default).
totalBuckets = 16
}
# Whether the mirror should be enabled or not (optional, default 'true').
enabled = true
# Whether the partition number for the mirrored messages should be generated by hashing
# the message key, excluding 'null's (optional, default 'false'). When set to false, the partition number
# will be randomly generated.
partitionFromKeys = false
# Map of source to target topics to rename when mirroring messages to the
# target cluster (optional, default empty).
topicsToRename {
some-topic-2 = "renamed-topic"
}
}
Topic Renaming
By default the topic whitelist
defines topics that should be mirrored from the source cluster in the target cluster. This means that each topic in the list will be recreated with the same name in the target cluster. It is possible to override this behavior by providing the topicsToRename
map. In the example above, the whitelist
contains two topics (some-topic-1
and some-topic-2
) and the topicsToRename
map contains a single entry that maps some-topic-2
to renamed-topic
. When the mirror will be run, the contents of some-topic-2
in the source cluster will be mirrored in the target cluster in a new topic called renamed-topic
. While the some-topic-1
topic will be mirrored in the target cluster under the same name (since it does not have an entry in the map).
Bucketing
It is possible to (probabilistically) mirror only some percentage of the messages in a topic, to do this we need to specify the bucketing
settings. As in the example above, we set mirrorBuckets
to 4
and totalBuckets
to 16
, this means that only 4 / 16 = 25%
of the messages will be mirrored by the mirror.
The selection of which messages to mirror is based on the key of the incoming message, meaning that if a key was once mirrored it will always be mirrored when the same key is encountered again. This can be useful when mirroring traffic between production and staging/testing environments (assuming that the key is not null
and has some logical info in it).
Note: this implies that if the keys are all the same then all messages will be either mirrored or not depending on the key's value.
Bucketing is not supported for null
keys, all such messages will be mirrored independent of the bucketing settings.
The Healthcheck Server
Upon startup, the Kafka Snow White service starts a healthcheck server (by default on port 8080
). The healthcheck data can be accessed via:
localhost:8080/healthcheck
The route contains information about the running service including data about the currently running mirrors. This can be useful to troubleshoot the settings of a mirror when trying to set it up for the first time. Note, mirrors that failed to start or are disabled are currently not displayed.
The default port can be overridden by setting the KAFKA_SNOW_WHITE_HEALTHCHECK_PORT
environment variable. For example:
export KAFKA_SNOW_WHITE_HEALTHCHECK_PORT=8081
# ... start the Kafka Snow White service
Or, in a configuration file:
kafka-mirror-settings {
port = 8081
}
Compared to Other Solutions
There are other mirroring solutions out there, notably Apache Kafka's own MirrorMaker and Uber's uReplicator.
Compared to MirrorMaker, Kafka Snow White should be easier to manage:
- The configuration can be reloaded dynamically via files or Consul
- Each mirror is defined independently with no need to restart or affect other mirrors when a given mirror is changed
uReplicator is a more featureful solution but also requires a more complicated setup (e.g., a Helix controller). Compared to it Kafka Snow White can be deployed as a single self-contained JAR file.