Status: Development
TODO:
-
Restore building of extras in container
-
Write up how this was developed
-
Create auto build in docker hub
-
Add to extension section
-
Add contributing section
-
Add credit for 1ambda
docker-kafka-connect
Dockerized Apache Kafka Connect (distributed mode) , with additional dependencies added to allow using the hdfs connector from Confluent
Why
The docker image from confluent does not work reliably, at least for me. It mysteriously just falls over without error messages in my testing.
What they don't tell you about Kafka Connect
Kafka Connect is pretty well documented, and Confluent have done a good job of producing software that puts a lot of meat on the bones.
Unfortunately, the Confluent docs are far from comprehensive, or even accurate. In addition there's something very, very important they don't tell you: Out of the box, you can't mix schemaful and schemaless formats.
If you want to output data in a schemaful format (avro, parquet), you need to put data into kafka in one of two schemaful formats: avro, or the special "wrapperised" json format of {"schema": {...schema obj...}, "payload": {...payload obj}}
. The latter format also doesn't support avro records, so you really only have one and half formats available.
Out of the box, if you want to send arbitrary (but completely uniform) JSON into kafka, you have to write out again as lines of json objects.
What this image gives you
This images gives you three things:
- A nicely working Confluent-enhanced kafka connect, demonstrating json to json, and avro to avro functionality
- Plain kafka connect
- My own JsonAvro converter class, which allows you to smoothly transition from a JSON-based workflow to a schemaful workflow. (Pending)
The secret sauce
Dependencies have been added to the POM file in the order they are needed by execution. This prevents the various dependency jars from shadowing each others classes.
Supported Tags
- latest
1.1.0
(2.11) (1.1.0/Dockerfile)
Testing
Required dependencies:
- docker-compose
- avro-tools
- bash
- curl
bash tests/test.sh 1.1.0
Quick Start
with Docker Compose
Environment Variables
Pass env variables starting with CONNECT_
to configure connect-distributed.properties
.
For example, If you want to set offset.flush.interval.ms=15000
, use CONNECT_OFFSET_FLUSH_INTERVAL_MS=15000
- (required)
CONNECT_BOOTSTRAP_SERVERS
- (recommended):
CONNECT_GROUP_ID
(default value:connect-cluster
) - (recommended)
CONNECT_REST_ADVERTISED_HOST_NAME
- (recommended)
CONNECT_REST_ADVERTISED_PORT
Other connect configuration fields are optional. (see also Kafka Connect Configs)
How To Extend This Image
Fork the repository, and add additional depedencies to
pom.xml
. These will be compiled into an uberjar placed on the classpath
Development
- SCALA_VERSION:
2.11
- KAFKA_VERSION:
0.10.0.0
- KAFKA_HOME:
/opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION}
- CONNECT_CFG:
${KAFKA_HOME}/config/connect-distributed.properties
- CONNECT_BIN:
${KAFKA_HOME}/bin/connect-distributed.sh
- CONNECT_PORT:
8083
(exposed) - JMX_PORT:
9999
(exposed)
License
Apache 2.0