marcintustin / docker-kafka-connect

Docker Image for kafka-connect

Home Page:https://hub.docker.com/r/1ambda/kafka-connect/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Status: Development

TODO:

  1. Restore building of extras in container

  2. Write up how this was developed

  3. Create auto build in docker hub

  4. Add to extension section

  5. Add contributing section

  6. Add credit for 1ambda

docker-kafka-connect

Dockerized Apache Kafka Connect (distributed mode) , with additional dependencies added to allow using the hdfs connector from Confluent

Why

The docker image from confluent does not work reliably, at least for me. It mysteriously just falls over without error messages in my testing.

What they don't tell you about Kafka Connect

Kafka Connect is pretty well documented, and Confluent have done a good job of producing software that puts a lot of meat on the bones.

Unfortunately, the Confluent docs are far from comprehensive, or even accurate. In addition there's something very, very important they don't tell you: Out of the box, you can't mix schemaful and schemaless formats.

If you want to output data in a schemaful format (avro, parquet), you need to put data into kafka in one of two schemaful formats: avro, or the special "wrapperised" json format of {"schema": {...schema obj...}, "payload": {...payload obj}}. The latter format also doesn't support avro records, so you really only have one and half formats available.

Out of the box, if you want to send arbitrary (but completely uniform) JSON into kafka, you have to write out again as lines of json objects.

What this image gives you

This images gives you three things:

  1. A nicely working Confluent-enhanced kafka connect, demonstrating json to json, and avro to avro functionality
  2. Plain kafka connect
  3. My own JsonAvro converter class, which allows you to smoothly transition from a JSON-based workflow to a schemaful workflow. (Pending)

The secret sauce

Dependencies have been added to the POM file in the order they are needed by execution. This prevents the various dependency jars from shadowing each others classes.

Supported Tags

Testing

Required dependencies:

  • docker-compose
  • avro-tools
  • bash
  • curl
bash tests/test.sh 1.1.0

Quick Start

with Docker Compose

See docker-compose.yml

Environment Variables

Pass env variables starting with CONNECT_ to configure connect-distributed.properties.
For example, If you want to set offset.flush.interval.ms=15000, use CONNECT_OFFSET_FLUSH_INTERVAL_MS=15000

  • (required) CONNECT_BOOTSTRAP_SERVERS
  • (recommended): CONNECT_GROUP_ID (default value: connect-cluster)
  • (recommended) CONNECT_REST_ADVERTISED_HOST_NAME
  • (recommended) CONNECT_REST_ADVERTISED_PORT

Other connect configuration fields are optional. (see also Kafka Connect Configs)

How To Extend This Image

Fork the repository, and add additional depedencies to pom.xml. These will be compiled into an uberjar placed on the classpath

Development

  • SCALA_VERSION: 2.11
  • KAFKA_VERSION: 0.10.0.0
  • KAFKA_HOME: /opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION}
  • CONNECT_CFG: ${KAFKA_HOME}/config/connect-distributed.properties
  • CONNECT_BIN: ${KAFKA_HOME}/bin/connect-distributed.sh
  • CONNECT_PORT: 8083 (exposed)
  • JMX_PORT: 9999 (exposed)

License

Apache 2.0

About

Docker Image for kafka-connect

https://hub.docker.com/r/1ambda/kafka-connect/

License:Apache License 2.0


Languages

Language:Shell 100.0%