
kafka-on-kubernetes

prerequisites

setup

tl;dr: ./scripts/up.sh

namespace

kubectl create namespace kafka --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace opensearch --dry-run=client -o yaml | kubectl apply -f -

kafka

install Strimzi

kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

deploy the kafka cluster

kubectl apply -f kafka/values.yaml -n kafka

wait for the cluster to be ready

kubectl wait kafka/my-kafka-cluster --for=condition=Ready --timeout=300s -n kafka

install Kafka-UI

helm repo add kafka-ui https://provectus.github.io/kafka-ui-charts
helm upgrade --install my-kafka-ui kafka-ui/kafka-ui --namespace kafka -f kafka-ui/values.yaml
kubectl port-forward svc/my-kafka-ui -n kafka 8080:80

visit the Kafka-UI at http://localhost:8080

opensearch

follow the OpenSearch guide to deploy the opensearch service

helm repo add opensearch https://opensearch-project.github.io/helm-charts/
helm repo update
helm upgrade --install my-opensearch opensearch/opensearch --namespace opensearch -f opensearch/values.yaml
helm upgrade --install my-opensearch-dashboards opensearch/opensearch-dashboards --namespace opensearch -f opensearch-dashboards/values.yaml

port-forward the opensearch dashboard service

kubectl port-forward svc/my-opensearch-dashboards -n opensearch 5601

and visit the opensearch dashboard at http://localhost:5601 with the following credentials:

username: admin
password: admin

verify the opensearch service by testing basic operations on the opensearch dashboard

operations

topics

create a topic

attention: the replication-factor must be <= the number of kafka brokers

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --create --topic my-first-topic --partitions 1 --replication-factor 1

list topics

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --list

describe a topic

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --describe --topic my-first-topic

delete a topic

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --delete --topic my-first-topic
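the same topic operations can also be scripted with the Kafka AdminClient from Java. a minimal sketch, assuming kafka-clients 3.x and that the code runs inside the cluster where the my-kafka-cluster-kafka-bootstrap:9092 address above is resolvable:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicOps {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // same in-cluster bootstrap address as the CLI examples above
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // create: replication factor must be <= the number of brokers
            admin.createTopics(List.of(new NewTopic("my-first-topic", 1, (short) 1))).all().get();
            // list
            admin.listTopics().names().get().forEach(System.out::println);
            // describe
            System.out.println(admin.describeTopics(List.of("my-first-topic")).allTopicNames().get());
            // delete
            admin.deleteTopics(List.of("my-first-topic")).all().get();
        }
    }
}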

messages

send some messages

kubectl -n kafka run kafka-producer -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-producer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-topic --property parse.key=true --property key.separator=:

receive some messages

kubectl -n kafka run kafka-consumer -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-topic --from-beginning --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true

consumer groups

create the topic with multiple partitions

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --create --topic my-first-consumer-group-topic --partitions 3 --replication-factor 1

create the consumer group

kubectl -n kafka run kafka-consumer-group-0 -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-consumer-group-topic --group my-first-consumer-group --from-beginning
kubectl -n kafka run kafka-consumer-group-1 -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-consumer-group-topic --group my-first-consumer-group --from-beginning

send some messages

kubectl -n kafka run kafka-producer -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-producer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-consumer-group-topic

attention: consumers in the same group share the messages of a topic (each message is delivered to only one consumer in the group), while consumers in different groups attached to the same topic each receive every message
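the same behavior can be observed from Java: consumers that subscribe with the same group.id split the topic's partitions among themselves, while a consumer with a different group.id gets its own copy of every message. a minimal sketch, assuming the bootstrap address and topic used above:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerGroupDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // run this program twice with the same group.id: the 3 partitions are split
        // between the two instances. change the group.id and it re-reads everything.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-first-consumer-group");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-first-consumer-group-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}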

list all consumer groups

kubectl -n kafka run kafka-consumer-group-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-consumer-groups.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --list

delete a consumer group

kubectl -n kafka run kafka-consumer-group-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-consumer-groups.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --delete --group my-first-consumer-group

reset the offsets of a consumer group to replay the topic

kubectl -n kafka run kafka-consumer-group-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-consumer-groups.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --reset-offsets --to-earliest --group consumer-opensearch-demo --all-topics --execute

kafka-programming

attention: follow the local_dev doc to set up the prerequisites

create topics

kubectl apply -f kafka/topics.yaml -n kafka

java

  • the StickyPartitioner improves the performance of batch producing; see ProducerDemoWithCallback.java
  • messages with the same key are sent to the same partition; see ProducerDemoKey.java
  • consumer groups and partition rebalance
    • moving partitions between consumers is called rebalance
    • eager rebalance: all consumers stop and rejoin
    • cooperative rebalance (incremental rebalance): reassign a small subset of the partitions
  • auto offset commit
    • with enable.auto.commit=true, .commitAsync() is effectively called for you at auto.commit.interval.ms intervals during .poll() calls
  • kafka topic availability
    • acks=all (-1) together with min.insync.replicas=2 is the most popular option for data durability and availability; with a replication factor of 3, it allows you to withstand the loss of at most one kafka broker
  • idempotent producer
    • retries won't introduce duplicates on network errors
  • kafka v3.0+ producer is safe by default (see the producer sketch after this list)
    • acks=-1
    • enable.idempotence=true
    • max.in.flight.requests.per.connection=5
    • retries=2147483647
  • compression
    • message compression at the producer level
      • Cloudflare benchmarks
      • pros
        • smaller request size
        • low latency
        • better throughput
        • better disk utilization in Kafka
      • cons (minor)
        • producers must commit some CPU cycles to compression
        • consumers must commit some CPU cycles to decompression
      • always use compression at the producer level
    • message compression at the broker/topic level
      • compression.type=producer
  • message batching
    • linger.ms: the time in milliseconds to wait before sending a batch of messages
    • batch.size: the maximum size in bytes of a batch (default 16 KB)
  • delivery semantics (see the consumer sketch after this list)
    • at most once: offsets are committed as soon as the message is received. If the processing goes wrong, the message is lost (it won't be read again)
    • at least once (preferred): offsets are committed after the message is processed. If the processing goes wrong, the message will be read again. This can result in duplicate processing of messages, so make sure your processing is idempotent.
    • exactly once: can be achieved for Kafka => Kafka workflows using the Transactional API (easy with the Kafka Streams API). For Kafka => sink workflows, use an idempotent consumer.
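a minimal producer sketch that puts the settings above together: safe defaults, keyed messages, producer-level compression, and batching. the property values follow the bullets above; the topic name reuses my-first-topic from the operations section, and the compression type and batching values are illustrative choices, not repo settings:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SafeProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // safe producer settings (defaults since kafka v3.0, set explicitly here)
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        // producer-level compression and batching (illustrative values)
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20"); // wait up to 20 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024)); // 32 KB batches
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // records with the same key always land in the same partition
            producer.send(new ProducerRecord<>("my-first-topic", "user-1", "hello"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                    else System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
                });
            producer.flush();
        }
    }
}

and an at-least-once consumer sketch: auto-commit is disabled and offsets are committed only after the records are processed, so a crash mid-processing leads to re-reads (duplicates) rather than data loss, hence the note above about idempotent processing. the group id is hypothetical:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "at-least-once-demo");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-first-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    process(r); // make this idempotent: duplicates are possible on failure
                }
                consumer.commitSync(); // commit only after processing succeeded
            }
        }
    }

    private static void process(ConsumerRecord<String, String> r) {
        System.out.println(r.value());
    }
}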

kafka connect

Kafka Connect makes it easy to stream data from numerous sources into Kafka and from Kafka into numerous sinks, with hundreds of available connectors.

  • configurable data pipelines
  • integrates external systems with kafka
  • supported by the strimzi operator

kafka streams

Data processing and transformation library within Kafka.

  • Java API
  • exactly-once capabilities
  • one record at a time (no batching)
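a minimal Kafka Streams topology sketch: one record at a time, with exactly-once processing enabled. the application id and the input/output topic names are hypothetical:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // exactly-once processing for Kafka => Kafka workflows
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        // transform each record as it arrives (no batching)
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}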

best practices

partition count and replication factor

  • partition
    • small cluster (< 6 brokers): 3 x the number of brokers (e.g. 3 brokers => 9 partitions per topic)
    • big cluster (> 12 brokers): 2 x the number of brokers
    • more partitions mean more leader elections for Zookeeper to perform
  • replication
    • at least 2, usually 3, maximum 4
    • a higher replication factor means
      • better durability
      • higher availability
      • but more latency
      • but more disk space
  • cluster
    • with Zookeeper
      • max 200,000 partitions - Zookeeper scaling limit
        • 4,000 partitions per broker
    • with Kraft
      • potential for millions of partitions

topics configuration

  • topics are made of partitions, and partitions are made of segments.
  • log cleanup policies
    • delete
      • by time
      • by size
    • compact
      • log compaction
        • keeps the most recent value for each key

a common topic naming convention: <message type>.<dataset name>.<data name>.<data format>
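a compacted topic with these settings can be created via the AdminClient. a minimal sketch, with a hypothetical topic name following the convention above:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // hypothetical name following <message type>.<dataset name>.<data name>.<data format>
            NewTopic topic = new NewTopic("queuing.users.user-profile.json", 3, (short) 1)
                // log compaction keeps the most recent value for each key
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}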

cleanup

tl;dr: ./scripts/down.sh

kubectl delete -f kafka/ -n kafka
kubectl delete -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
helm uninstall my-kafka-ui -n kafka
helm uninstall my-opensearch -n opensearch
helm uninstall my-opensearch-dashboards -n opensearch
kubectl delete namespace kafka
kubectl delete namespace opensearch
