
kafka-on-kubernetes

prerequisites

setup

tl;dr: ./scripts/up.sh

namespace

kubectl create namespace kafka --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace opensearch --dry-run=client -o yaml | kubectl apply -f -

kafka

install Strimzi

kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

deploy the kafka cluster

kubectl apply -f kafka/values.yaml -n kafka

wait for the cluster to be ready

kubectl wait kafka/my-kafka-cluster --for=condition=Ready --timeout=300s -n kafka

install Kafka-UI

helm repo add kafka-ui https://provectus.github.io/kafka-ui-charts
helm upgrade --install my-kafka-ui kafka-ui/kafka-ui --namespace kafka -f kafka-ui/values.yaml
kubectl port-forward svc/my-kafka-ui -n kafka 8080:80

visit the Kafka-UI at http://localhost:8080

opensearch

follow the OpenSearch guide to deploy the opensearch service

helm repo add opensearch https://opensearch-project.github.io/helm-charts/
helm repo update
helm upgrade --install my-opensearch opensearch/opensearch --namespace opensearch -f opensearch/values.yaml
helm upgrade --install my-opensearch-dashboards opensearch/opensearch-dashboards --namespace opensearch -f opensearch-dashboards/values.yaml

port-forward the opensearch dashboard service

kubectl port-forward svc/my-opensearch-dashboards -n opensearch 5601

and visit the opensearch dashboard at http://localhost:5601 with the following credentials:

username: admin
password: admin

verify the opensearch service by testing basic operations on the opensearch dashboard

operations

topics

create a topic

attention: the replication-factor must be <= the number of kafka brokers

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --create --topic my-first-topic --partitions 1 --replication-factor 1

list topics

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --list

describe a topic

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --describe --topic my-first-topic

delete a topic

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --delete --topic my-first-topic
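the same topic operations can also be scripted with the Kafka AdminClient from Java. a minimal sketch, assuming kafka-clients 3.x and that the code runs inside the cluster where the my-kafka-cluster-kafka-bootstrap:9092 address above is resolvable:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicOps {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // same in-cluster bootstrap address as the CLI examples above
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // create: replication factor must be <= the number of brokers
            admin.createTopics(List.of(new NewTopic("my-first-topic", 1, (short) 1))).all().get();
            // list
            admin.listTopics().names().get().forEach(System.out::println);
            // describe
            System.out.println(admin.describeTopics(List.of("my-first-topic")).allTopicNames().get());
            // delete
            admin.deleteTopics(List.of("my-first-topic")).all().get();
        }
    }
}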

messages

send some messages

kubectl -n kafka run kafka-producer -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-producer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-topic --property parse.key=true --property key.separator=:

receive some messages

kubectl -n kafka run kafka-consumer -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-topic --from-beginning --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true

consumer groups

create the topic with multiple partitions

kubectl -n kafka run kafka-topic-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --create --topic my-first-consumer-group-topic --partitions 3 --replication-factor 1

create the consumer group

kubectl -n kafka run kafka-consumer-group-0 -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-consumer-group-topic --group my-first-consumer-group --from-beginning
kubectl -n kafka run kafka-consumer-group-1 -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-consumer-group-topic --group my-first-consumer-group --from-beginning

send some messages

kubectl -n kafka run kafka-producer -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-producer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-first-consumer-group-topic

attention: consumers in the same group share the messages of a topic (each message is delivered to only one consumer in the group), while consumers in different groups attached to the same topic each receive every message
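the same behavior can be observed from Java: consumers that subscribe with the same group.id split the topic's partitions among themselves, while a consumer with a different group.id gets its own copy of every message. a minimal sketch, assuming the bootstrap address and topic used above:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerGroupDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // run this program twice with the same group.id: the 3 partitions are split
        // between the two instances. change the group.id and it re-reads everything.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-first-consumer-group");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-first-consumer-group-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}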

list all consumer groups

kubectl -n kafka run kafka-consumer-group-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-consumer-groups.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --list

delete a consumer group

kubectl -n kafka run kafka-consumer-group-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-consumer-groups.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --delete --group my-first-consumer-group

reset the offsets of a consumer group to replay the topic

kubectl -n kafka run kafka-consumer-group-operator -ti --image=quay.io/strimzi/kafka:0.30.0-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-consumer-groups.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --reset-offsets --to-earliest --group consumer-opensearch-demo --all-topics --execute

kafka-programming

attention: follow the local_dev doc to set up the prerequisites

create topics

kubectl apply -f kafka/topics.yaml -n kafka

java

  • the StickyPartitioner improves the performance of batch producing; see ProducerDemoWithCallback.java
  • messages with the same key are sent to the same partition; see ProducerDemoKey.java
  • consumer groups and partition rebalance
    • moving partitions between consumers is called rebalance
    • eager rebalance: all consumers stop and rejoin
    • cooperative rebalance (incremental rebalance): reassign a small subset of the partitions
  • auto offset commit
    • with enable.auto.commit=true, .commitAsync() is effectively called for you at auto.commit.interval.ms intervals during .poll() calls
  • kafka topic availability
    • acks=all (-1) together with min.insync.replicas=2 is the most popular option for data durability and availability; with a replication factor of 3, it allows you to withstand the loss of at most one kafka broker
  • idempotent producer
    • retries won't introduce duplicates on network errors
  • kafka v3.0+ producer is safe by default (see the producer sketch after this list)
    • acks=-1
    • enable.idempotence=true
    • max.in.flight.requests.per.connection=5
    • retries=2147483647
  • compression
    • message compression at the producer level
      • Cloudflare benchmarks
      • pros
        • smaller request size
        • low latency
        • better throughput
        • better disk utilization in Kafka
      • cons (minor)
        • producers must commit some CPU cycles to compression
        • consumers must commit some CPU cycles to decompression
      • always use compression at the producer level
    • message compression at the broker/topic level
      • compression.type=producer
  • message batching
    • linger.ms: the time in milliseconds to wait before sending a batch of messages
    • batch.size: the maximum size in bytes of a batch (default 16 KB)
  • delivery semantics (see the consumer sketch after this list)
    • at most once: offsets are committed as soon as the message is received. If the processing goes wrong, the message is lost (it won't be read again)
    • at least once (preferred): offsets are committed after the message is processed. If the processing goes wrong, the message will be read again. This can result in duplicate processing of messages, so make sure your processing is idempotent.
    • exactly once: can be achieved for Kafka => Kafka workflows using the Transactional API (easy with the Kafka Streams API). For Kafka => sink workflows, use an idempotent consumer.
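a minimal producer sketch that puts the settings above together: safe defaults, keyed messages, producer-level compression, and batching. the property values follow the bullets above; the topic name reuses my-first-topic from the operations section, and the compression type and batching values are illustrative choices, not repo settings:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SafeProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // safe producer settings (defaults since kafka v3.0, set explicitly here)
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        // producer-level compression and batching (illustrative values)
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20"); // wait up to 20 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024)); // 32 KB batches
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // records with the same key always land in the same partition
            producer.send(new ProducerRecord<>("my-first-topic", "user-1", "hello"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                    else System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
                });
            producer.flush();
        }
    }
}

and an at-least-once consumer sketch: auto-commit is disabled and offsets are committed only after the records are processed, so a crash mid-processing leads to re-reads (duplicates) rather than data loss, hence the note above about idempotent processing. the group id is hypothetical:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "at-least-once-demo");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-first-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    process(r); // make this idempotent: duplicates are possible on failure
                }
                consumer.commitSync(); // commit only after processing succeeded
            }
        }
    }

    private static void process(ConsumerRecord<String, String> r) {
        System.out.println(r.value());
    }
}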

kafka connect

Kafka Connect makes it easy to stream data from numerous sources into Kafka and from Kafka into numerous sinks, with hundreds of available connectors.

  • configurable data pipelines
  • integrates external systems with kafka
  • supported by the strimzi operator

kafka streams

Data processing and transformation library within Kafka.

  • Java API
  • exactly-once capabilities
  • one record at a time (no batching)
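a minimal Kafka Streams topology sketch: one record at a time, with exactly-once processing enabled. the application id and the input/output topic names are hypothetical:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // exactly-once processing for Kafka => Kafka workflows
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        // transform each record as it arrives (no batching)
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}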

best practices

partition count and replication factor

  • partition
    • small cluster (< 6 brokers): 3 x the number of brokers (e.g. 3 brokers => 9 partitions per topic)
    • big cluster (> 12 brokers): 2 x the number of brokers
    • more partitions mean more leader elections for Zookeeper to perform
  • replication
    • at least 2, usually 3, maximum 4
    • a higher replication factor means
      • better durability
      • higher availability
      • but more latency
      • but more disk space
  • cluster
    • with Zookeeper
      • max 200,000 partitions - Zookeeper scaling limit
        • 4,000 partitions per broker
    • with Kraft
      • potential for millions of partitions

topics configuration

  • topics are made of partitions, and partitions are made of segments.
  • log cleanup policies
    • delete
      • by time
      • by size
    • compact
      • log compaction
        • keeps the most recent value for each key

a common topic naming convention: <message type>.<dataset name>.<data name>.<data format>
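a compacted topic with these settings can be created via the AdminClient. a minimal sketch, with a hypothetical topic name following the convention above:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-cluster-kafka-bootstrap:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // hypothetical name following <message type>.<dataset name>.<data name>.<data format>
            NewTopic topic = new NewTopic("queuing.users.user-profile.json", 3, (short) 1)
                // log compaction keeps the most recent value for each key
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}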

cleanup

tl;dr: ./scripts/down.sh

kubectl delete -f kafka/ -n kafka
kubectl delete -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
helm uninstall my-kafka-ui -n kafka
helm uninstall my-opensearch -n opensearch
helm uninstall my-opensearch-dashboards -n opensearch
kubectl delete namespace kafka
kubectl delete namespace opensearch
