[Bug]: Kafka running out of disk space and crashing
christopher-wong opened this issue
What happened?
When running Jaeger with Kafka (deployed to k8s via Helm chart 3.0.4), Kafka crashes after a few hours due to running out of disk space.
Our configuration uses a 128Gi volume for Kafka. I've attached our Jaeger metrics dashboard below. Kafka does not appear to be deleting traces once they've been written to Elasticsearch. Looking at the Ingester (and the Kafka consumer lag metric), Kafka does not appear to be under excessive back pressure.
Are traces automatically deleted from Kafka once they have been ingested and stored in ES?
Interestingly, only 1 of 3 Ingester pods appears to actually be reading and writing spans to ES (per the metrics below).
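To answer the question above from how Kafka itself works rather than anything Jaeger-specific: no. Kafka never deletes records because a consumer (here, the Ingester) has read them; consuming only advances offsets. Disk is reclaimed solely by retention settings: log.retention.hours (default 168, i.e. 7 days) and log.retention.bytes (default -1, unbounded), so a busy span topic can outgrow a 128Gi volume long before the time limit ever triggers. A minimal sketch of broker-level retention overrides, assuming a bitnami/kafka chart version that exposes an extraConfig parameter (verify against your chart version):

kafka:
  # extraConfig is appended to the generated server.properties; the property
  # names below are standard Kafka broker settings.
  extraConfig: |
    # Delete log segments older than 24h (broker default is 168h / 7 days).
    log.retention.hours=24
    # Cap each partition's log size; this is a per-partition limit, so size
    # it against the volume divided by the number of partitions it hosts.
    log.retention.bytes=17179869184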
Steps to reproduce
- Deploy Jaeger with Kafka
- After a few hours, Kafka begins crashing because there is no disk space left.
Expected behavior
Kafka should not run out of disk space; spans should be deleted once they've been consumed by the Ingester.
Relevant log output
No response
Screenshot
(image: Jaeger metrics dashboard)
Additional context
No response
Jaeger backend version
v1.56.0
SDK
No response
Pipeline
No response
Storage backend
ES 8.13.2
Operating system
No response
Deployment model
Kubernetes, Helm
Deployment configs
storage:
  type: elasticsearch
  elasticsearch:
    host: otjaeger2-elasticsearch-master-hl
    extraEnv:
      - name: ES_TAGS_AS_FIELDS_ALL
        value: "true"
      - name: ES_USE_ALIASES
        value: "true"
      - name: ES_USE_ARCHIVE_USE_ALIASES
        value: "true"
      - name: ES_JAVA_OPTS
        value: "-Djava.util.logging.ConsoleHandler.level=ALL"
  kafka:
    brokers:
      - otjaeger2-kafka:9092
provisionDataStore:
  cassandra: false
  elasticsearch: true
  kafka: false
elasticsearch:
  global:
    kibanaEnabled: true
  master:
    replicaCount: 3
    persistence:
      enabled: true
      storageClass: "ebs-sc"
      accessModes:
        - ReadWriteOnce
      size: 1000G
    heapSize: 6144m
    resources:
      requests:
        cpu: 2
        memory: 6Gi
      limits:
        cpu: 6
        memory: 16Gi
  data:
    replicaCount: 5
    # autoscaling:
    #   enabled: true
    #   minReplicas: 3
    #   maxReplicas: 6
    #   targetCPU: "60"
    #   targetMemory: "60"
    persistence:
      enabled: true
      storageClass: "ebs-sc"
      accessModes:
        - ReadWriteOnce
      size: 1000Gi
    heapSize: 6144m
    resources:
      requests:
        cpu: 2
        memory: 6Gi
      limits:
        cpu: 6
        memory: 16Gi
  ingest:
    replicaCount: 3
    heapSize: 1036m
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1
        memory: 2Gi
  coordinating:
    replicaCount: 3
    heapSize: 1036m
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1
        memory: 2Gi
kibana:
  ingress:
    enabled: true
    ingressClassName: alb
    hostname: '*'
    path: '/*'
    annotations:
      alb.ingress.kubernetes.io/target-type: 'ip'
      alb.ingress.kubernetes.io/scheme: internal
kafka:
  heapOpts: -Xmx4096m -Xms4096m
  kraft:
    clusterId: MDI2OTJjM2YtNzYyNi00Nz
  controller:
    persistence:
      size: 128Gi
    resources:
      limits:
        cpu: 4
        memory: 8Gi
      requests:
        cpu: 2
        memory: 4Gi
  metrics:
    kafka:
      enabled: true
    jmx:
      enabled: true
    serviceMonitor:
      enabled: true
esIndexCleaner:
  # Must be false for the initial deploy; once ES has started, enable and
  # deploy again. The job will fail if enabled before ES has started.
  enabled: true
  schedule: "55 23 * * *"
  numberOfDays: 7
esRollover:
  # Must be false for the initial deploy; once ES has started, enable and
  # deploy again. The job will fail if enabled before ES has started.
  enabled: true
  extraEnv:
    - name: CONDITIONS
      value: '{"max_age": "3h", "max_size": "25gb"}'
  schedule: "*/30 * * * *"
collector:
  # Must be false for the initial deploy; once esIndexCleaner and esRollover
  # have been enabled, enable the collector and deploy again. This ensures
  # esRolloverInit can create the necessary index aliases before collection begins.
  enabled: true
  image:
    registry: <redacted>
    repository: jaeger-collector
    tag: "1.56.0"
  replicaCount: 5
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: base
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/target-type: 'ip'
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/certificate-arn: <redacted>
      alb.ingress.kubernetes.io/healthcheck-port: '14269'
      alb.ingress.kubernetes.io/healthcheck-path: /
    pathType: Prefix
    hosts:
      - host: <redacted>
        servicePort: http
      - host: <redacted>
        servicePort: otlp-http
  service:
    otlp:
      grpc:
        name: otlp-grpc
      http:
        name: otlp-http
  resources:
    limits:
      cpu: 2
      memory: 1Gi
    requests:
      cpu: 250m
      memory: 128Mi
ingester:
  enabled: false
  image:
    registry: <redacted>
    repository: jaeger-ingester
    tag: "1.56.0"
  replicaCount: 5
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: base
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 250m
      memory: 128Mi
query:
  image:
    registry: <redacted>
    repository: jaeger-query
    tag: "1.56.0"
  replicaCount: 5
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: base
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "external"
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
      service.beta.kubernetes.io/aws-load-balancer-ssl-cert: <redacted>
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    port: 443
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 250m
      memory: 128Mi
agentSidecar:
  enabled: false
agent:
  enabled: false
spark:
  enabled: false
Jaeger does not manage Kafka; it is installed via the Bitnami Helm chart. Closing, as this is not a bug in Jaeger (at least there is no indication in the ticket that it is). Feel free to reopen if you can identify a Jaeger-related root cause.
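For anyone configuring retention through the same Bitnami chart, here is a minimal sketch using topic-level settings instead of broker-level ones. It assumes the chart's provisioning feature (provisioning.topics, present in recent bitnami/kafka versions) and Jaeger's default span topic name jaeger-spans; the partition count and sizes are hypothetical placeholders to adjust for your deployment:

kafka:
  provisioning:
    enabled: true
    topics:
      - name: jaeger-spans          # Jaeger's default span topic
        partitions: 6               # hypothetical; match your actual topic
        replicationFactor: 1
        config:
          # Topic-level retention overrides (ms / bytes). retention.bytes is
          # enforced per partition, so 6 partitions x 16Gi stays under the
          # 128Gi controller volume.
          retention.ms: 86400000        # 24h
          retention.bytes: 17179869184  # 16Gi

The trade-off to keep in mind: retention must comfortably exceed the worst-case Ingester consumer lag, or spans will be deleted from Kafka before they are ever written to Elasticsearch.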