Using volumeClaimTemplates results in PVCs being deleted before pods come up
ryanmorris708 opened this issue
I have attempted to deploy a namespace-scoped Druid Operator and a Druid CR cluster that uses volumeClaimTemplates for the MiddleManagers and Historicals. However, the PVCs are deleted immediately after creation (the deletions are logged by the Operator), and the MiddleManagers and Historicals remain in the Pending state due to the missing PVCs.
I have tried setting deleteOrphanPVC: false and/or disablePVCDeletionFinalizer: true, but neither has any effect.
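For reference, this is where I put those flags in the Druid CR. A minimal sketch; I took the field placement from the operator README, and the exact casing (e.g. deleteOrphanPvc vs. deleteOrphanPVC) may differ between operator versions, so it is worth verifying against the CRD:

```yaml
apiVersion: "druid.apache.org/v1alpha1"
kind: "Druid"
metadata:
  name: tiny-cluster
spec:
  # Both flags sit at the top level of spec, next to image/startScript.
  # Field names as I understand them from the operator docs; verify
  # against the CRD for your operator version.
  deleteOrphanPvc: false             # do not delete PVCs left behind by orphaned pods
  disablePVCDeletionFinalizer: true  # skip the finalizer that deletes PVCs on CR deletion
  image: apache/druid:0.20.1
  startScript: /druid.sh
  # ... rest of the spec as below ...
```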
Manually provisioning the PVCs before deploying the cluster is not a good option, since a StatefulSet will not let each Pod bind to its own PVC unless the PVCs are provisioned dynamically through volumeClaimTemplates. Running multiple Historicals or MiddleManagers is therefore impossible, because they would be forced to share the same segment cache, log files, tmp directory, etc.
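(For context: with dynamic provisioning, the StatefulSet controller creates one PVC per Pod, named `<claim-template-name>-<pod-name>`. With one replica each, and assuming the Historicals use the same claim-template name as the MiddleManagers, I would expect `kubectl -n druid get pvc` to list claims along these lines:)

```
data-volume-druid-tiny-cluster-middlemanagers-0
data-volume-druid-tiny-cluster-historicals-0
```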
Please see my definitions and debug output below. I have omitted the node definitions and logs except for the MiddleManager to shorten this post.
Druid Operator my-values.yaml:
```yaml
env:
  DENY_LIST: ""
  RECONCILE_WAIT: "10s"
  WATCH_NAMESPACE: "druid"
watchNamespace: true
replicaCount: 1
image:
  repository: druidio/druid-operator
  pullPolicy: IfNotPresent
  tag: ""
imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
rbac:
  create: true
serviceAccount:
  create: true
  annotations: {}
  name: ""
podAnnotations: {}
podSecurityContext: {}
securityContext: {}
resources:
  requests:
    cpu: 100m
    memory: 100Mi
nodeSelector: {}
tolerations: []
affinity: {}
```
Druid CR my-tiny-cluster.yaml (omitted node definitions except for MM with PVC):
```yaml
apiVersion: "druid.apache.org/v1alpha1"
kind: "Druid"
metadata:
  name: tiny-cluster
spec:
  image: apache/druid:0.20.1
  startScript: /druid.sh
  securityContext:
    fsGroup: 1000
    runAsUser: 1000
    runAsGroup: 1000
  services:
    - spec:
        type: ClusterIP
        clusterIP: None
  commonConfigMountPath: "/opt/druid/conf/druid/cluster/_common"
  jvm.options: |-
    -server
    -XX:MaxDirectMemorySize=10240g
    -Duser.timezone=UTC
    -Dfile.encoding=UTF-8
    -Dlog4j.debug
    -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
    -Djava.io.tmpdir=/opt/druid/var/tmp
  log4j.config: |-
    <?xml version="1.0" encoding="UTF-8" ?>
    <Configuration status="WARN">
      <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
          <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
        </Console>
      </Appenders>
      <Loggers>
        <Root level="info">
          <AppenderRef ref="Console"/>
        </Root>
      </Loggers>
    </Configuration>
  common.runtime.properties: |
    # Zookeeper
    druid.zk.service.host=tiny-cluster-zk-0.tiny-cluster-zk
    druid.zk.paths.base=/druid
    druid.zk.service.compress=false
    # Metadata Store
    druid.metadata.storage.type=derby
    druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/opt/druid/var/derbydb/metadata.db;create=true
    druid.metadata.storage.connector.host=localhost
    druid.metadata.storage.connector.port=1527
    druid.metadata.storage.connector.createTables=true
    # Deep Storage
    druid.storage.type=local
    druid.storage.storageDirectory=/opt/druid/var/deepstorage
    #
    # Extensions
    #
    druid.extensions.loadList=["druid-kafka-indexing-service"]
    #
    # Service discovery
    #
    druid.selectors.indexing.serviceName=druid/overlord
    druid.selectors.coordinator.serviceName=druid/coordinator
    druid.indexer.logs.type=file
    druid.indexer.logs.directory=/opt/druid/var/indexing-logs
    druid.lookup.enableLookupSyncOnStartup=false
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
  nodes:
    <...>
    middlemanagers:
      druid.port: 8091
      extra.jvm.options: |-
        -Xms256m
        -Xmx256m
      nodeConfigMountPath: /opt/druid/conf/druid/cluster/data/middleManager
      nodeType: middleManager
      ports:
        - containerPort: 8100
          name: peon-0
      replicas: 1
      resources:
        requests:
          cpu: 100m
          memory: 100Mi
      runtime.properties: |-
        druid.service=druid/middleManager
        druid.server.http.numThreads=60
        druid.worker.capacity=2
        # Indexer settings
        druid.indexer.runner.javaOptsArray=["-server", "-Xms6g", "-Xmx6g", "-XX:+UseG1GC", "-XX:MaxDirectMemorySize=3g", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-Djava.io.tmpdir=/opt/druid/var/tmp", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
        druid.indexer.task.baseTaskDir=/opt/druid/var/druid/task
        # Processing threads and buffers on Peons
        druid.indexer.fork.property.druid.processing.numMergeBuffers=2
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=256000000
        druid.indexer.fork.property.druid.processing.numThreads=1
        # Peon query cache
        druid.realtime.cache.useCache=true
        druid.realtime.cache.populateCache=true
        druid.cache.sizeInBytes=256000000
      volumeClaimTemplates:
        - metadata:
            name: data-volume
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 30Gi
            storageClassName: csi-storageclass
      volumeMounts:
        - mountPath: /opt/druid/var
          name: data-volume
```
Output of "kubectl -n druid get pods":
```
NAME                                      READY   STATUS    RESTARTS   AGE
druid-operator-nonprod-8585747989-jpmvq   1/1     Running   0          2d21h
druid-tiny-cluster-brokers-0              1/1     Running   0          3m3s
druid-tiny-cluster-coordinators-0         1/1     Running   0          3m2s
druid-tiny-cluster-historicals-0          0/1     Pending   0          3m3s
druid-tiny-cluster-middlemanagers-0       0/1     Pending   0          3m3s
druid-tiny-cluster-routers-0              1/1     Running   0          3m2s
tiny-cluster-zk-0                         1/1     Running   0          24m
```
Output of "kubectl -n druid get pvc":
```
No resources found in druid namespace.
```
Output of "kubectl -n druid logs druid-operator-nonprod-8585747989-jpmvq" (omitted node logs except for MM with PVC):
```
2021-05-14T17:56:03.967Z  DEBUG  controller-runtime.controller  Successfully Reconciled  {"controller": "druid", "request": "druid/nonprod"}
2021-05-14T17:56:09.833Z  DEBUG  controller-runtime.controller  Successfully Reconciled  {"controller": "druid", "request": "druid/nonprod"}
2021-05-14T17:57:30.013Z  INFO   druid_operator_handler  Created [:tiny-cluster-druid-common-config].  {"Object": <...>}
2021-05-14T17:57:30.046Z  INFO   druid_operator_handler  Updated [Druid:tiny-cluster].  {"Prev Object": <...>}
<...>
2021-05-14T17:57:30.127Z  INFO   druid_operator_handler  Created [:druid-tiny-cluster-middlemanagers-config].  {"Object": <...>}
2021-05-14T17:57:30.156Z  INFO   druid_operator_handler  Created [:druid-tiny-cluster-middlemanagers].  {"Object": <...>}
2021-05-14T17:57:30.199Z  INFO   druid_operator_handler  Created [:druid-tiny-cluster-middlemanagers].  {"Object": <...>}
<...>
2021-05-14T17:57:30.520Z  INFO   druid_operator_handler  Successfully deleted object [PersistentVolumeClaim] in namespace [druid]  {"name": "tiny-cluster", "namespace": "druid"}
2021-05-14T17:57:30.599Z  INFO   druid_operator_handler  Updated [StatefulSet:druid-tiny-cluster-middlemanagers].  {"Prev Object": <...>}
<...>
```
@ryanmorris708 can you confirm which druid operator version you are using?
Here is my Chart.yaml for the Operator:
```yaml
apiVersion: v2
name: druid-operator
description: Druid Kubernetes Operator

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.1

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
appVersion: 0.0.6
```
@ryanmorris708 can you send me your StorageClass YAML, please?
Can you try adding the parameter `volumeBindingMode: WaitForFirstConsumer` to your StorageClass and re-create the issue?
The volumeBindingMode field is immutable and that StorageClass is in use by other deployments, so I created an identical StorageClass called csi-storageclass-2 with only the volumeBindingMode changed.
I observed the same behavior: the PVCs were deleted immediately.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-storageclass-2
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: Default-Storage
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```
Output of "kubectl describe storageclass csi-storageclass" for comparison (the StorageClass that I used for my initial post):
```
Name:                  csi-storageclass
IsDefaultClass:        Yes
Annotations:           storageclass.kubernetes.io/is-default-class=true
Provisioner:           csi.vsphere.vmware.com
Parameters:            storagepolicyname=Default-Storage
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
```
@ryanmorris708
I found the bug; it has nothing to do with the deleteOrphanPVC feature or the finalizer.
The operator tries to remove unused resources, and the PVCs are getting caught in that cleanup. I have sent a fix here: https://github.com/druid-io/druid-operator/pull/187/files
cc @himanshug