Using volumeClaimTemplates results in PVCs being deleted before pods come up
ryanmorris708 opened this issue
I have attempted to deploy a namespace-scoped Druid Operator and a Druid CR cluster that uses volumeClaimTemplates for the MiddleManagers and Historicals. However, the PVCs are deleted immediately after creation (the deletions are logged by the Operator), and the MiddleManagers and Historicals remain in the Pending state due to the missing PVCs.
I have tried setting deleteOrphanPVC: false and/or disablePVCDeletionFinalizer: true, but neither has any effect.
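For reference, this is where I put those flags in the Druid CR. A minimal sketch; I took the field placement from the operator README, and the exact casing (e.g. deleteOrphanPvc vs. deleteOrphanPVC) may differ between operator versions, so it is worth verifying against the CRD:

```yaml
apiVersion: "druid.apache.org/v1alpha1"
kind: "Druid"
metadata:
  name: tiny-cluster
spec:
  # Both flags sit at the top level of spec, next to image/startScript.
  # Field names as I understand them from the operator docs; verify
  # against the CRD for your operator version.
  deleteOrphanPvc: false             # do not delete PVCs left behind by orphaned pods
  disablePVCDeletionFinalizer: true  # skip the finalizer that deletes PVCs on CR deletion
  image: apache/druid:0.20.1
  startScript: /druid.sh
  # ... rest of the spec as below ...
```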
Manually provisioning the PVCs before deploying the cluster is not a good option, since a StatefulSet will not let each Pod bind to its own PVC unless the PVCs are provisioned dynamically through volumeClaimTemplates. Running multiple Historicals or MiddleManagers is therefore impossible, because they would be forced to share the same segment cache, log files, tmp directory, etc.
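(For context: with dynamic provisioning, the StatefulSet controller creates one PVC per Pod, named `<claim-template-name>-<pod-name>`. With one replica each, and assuming the Historicals use the same claim-template name as the MiddleManagers, I would expect `kubectl -n druid get pvc` to list claims along these lines:)

```
data-volume-druid-tiny-cluster-middlemanagers-0
data-volume-druid-tiny-cluster-historicals-0
```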
Please see my definitions and debug output below. I have omitted the node definitions and logs except for the MiddleManager to shorten this post.
Druid Operator my-values.yaml:
```yaml
env:
  DENY_LIST: ""
  RECONCILE_WAIT: "10s"
  WATCH_NAMESPACE: "druid"
watchNamespace: true
replicaCount: 1
image:
  repository: druidio/druid-operator
  pullPolicy: IfNotPresent
  tag: ""
imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
rbac:
  create: true
serviceAccount:
  create: true
  annotations: {}
  name: ""
podAnnotations: {}
podSecurityContext: {}
securityContext: {}
resources:
  requests:
    cpu: 100m
    memory: 100Mi
nodeSelector: {}
tolerations: []
affinity: {}
```
Druid CR my-tiny-cluster.yaml (omitted node definitions except for MM with PVC):
```yaml
apiVersion: "druid.apache.org/v1alpha1"
kind: "Druid"
metadata:
  name: tiny-cluster
spec:
  image: apache/druid:0.20.1
  startScript: /druid.sh
  securityContext:
    fsGroup: 1000
    runAsUser: 1000
    runAsGroup: 1000
  services:
    - spec:
        type: ClusterIP
        clusterIP: None
  commonConfigMountPath: "/opt/druid/conf/druid/cluster/_common"
  jvm.options: |-
    -server
    -XX:MaxDirectMemorySize=10240g
    -Duser.timezone=UTC
    -Dfile.encoding=UTF-8
    -Dlog4j.debug
    -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
    -Djava.io.tmpdir=/opt/druid/var/tmp
  log4j.config: |-
    <?xml version="1.0" encoding="UTF-8" ?>
    <Configuration status="WARN">
      <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
          <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
        </Console>
      </Appenders>
      <Loggers>
        <Root level="info">
          <AppenderRef ref="Console"/>
        </Root>
      </Loggers>
    </Configuration>
  common.runtime.properties: |
    # Zookeeper
    druid.zk.service.host=tiny-cluster-zk-0.tiny-cluster-zk
    druid.zk.paths.base=/druid
    druid.zk.service.compress=false
    # Metadata Store
    druid.metadata.storage.type=derby
    druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/opt/druid/var/derbydb/metadata.db;create=true
    druid.metadata.storage.connector.host=localhost
    druid.metadata.storage.connector.port=1527
    druid.metadata.storage.connector.createTables=true
    # Deep Storage
    druid.storage.type=local
    druid.storage.storageDirectory=/opt/druid/var/deepstorage
    #
    # Extensions
    #
    druid.extensions.loadList=["druid-kafka-indexing-service"]
    #
    # Service discovery
    #
    druid.selectors.indexing.serviceName=druid/overlord
    druid.selectors.coordinator.serviceName=druid/coordinator
    druid.indexer.logs.type=file
    druid.indexer.logs.directory=/opt/druid/var/indexing-logs
    druid.lookup.enableLookupSyncOnStartup=false
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
  nodes:
    <...>
    middlemanagers:
      druid.port: 8091
      extra.jvm.options: |-
        -Xms256m
        -Xmx256m
      nodeConfigMountPath: /opt/druid/conf/druid/cluster/data/middleManager
      nodeType: middleManager
      ports:
        - containerPort: 8100
          name: peon-0
      replicas: 1
      resources:
        requests:
          cpu: 100m
          memory: 100Mi
      runtime.properties: |-
        druid.service=druid/middleManager
        druid.server.http.numThreads=60
        druid.worker.capacity=2
        # Indexer settings
        druid.indexer.runner.javaOptsArray=["-server", "-Xms6g", "-Xmx6g", "-XX:+UseG1GC", "-XX:MaxDirectMemorySize=3g", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-Djava.io.tmpdir=/opt/druid/var/tmp", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
        druid.indexer.task.baseTaskDir=/opt/druid/var/druid/task
        # Processing threads and buffers on Peons
        druid.indexer.fork.property.druid.processing.numMergeBuffers=2
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=256000000
        druid.indexer.fork.property.druid.processing.numThreads=1
        # Peon query cache
        druid.realtime.cache.useCache=true
        druid.realtime.cache.populateCache=true
        druid.cache.sizeInBytes=256000000
      volumeClaimTemplates:
        - metadata:
            name: data-volume
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 30Gi
            storageClassName: csi-storageclass
      volumeMounts:
        - mountPath: /opt/druid/var
          name: data-volume
```
Output of "kubectl -n druid get pods":
```
NAME                                      READY   STATUS    RESTARTS   AGE
druid-operator-nonprod-8585747989-jpmvq   1/1     Running   0          2d21h
druid-tiny-cluster-brokers-0              1/1     Running   0          3m3s
druid-tiny-cluster-coordinators-0         1/1     Running   0          3m2s
druid-tiny-cluster-historicals-0          0/1     Pending   0          3m3s
druid-tiny-cluster-middlemanagers-0       0/1     Pending   0          3m3s
druid-tiny-cluster-routers-0              1/1     Running   0          3m2s
tiny-cluster-zk-0                         1/1     Running   0          24m
```
Output of "kubectl -n druid get pvc":
```
No resources found in druid namespace.
```
Output of "kubectl -n druid logs druid-operator-nonprod-8585747989-jpmvq" (omitted node logs except for MM with PVC):
```
2021-05-14T17:56:03.967Z  DEBUG  controller-runtime.controller  Successfully Reconciled  {"controller": "druid", "request": "druid/nonprod"}
2021-05-14T17:56:09.833Z  DEBUG  controller-runtime.controller  Successfully Reconciled  {"controller": "druid", "request": "druid/nonprod"}
2021-05-14T17:57:30.013Z  INFO   druid_operator_handler  Created [:tiny-cluster-druid-common-config].  {"Object": <...>}
2021-05-14T17:57:30.046Z  INFO   druid_operator_handler  Updated [Druid:tiny-cluster].  {"Prev Object": <...>}
<...>
2021-05-14T17:57:30.127Z  INFO   druid_operator_handler  Created [:druid-tiny-cluster-middlemanagers-config].  {"Object": <...>}
2021-05-14T17:57:30.156Z  INFO   druid_operator_handler  Created [:druid-tiny-cluster-middlemanagers].  {"Object": <...>}
2021-05-14T17:57:30.199Z  INFO   druid_operator_handler  Created [:druid-tiny-cluster-middlemanagers].  {"Object": <...>}
<...>
2021-05-14T17:57:30.520Z  INFO   druid_operator_handler  Successfully deleted object [PersistentVolumeClaim] in namespace [druid]  {"name": "tiny-cluster", "namespace": "druid"}
2021-05-14T17:57:30.599Z  INFO   druid_operator_handler  Updated [StatefulSet:druid-tiny-cluster-middlemanagers].  {"Prev Object": <...>}
<...>
```
@ryanmorris708 can you confirm which druid operator version you are using?
Here is my Chart.yaml for the Operator:
```yaml
apiVersion: v2
name: druid-operator
description: Druid Kubernetes Operator

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.1

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
appVersion: 0.0.6
```
@ryanmorris708 can you send me your StorageClass YAML, please?
Can you try adding the parameter `volumeBindingMode: WaitForFirstConsumer` to your StorageClass and re-create the issue?
The volumeBindingMode field is immutable and that StorageClass is in use by other deployments, so I created an identical StorageClass called csi-storageclass-2 with only the volumeBindingMode changed.
I observed the same behavior: the PVCs were deleted immediately.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-storageclass-2
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: Default-Storage
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```
Output of "kubectl describe storageclass csi-storageclass" for comparison (the StorageClass that I used for my initial post):
```
Name:                  csi-storageclass
IsDefaultClass:        Yes
Annotations:           storageclass.kubernetes.io/is-default-class=true
Provisioner:           csi.vsphere.vmware.com
Parameters:            storagepolicyname=Default-Storage
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
```
@ryanmorris708
I found the bug; it has nothing to do with the deleteOrphanPVC feature or the finalizer.
The operator tries to remove unused resources, and the PVCs are getting caught in that cleanup. I have sent a fix here: https://github.com/druid-io/druid-operator/pull/187/files
cc @himanshug