kserve / modelmesh

Distributed Model Serving Framework

Error when enabling TLS for ModelMesh

lizzzcai opened this issue

Hi, I am following this doc to enable TLS for ModelMesh.

I have enabled rest-proxy, and rest-proxy enabled TLS successfully, so the TLS secret should be mounted correctly into the Pod. The error is coming from the mm container.

Error logs from mm container:

❯ k logs modelmesh-serving-custom-mlserver-1.x-6755949cc9-mvr7j mm
Running as uid=2000(app) gid=2000(app) groups=2000(app): app
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH=
PKG_CONFIG_PATH=
XDG_DATA_DIRS=

Running in Kubernetes
Registering internal (pod) endpoint only
PORT_ARGS=-p 8080 -r 8080
WATSON_SERVICE_ADDRESS=10.244.0.73
Registering instance id (from Kubernetes pod name) as "949cc9-mvr7j"
WARNING: MM_SVC_GRPC_PRIVATE_KEY_PATH not set *AND/OR* MM_ENABLE_SSL=false, using PLAINTEXT for internal comms
SERVICE_VERSION=
build-version=20220721-36830
Registering ModelMesh Service version as "20220721-36830"
JAVA_HOME set to /usr/lib/jvm/jre-17-openjdk
Java version information:
openjdk version "17.0.3" 2022-04-19 LTS
OpenJDK Runtime Environment 21.9 (build 17.0.3+7-LTS)
OpenJDK 64-Bit Server VM 21.9 (build 17.0.3+7-LTS, mixed mode, sharing)
KV_STORE=etcd:/opt/kserve/mmesh/etcd/etcd_connection
LL_REGISTRY=
ZOOKEEPER=
WATSON_SERVICE_ADDRESS=10.244.0.73
MM_SERVICE_NAME=modelmesh-serving
PRIVATE_ENDPOINT=
MM_LOCATION=172.18.0.2
MM_SERVICE_CLASS=com.ibm.watson.modelmesh.SidecarModelMesh
INTERNAL_GRPC_PORT=8085
SERVICE_ARGS=-p 8080 -r 8080 -i 949cc9-mvr7j -v 20220721-36830
Certificate was added to keystore
Imported provided CA cert into litelinks truststore: /opt/kserve/mmesh/tls/tls.crt
Using provided private key for litelinks (internal) TLS: /opt/kserve/mmesh/tls/tls.key
Certificate was added to keystore
Imported Kubernetes CA certificate into litelinks truststore: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
No process mem limit provided or found, defaulting to 1536MiB
MEM_LIMIT_MB=1536
Using default heap size of MIN(41% of MEM_LIMIT_MB, 640MiB) = 629MiB
HEAP_SIZE_MB=629
MEM_HEADROOM_MB=189
MAX_GC_PAUSE=50 millisecs
MAX_DIRECT_BUFS_MB=715
Health probe HTTP endpoint will use port 8089
SHUTDOWN_TIMEOUT_MS=90000
+ exec /usr/lib/jvm/jre-17-openjdk/bin/java -cp 'lib/litelinks-core-1.7.2.jar:lib/*' -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:-ResizePLAB -Xmx629m -Xms629m -XX:MaxInlineLevel=28 -Xlog:gc:/opt/kserve/mmesh/log/vgc_modelmesh-serving-custom-mlserver-1.x-6755949cc9-mvr7j.log -Dfile.encoding=UTF8 -Dio.netty.tryReflectionSetAccessible=true --add-opens=java.base/java.nio=ALL-UNNAMED -Dcom.redhat.fips=false -XX:MaxDirectMemorySize=33554432 -Dio.netty.maxDirectMemory=749731840 -Dio.grpc.netty.useCustomAllocator=false -Dlitelinks.ssl.key.path=/opt/kserve/mmesh/tls/tls.key -Dlitelinks.ssl.key.certpath=/opt/kserve/mmesh/tls/tls.crt -Dwatson.ssl.truststore.path=/opt/kserve/mmesh/lib/truststore.jks -Dwatson.ssl.truststore.password=watson15qa -Dlitelinks.cancel_on_client_close=true -Dlitelinks.threadcontexts=log_mdc -Dlitelinks.shutdown_timeout_ms=90000 -Dlitelinks.produce_pooled_bytebufs=true -Dlitelinks.ssl.use_jdk=false -Dlog4j.configurationFile=/opt/kserve/mmesh/lib/log4j2.xml -Dlog4j2.enable.threadlocals=true -Dlog4j2.garbagefree.threadContextMap=true com.ibm.watson.litelinks.server.LitelinksService -s com.ibm.watson.modelmesh.SidecarModelMesh -n modelmesh-serving -a /opt/kserve/mmesh/model-mesh.anchor -p 8080 -r 8080 -i 949cc9-mvr7j -v 20220721-36830 -h 8089
using service registry string: etcd:/opt/kserve/mmesh/etcd/etcd_connection
{"instant":{"epochSecond":1665134888,"nanoOfSecond":351742124},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.WatchedService","message":"Starting service-watching wrapper; hostname=10.244.0.73","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":449461245},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.ProbeHttpServer","message":"Starting litelinks health probe http server on port 8089","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":455341678},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.NettyCommon","message":"Litelinks using native transport (epoll)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":480884593},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.NettyCommon","message":"Creating litelinks shared worker ELG with 3 threads","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":524638590},"thread":"ll-svc-events-3","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.DefaultThriftServer","message":"initializing service implementation...","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":20,"threadPriority":5}
service starting, type "stop" to stop
{"instant":{"epochSecond":1665134888,"nanoOfSecond":751308930},"thread":"ll-svc-events-3","level":"INFO","loggerName":"com.ibm.watson.kvutils.factory.KVUtilsFactory","message":"KV_STORE=etcd:/opt/kserve/mmesh/etcd/etcd_connection","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":20,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":752015442},"thread":"ll-svc-events-3","level":"INFO","loggerName":"com.ibm.watson.kvutils.factory.KVUtilsFactory","message":"creating new etcd KV factory with config file: /opt/kserve/mmesh/etcd/etcd_connection","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":20,"threadPriority":5}
service failed, exiting
java.lang.IllegalStateException: Expected the service modelmesh-serving to be RUNNING, but the service has FAILED
        at com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:381)
        at com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:321)
        at com.ibm.watson.litelinks.server.LitelinksService.run(LitelinksService.java:699)
        at com.ibm.watson.litelinks.server.LitelinksService.launch(LitelinksService.java:144)
        at com.ibm.watson.litelinks.server.LitelinksService.main(LitelinksService.java:108)
Caused by: java.io.IOException: Can't find certificate file: tls.key
        at com.ibm.etcd.client.config.EtcdClusterConfig.certFromJson(EtcdClusterConfig.java:225)
        at com.ibm.etcd.client.config.EtcdClusterConfig.fromJson(EtcdClusterConfig.java:199)
        at com.ibm.etcd.client.config.EtcdClusterConfig.fromJsonFileOrSimple(EtcdClusterConfig.java:265)
        at com.ibm.watson.etcd.EtcdUtilsFactory.<init>(EtcdUtilsFactory.java:58)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
        at com.ibm.watson.kvutils.factory.KVUtilsFactory.newFactory(KVUtilsFactory.java:99)
        at com.ibm.watson.kvutils.factory.KVUtilsFactory.getDefaultFactory(KVUtilsFactory.java:60)
        at com.ibm.watson.modelmesh.ModelMesh.<init>(ModelMesh.java:470)
        at com.ibm.watson.modelmesh.SidecarModelMesh.<init>(SidecarModelMesh.java:147)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
        at java.base/java.lang.reflect.ReflectAccess.newInstance(ReflectAccess.java:128)
        at java.base/jdk.internal.reflect.ReflectionFactory.newInstance(ReflectionFactory.java:347)
        at java.base/java.lang.Class.newInstance(Class.java:645)
        at com.ibm.watson.litelinks.server.DefaultThriftServer.lambda$doStart$0(DefaultThriftServer.java:384)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
        at com.ibm.watson.litelinks.ThreadPoolHelper$3$1.run(ThreadPoolHelper.java:91)
{"instant":{"epochSecond":1665134888,"nanoOfSecond":769517095},"thread":"ll-svc-events-5","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.ProbeHttpServer","message":"Stopping litelinks health probe http server on port 8089","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":22,"threadPriority":5}

mm container spec:

  containers:
  - env:
    - name: MM_SERVICE_NAME
      value: modelmesh-serving
    - name: MM_SVC_GRPC_PORT
      value: "8033"
    - name: WKUBE_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: WKUBE_POD_IPADDR
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: MM_LOCATION
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: KV_STORE
      value: etcd:/opt/kserve/mmesh/etcd/etcd_connection
    - name: MM_METRICS
      value: prometheus:port=2112;scheme=https
    - name: SHUTDOWN_TIMEOUT_MS
      value: "90000"
    - name: MM_SVC_GRPC_MAX_HEADERS_SIZE
      value: "32768"
    - name: INTERNAL_SERVING_GRPC_PORT
      value: "8001"
    - name: INTERNAL_GRPC_PORT
      value: "8085"
    - name: MM_SVC_GRPC_MAX_MSG_SIZE
      value: "16777216"
    - name: MM_KVSTORE_PREFIX
      value: mm
    - name: MM_DEFAULT_VMODEL_OWNER
      value: ksp
    - name: MM_LABELS
      value: mt:custom,mt:custom:1,rt:custom-mlserver-1.x
    - name: MM_TYPE_CONSTRAINTS_PATH
      value: /etc/watson/mmesh/config/type_constraints
    - name: MM_DATAPLANE_CONFIG_PATH
      value: /etc/watson/mmesh/config/dataplane_api_config
    - name: MM_TLS_KEY_CERT_PATH
      value: /opt/kserve/mmesh/tls/tls.crt
    - name: MM_TLS_PRIVATE_KEY_PATH
      value: /opt/kserve/mmesh/tls/tls.key
    image: kserve/modelmesh:v0.9.0
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /opt/kserve/mmesh/stop.sh
          - wait
    livenessProbe:
      failureThreshold: 2
      httpGet:
        path: /live
        port: 8089
        scheme: HTTP
      initialDelaySeconds: 90
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    name: mm
    ports:
    - containerPort: 8033
      name: grpc
      protocol: TCP
    - containerPort: 2112
      name: prometheus
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /ready
        port: 8089
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "3"
        memory: 448Mi
      requests:
        cpu: 300m
        memory: 448Mi
    securityContext:
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/watson/mmesh/config
      name: tc-config
    - mountPath: /opt/kserve/mmesh/etcd
      name: etcd-config
      readOnly: true
    - mountPath: /opt/kserve/mmesh/tls
      name: tls-certs
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-8gc8b
      readOnly: true

...
  volumes:
  - emptyDir:
      sizeLimit: 1536Mi
    name: models-dir
  - name: storage-config
    secret:
      defaultMode: 420
      secretName: storage-config
  - configMap:
      defaultMode: 420
      name: tc-config
    name: tc-config
  - name: etcd-config
    secret:
      defaultMode: 420
      secretName: model-serving-etcd
  - name: tls-certs
    secret:
      defaultMode: 420
      secretName: etcd-client-certificate
  - name: kube-api-access-8gc8b
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

@lizzzcai apologies, I missed this. What do the contents of your model-serving-etcd secret look like? Based on the error, this doesn't look related to enabling TLS for the modelmesh service itself, but rather to the TLS config for connecting to etcd.

It looks like you must have "client_key_file": "tls.key" in your etcd_connection json but no corresponding key named tls.key with the private key contents in the same secret (I'm assuming here you are configuring client authentication to the etcd cluster).
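The mismatch described above can be checked before deploying. The following is a minimal sketch, not part of ModelMesh: `missing_file_refs` is a hypothetical helper, and it assumes that every field in the etcd_connection JSON whose name ends in `_file` holds a bare file name that must also exist as a key in the same secret. Only `client_key_file` is confirmed by this thread; the other field names and the endpoint in the example data are illustrative assumptions.

```python
import json


def missing_file_refs(secret_data, config_key="etcd_connection"):
    """Return (field, filename) pairs that the etcd config references
    but that are not present as keys in the same secret.

    secret_data: dict of already base64-decoded secret entries.
    """
    config = json.loads(secret_data[config_key])
    missing = []
    for field, value in config.items():
        # Assumption: fields ending in "_file" name files expected to sit
        # next to the config file, i.e. as sibling keys in the secret.
        if not field.endswith("_file") or not isinstance(value, str):
            continue
        # Absolute paths point outside the secret and cannot be checked here.
        if value.startswith("/"):
            continue
        if value not in secret_data:
            missing.append((field, value))
    return missing


# Hypothetical decoded contents of the model-serving-etcd secret:
# the config references tls.key, but only tls.crt is stored.
example_secret = {
    "etcd_connection": json.dumps({
        "endpoints": "https://etcd.example:2379",
        "client_key_file": "tls.key",
        "client_certificate_file": "tls.crt",
    }),
    "tls.crt": "-----BEGIN CERTIFICATE-----...",
}

print(missing_file_refs(example_secret))
# prints [('client_key_file', 'tls.key')]
```

An empty list would mean every referenced file name has a matching key in the secret; it says nothing about whether the key material itself is valid.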

@lizzzcai ping :) can this be closed now?

Hi @njhill, sorry for the late reply. I will try the mTLS setup again today and update you on this issue.

@lizzzcai sure, no rush at all, just checking whether the issue could be closed yet.

Hi @njhill, the issue is solved. I had left modelmesh-serving.modelmesh-serving out of the dnsNames of my certificate. Thanks for your support.
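For anyone hitting the same problem, the dnsNames check can be sketched as below. This is an illustrative helper, not a ModelMesh tool: `missing_dns_names` is a hypothetical function, and the expected list (the service name plus `<service>.<namespace>`) is an assumption based on the name reported missing in this thread. Your deployment may need additional entries.

```python
def missing_dns_names(cert_dns_names, service, namespace):
    """Return the expected DNS names that are absent from the certificate.

    cert_dns_names: the dnsNames (subject alternative names) listed in the
    certificate, supplied by the caller.
    """
    # Assumption: the certificate should cover the bare service name and
    # the namespace-qualified form <service>.<namespace>.
    expected = [service, f"{service}.{namespace}"]
    present = {name.lower() for name in cert_dns_names}
    return [name for name in expected if name.lower() not in present]


# Service "modelmesh-serving" in namespace "modelmesh-serving", with a
# certificate that lacks the namespace-qualified name.
print(missing_dns_names(
    ["localhost", "modelmesh-serving"],
    service="modelmesh-serving",
    namespace="modelmesh-serving",
))
# prints ['modelmesh-serving.modelmesh-serving']
```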