Error when enabling TLS for ModelMesh
lizzzcai opened this issue
Hi, I am following this doc to enable TLS for ModelMesh.
I have enabled rest-proxy, and rest-proxy enabled TLS successfully, so the TLS secret should be mounted correctly into the Pod. The error is coming from the mm container.
Error logs from the mm container:
❯ k logs modelmesh-serving-custom-mlserver-1.x-6755949cc9-mvr7j mm
Running as uid=2000(app) gid=2000(app) groups=2000(app): app
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH=
PKG_CONFIG_PATH=
XDG_DATA_DIRS=
Running in Kubernetes
Registering internal (pod) endpoint only
PORT_ARGS=-p 8080 -r 8080
WATSON_SERVICE_ADDRESS=10.244.0.73
Registering instance id (from Kubernetes pod name) as "949cc9-mvr7j"
WARNING: MM_SVC_GRPC_PRIVATE_KEY_PATH not set *AND/OR* MM_ENABLE_SSL=false, using PLAINTEXT for internal comms
SERVICE_VERSION=
build-version=20220721-36830
Registering ModelMesh Service version as "20220721-36830"
JAVA_HOME set to /usr/lib/jvm/jre-17-openjdk
Java version information:
openjdk version "17.0.3" 2022-04-19 LTS
OpenJDK Runtime Environment 21.9 (build 17.0.3+7-LTS)
OpenJDK 64-Bit Server VM 21.9 (build 17.0.3+7-LTS, mixed mode, sharing)
KV_STORE=etcd:/opt/kserve/mmesh/etcd/etcd_connection
LL_REGISTRY=
ZOOKEEPER=
WATSON_SERVICE_ADDRESS=10.244.0.73
MM_SERVICE_NAME=modelmesh-serving
PRIVATE_ENDPOINT=
MM_LOCATION=172.18.0.2
MM_SERVICE_CLASS=com.ibm.watson.modelmesh.SidecarModelMesh
INTERNAL_GRPC_PORT=8085
SERVICE_ARGS=-p 8080 -r 8080 -i 949cc9-mvr7j -v 20220721-36830
Certificate was added to keystore
Imported provided CA cert into litelinks truststore: /opt/kserve/mmesh/tls/tls.crt
Using provided private key for litelinks (internal) TLS: /opt/kserve/mmesh/tls/tls.key
Certificate was added to keystore
Imported Kubernetes CA certificate into litelinks truststore: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
No process mem limit provided or found, defaulting to 1536MiB
MEM_LIMIT_MB=1536
Using default heap size of MIN(41% of MEM_LIMIT_MB, 640MiB) = 629MiB
HEAP_SIZE_MB=629
MEM_HEADROOM_MB=189
MAX_GC_PAUSE=50 millisecs
MAX_DIRECT_BUFS_MB=715
Health probe HTTP endpoint will use port 8089
SHUTDOWN_TIMEOUT_MS=90000
+ exec /usr/lib/jvm/jre-17-openjdk/bin/java -cp 'lib/litelinks-core-1.7.2.jar:lib/*' -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:-ResizePLAB -Xmx629m -Xms629m -XX:MaxInlineLevel=28 -Xlog:gc:/opt/kserve/mmesh/log/vgc_modelmesh-serving-custom-mlserver-1.x-6755949cc9-mvr7j.log -Dfile.encoding=UTF8 -Dio.netty.tryReflectionSetAccessible=true --add-opens=java.base/java.nio=ALL-UNNAMED -Dcom.redhat.fips=false -XX:MaxDirectMemorySize=33554432 -Dio.netty.maxDirectMemory=749731840 -Dio.grpc.netty.useCustomAllocator=false -Dlitelinks.ssl.key.path=/opt/kserve/mmesh/tls/tls.key -Dlitelinks.ssl.key.certpath=/opt/kserve/mmesh/tls/tls.crt -Dwatson.ssl.truststore.path=/opt/kserve/mmesh/lib/truststore.jks -Dwatson.ssl.truststore.password=watson15qa -Dlitelinks.cancel_on_client_close=true -Dlitelinks.threadcontexts=log_mdc -Dlitelinks.shutdown_timeout_ms=90000 -Dlitelinks.produce_pooled_bytebufs=true -Dlitelinks.ssl.use_jdk=false -Dlog4j.configurationFile=/opt/kserve/mmesh/lib/log4j2.xml -Dlog4j2.enable.threadlocals=true -Dlog4j2.garbagefree.threadContextMap=true com.ibm.watson.litelinks.server.LitelinksService -s com.ibm.watson.modelmesh.SidecarModelMesh -n modelmesh-serving -a /opt/kserve/mmesh/model-mesh.anchor -p 8080 -r 8080 -i 949cc9-mvr7j -v 20220721-36830 -h 8089
using service registry string: etcd:/opt/kserve/mmesh/etcd/etcd_connection
{"instant":{"epochSecond":1665134888,"nanoOfSecond":351742124},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.WatchedService","message":"Starting service-watching wrapper; hostname=10.244.0.73","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":449461245},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.ProbeHttpServer","message":"Starting litelinks health probe http server on port 8089","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":455341678},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.NettyCommon","message":"Litelinks using native transport (epoll)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":480884593},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.NettyCommon","message":"Creating litelinks shared worker ELG with 3 threads","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":524638590},"thread":"ll-svc-events-3","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.DefaultThriftServer","message":"initializing service implementation...","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":20,"threadPriority":5}
service starting, type "stop" to stop
{"instant":{"epochSecond":1665134888,"nanoOfSecond":751308930},"thread":"ll-svc-events-3","level":"INFO","loggerName":"com.ibm.watson.kvutils.factory.KVUtilsFactory","message":"KV_STORE=etcd:/opt/kserve/mmesh/etcd/etcd_connection","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":20,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":752015442},"thread":"ll-svc-events-3","level":"INFO","loggerName":"com.ibm.watson.kvutils.factory.KVUtilsFactory","message":"creating new etcd KV factory with config file: /opt/kserve/mmesh/etcd/etcd_connection","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":20,"threadPriority":5}
service failed, exiting
java.lang.IllegalStateException: Expected the service modelmesh-serving to be RUNNING, but the service has FAILED
at com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:381)
at com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:321)
at com.ibm.watson.litelinks.server.LitelinksService.run(LitelinksService.java:699)
at com.ibm.watson.litelinks.server.LitelinksService.launch(LitelinksService.java:144)
at com.ibm.watson.litelinks.server.LitelinksService.main(LitelinksService.java:108)
Caused by: java.io.IOException: Can't find certificate file: tls.key
at com.ibm.etcd.client.config.EtcdClusterConfig.certFromJson(EtcdClusterConfig.java:225)
at com.ibm.etcd.client.config.EtcdClusterConfig.fromJson(EtcdClusterConfig.java:199)
at com.ibm.etcd.client.config.EtcdClusterConfig.fromJsonFileOrSimple(EtcdClusterConfig.java:265)
at com.ibm.watson.etcd.EtcdUtilsFactory.<init>(EtcdUtilsFactory.java:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at com.ibm.watson.kvutils.factory.KVUtilsFactory.newFactory(KVUtilsFactory.java:99)
at com.ibm.watson.kvutils.factory.KVUtilsFactory.getDefaultFactory(KVUtilsFactory.java:60)
at com.ibm.watson.modelmesh.ModelMesh.<init>(ModelMesh.java:470)
at com.ibm.watson.modelmesh.SidecarModelMesh.<init>(SidecarModelMesh.java:147)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.ReflectAccess.newInstance(ReflectAccess.java:128)
at java.base/jdk.internal.reflect.ReflectionFactory.newInstance(ReflectionFactory.java:347)
at java.base/java.lang.Class.newInstance(Class.java:645)
at com.ibm.watson.litelinks.server.DefaultThriftServer.lambda$doStart$0(DefaultThriftServer.java:384)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
at com.ibm.watson.litelinks.ThreadPoolHelper$3$1.run(ThreadPoolHelper.java:91)
{"instant":{"epochSecond":1665134888,"nanoOfSecond":769517095},"thread":"ll-svc-events-5","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.ProbeHttpServer","message":"Stopping litelinks health probe http server on port 8089","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":22,"threadPriority":5}
mm container spec:
containers:
- env:
  - name: MM_SERVICE_NAME
    value: modelmesh-serving
  - name: MM_SVC_GRPC_PORT
    value: "8033"
  - name: WKUBE_POD_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
  - name: WKUBE_POD_IPADDR
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: status.podIP
  - name: MM_LOCATION
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: status.hostIP
  - name: KV_STORE
    value: etcd:/opt/kserve/mmesh/etcd/etcd_connection
  - name: MM_METRICS
    value: prometheus:port=2112;scheme=https
  - name: SHUTDOWN_TIMEOUT_MS
    value: "90000"
  - name: MM_SVC_GRPC_MAX_HEADERS_SIZE
    value: "32768"
  - name: INTERNAL_SERVING_GRPC_PORT
    value: "8001"
  - name: INTERNAL_GRPC_PORT
    value: "8085"
  - name: MM_SVC_GRPC_MAX_MSG_SIZE
    value: "16777216"
  - name: MM_KVSTORE_PREFIX
    value: mm
  - name: MM_DEFAULT_VMODEL_OWNER
    value: ksp
  - name: MM_LABELS
    value: mt:custom,mt:custom:1,rt:custom-mlserver-1.x
  - name: MM_TYPE_CONSTRAINTS_PATH
    value: /etc/watson/mmesh/config/type_constraints
  - name: MM_DATAPLANE_CONFIG_PATH
    value: /etc/watson/mmesh/config/dataplane_api_config
  - name: MM_TLS_KEY_CERT_PATH
    value: /opt/kserve/mmesh/tls/tls.crt
  - name: MM_TLS_PRIVATE_KEY_PATH
    value: /opt/kserve/mmesh/tls/tls.key
  image: kserve/modelmesh:v0.9.0
  imagePullPolicy: IfNotPresent
  lifecycle:
    preStop:
      exec:
        command:
        - /opt/kserve/mmesh/stop.sh
        - wait
  livenessProbe:
    failureThreshold: 2
    httpGet:
      path: /live
      port: 8089
      scheme: HTTP
    initialDelaySeconds: 90
    periodSeconds: 30
    successThreshold: 1
    timeoutSeconds: 5
  name: mm
  ports:
  - containerPort: 8033
    name: grpc
    protocol: TCP
  - containerPort: 2112
    name: prometheus
    protocol: TCP
  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /ready
      port: 8089
      scheme: HTTP
    initialDelaySeconds: 5
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 1
  resources:
    limits:
      cpu: "3"
      memory: 448Mi
    requests:
      cpu: 300m
      memory: 448Mi
  securityContext:
    capabilities:
      drop:
      - ALL
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
  - mountPath: /etc/watson/mmesh/config
    name: tc-config
  - mountPath: /opt/kserve/mmesh/etcd
    name: etcd-config
    readOnly: true
  - mountPath: /opt/kserve/mmesh/tls
    name: tls-certs
    readOnly: true
  - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    name: kube-api-access-8gc8b
    readOnly: true
...
volumes:
- emptyDir:
    sizeLimit: 1536Mi
  name: models-dir
- name: storage-config
  secret:
    defaultMode: 420
    secretName: storage-config
- configMap:
    defaultMode: 420
    name: tc-config
  name: tc-config
- name: etcd-config
  secret:
    defaultMode: 420
    secretName: model-serving-etcd
- name: tls-certs
  secret:
    defaultMode: 420
    secretName: etcd-client-certificate
- name: kube-api-access-8gc8b
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 3607
        path: token
    - configMap:
        items:
        - key: ca.crt
          path: ca.crt
        name: kube-root-ca.crt
    - downwardAPI:
        items:
        - fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
          path: namespace
@lizzzcai apologies I missed this. What do the contents of your model-serving-etcd secret look like? Based on the error, it doesn't look related to enabling TLS for the ModelMesh service itself, but rather to the TLS config for connecting to etcd.
It looks like you must have "client_key_file": "tls.key" in your etcd_connection JSON but no corresponding key named tls.key with the private key contents in the same secret (I'm assuming here you are configuring client authentication to the etcd cluster).
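
One way to check for this mismatch is to compare the file names referenced in the etcd_connection JSON against the keys actually present in the secret. The following is a minimal sketch, not part of ModelMesh: the helper name `missing_secret_keys` is hypothetical, only the `client_key_file` field is confirmed by this thread, and the other two field names (`certificate_file`, `client_certificate_file`) are assumptions about the config format. It expects the secret's entries already base64-decoded into a dict.

```python
import json

# Fields in the etcd connection JSON that name files. Only "client_key_file"
# is confirmed in this thread; the other two are assumed for illustration.
FILE_FIELDS = ("certificate_file", "client_key_file", "client_certificate_file")


def missing_secret_keys(secret_data: dict) -> list:
    """Return file names referenced by etcd_connection but absent from the secret.

    secret_data maps each secret key name to its decoded string contents.
    """
    config = json.loads(secret_data["etcd_connection"])
    return [
        config[field]
        for field in FILE_FIELDS
        if field in config and config[field] not in secret_data
    ]


# Hypothetical secret reproducing the situation described above: the JSON
# names tls.key, but the secret has no entry called tls.key.
broken = {
    "etcd_connection": json.dumps({
        "endpoints": "https://etcd:2379",  # placeholder endpoint
        "client_key_file": "tls.key",
    })
}
print(missing_secret_keys(broken))  # prints ['tls.key']
```

An empty list would mean every referenced file has a matching key in the secret, in which case the cause of the `Can't find certificate file` error lies elsewhere.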
Hi @njhill, sorry for the late reply. I will try the mTLS setup again today and update you on this issue.