OOMKilled Error

Question

tjandy98 opened this issue 2 years ago · comments

When running a ServingRuntime with 2 replicas, models are not loaded and unloaded correctly, which leads to an OOMKilled error.
In this example, the model size is ~891.7 MB. Four InferenceServices were created one at a time. As the listing below shows (number of models -> memory usage), the models were not spread across both replicas. Since Pod 2 had only 1 model loaded and had sufficient memory available, I expected the fourth model to be loaded in Pod 2. However, that was not the case.

Pod 1:
2 models -> 3.1GiB
3 models -> 4GiB
4 models -> crash

Pod 2:
Only 1 copy of model loaded

mlserver-adapter container logs:


2022-12-16T15:05:38+08:00 1.6711743387692351e+09	INFO	MLServer Adapter.MLServer Adapter Server.Load Model	Setting 'SizeInBytes' to a multiple of model disk size	{"model_id": "de-id-55__isvc-60ac5fc3c4", "SizeInBytes": 1118056985, "disk_size": 894445588, "multiplier": 1.25}
2022-12-16T15:05:38+08:00 1.6711743387692757e+09	INFO	MLServer Adapter.MLServer Adapter Server.Load Model	MLServer model loaded	{"model_id": "de-id-55__isvc-60ac5fc3c4"}
2022-12-16T15:08:00+08:00 1.6711744809072714e+09	INFO	MLServer Adapter.MLServer Adapter Server.Load Model	Using model type	{"model_id": "de-id-66__isvc-60ac5fc3c4", "model_type": "custom"}
2022-12-16T15:08:00+08:00 1.6711744809073634e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	Reading storage credentials
2022-12-16T15:08:00+08:00 1.6711744809074411e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	found existing client in cache	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e"}
2022-12-16T15:08:00+08:00 1.6711744809523559e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	found objects to download	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/", "count": 7}
2022-12-16T15:08:00+08:00 1.6711744809525683e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/classifier_pipeline.pkl", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/classifier_pipeline.pkl"}
2022-12-16T15:08:00+08:00 1.6711744809526064e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/config.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/config.json"}
2022-12-16T15:08:00+08:00 1.6711744809526336e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/model-settings.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/model-settings.json"}
2022-12-16T15:08:00+08:00 1.6711744809526622e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/pytorch_model.bin", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/pytorch_model.bin"}
2022-12-16T15:08:00+08:00 1.671174480952697e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/special_tokens_map.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/special_tokens_map.json"}
2022-12-16T15:08:00+08:00 1.6711744809527254e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/tokenizer.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/tokenizer.json"}
2022-12-16T15:08:00+08:00 1.6711744809527464e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/tokenizer_config.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/tokenizer_config.json"}
2022-12-16T15:08:36+08:00 1.6711745169826386e+09	ERROR	MLServer Adapter.MLServer Adapter Server.Load Model	MLServer failed to load model	{"model_id": "de-id-66__isvc-60ac5fc3c4", "error": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2022-12-16T15:08:36+08:00 github.com/kserve/modelmesh-runtime-adapter/internal/proto/mmesh._ModelRuntime_LoadModel_Handler
2022-12-16T15:08:36+08:00 	/opt/app/internal/proto/mmesh/model-runtime_grpc.pb.go:181
2022-12-16T15:08:36+08:00 google.golang.org/grpc.(*Server).processUnaryRPC
2022-12-16T15:08:36+08:00 	/root/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1282
2022-12-16T15:08:36+08:00 google.golang.org/grpc.(*Server).handleStream
2022-12-16T15:08:36+08:00 	/root/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1619
2022-12-16T15:08:36+08:00 google.golang.org/grpc.(*Server).serveStreams.func1.2
2022-12-16T15:08:36+08:00 	/root/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:921
2022-12-16T15:08:38+08:00 1.6711745180047865e+09	ERROR	MLServer Adapter.MLServer Adapter Server	Failed to unload model from MLServer	{"model_id": "de-id-66__isvc-60ac5fc3c4", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:8001: connect: connection refused\""}
2022-12-16T15:08:38+08:00 github.com/kserve/modelmesh-runtime-adapter/internal/proto/mmesh._ModelRuntime_UnloadModel_Handler
2022-12-16T15:08:38+08:00 	/opt/app/internal/proto/mmesh/model-runtime_grpc.pb.go:199
2022-12-16T15:08:38+08:00 google.golang.org/grpc.(*Server).processUnaryRPC
2022-12-16T15:08:38+08:00 	/root/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1282
2022-12-16T15:08:38+08:00 google.golang.org/grpc.(*Server).handleStream
2022-12-16T15:08:38+08:00 	/root/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1619
2022-12-16T15:08:38+08:00 google.golang.org/grpc.(*Server).serveStreams.func1.2
2022-12-16T15:08:38+08:00 	/root/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:921
2022-12-16T15:08:39+08:00 1.6711745190083663e+09	ERROR	MLServer Adapter.MLServer Adapter Server	Failed to unload model from MLServer	{"model_id": "de-id-66__isvc-60ac5fc3c4", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:8001: connect: connection refused\""}
2022-12-16T15:08:39+08:00 github.com/kserve/modelmesh-runtime-adapter/internal/proto/mmesh._ModelRuntime_UnloadModel_Handler
2022-12-16T15:08:39+08:00 	/opt/app/internal/proto/mmesh/model-runtime_grpc.pb.go:199
2022-12-16T15:08:39+08:00 google.golang.org/grpc.(*Server).processUnaryRPC
2022-12-16T15:08:39+08:00 	/root/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1282
2022-12-16T15:08:39+08:00 google.golang.org/grpc.(*Server).handleStream
2022-12-16T15:08:39+08:00 	/root/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1619
2022-12-16T15:08:39+08:00 google.golang.org/grpc.(*Server).serveStreams.func1.2
2022-12-16T15:08:39+08:00 	/root/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:921
2022-12-16T15:08:49+08:00 1.6711745290097501e+09	INFO	MLServer Adapter.MLServer Adapter Server	Unload request for model not found in MLServer	{"error": "rpc error: code = NotFound desc = Model de-id-66__isvc-60ac5fc3c4 not found", "model_id": "de-id-66__isvc-60ac5fc3c4"}

mm container logs:

2022-12-16T15:05:05+08:00 {"instant":{"epochSecond":1671174305,"nanoOfSecond":669210223},"thread":"model-load-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Starting load for model de-id-55__isvc-60ac5fc3c4 type=rt:dep-host-11","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:05:05+08:00 {"instant":{"epochSecond":1671174305,"nanoOfSecond":674346508},"thread":"mm-task-thread-9","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Published new instance record: InstanceRecord [lruTime=1671169768301 (76 minutes ago), count=3, capacity=575816, used=273087 (47%), loc=10.250.94.161, zone=<none>, labels=[mt:custom, mt:custom:1, rt:dep-host-11], startTime=1671173325261 (16 minutes ago), vers=0, loadThreads=1, loadInProg=1, reqsPerMin=0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":48,"threadPriority":5}
2022-12-16T15:05:38+08:00 {"instant":{"epochSecond":1671174338,"nanoOfSecond":770227346},"thread":"model-load-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Load of model de-id-55__isvc-60ac5fc3c4 type=rt:dep-host-11 completed in 33100ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:05:38+08:00 {"instant":{"epochSecond":1671174338,"nanoOfSecond":770733818},"thread":"model-load-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Model de-id-55__isvc-60ac5fc3c4 loading completed successfully","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:05:38+08:00 {"instant":{"epochSecond":1671174338,"nanoOfSecond":771008980},"thread":"model-load-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Model de-id-55__isvc-60ac5fc3c4 size = 136482 units, ~1066MiB","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:05:38+08:00 {"instant":{"epochSecond":1671174338,"nanoOfSecond":778074687},"thread":"mm-task-thread-9","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Published new instance record: InstanceRecord [lruTime=1671169768301 (76 minutes ago), count=3, capacity=575816, used=409446 (71%), loc=10.250.94.161, zone=<none>, labels=[mt:custom, mt:custom:1, rt:dep-host-11], startTime=1671173325261 (17 minutes ago), vers=0, loadThreads=1, loadInProg=0, reqsPerMin=0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":48,"threadPriority":5}
2022-12-16T15:06:15+08:00 {"instant":{"epochSecond":1671174375,"nanoOfSecond":3748926},"thread":"invoke-ex-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Attempting to add second copy of model de-id-55__isvc-60ac5fc3c4 in another instance since it is in use","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":57,"threadPriority":5}
2022-12-16T15:06:51+08:00 {"instant":{"epochSecond":1671174411,"nanoOfSecond":773707170},"thread":"janitor-task","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Janitor cache pruning task took 0ms for 3 entries","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":47,"threadPriority":5}
2022-12-16T15:06:51+08:00 {"instant":{"epochSecond":1671174411,"nanoOfSecond":774445128},"thread":"janitor-task","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Janitor registry pruning task took 0ms for 3/8 entries","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":47,"threadPriority":5}
2022-12-16T15:08:00+08:00 {"instant":{"epochSecond":1671174480,"nanoOfSecond":899406533},"thread":"mmesh-req-thread-3","level":"INFO","loggerName":"com.ibm.watson.modelmesh.VModelManager","message":"Added new VModel de-id-66 pointing to model de-id-66__isvc-60ac5fc3c4","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":56,"threadPriority":5}
2022-12-16T15:08:00+08:00 {"instant":{"epochSecond":1671174480,"nanoOfSecond":905774032},"thread":"invoke-ex-de-id-66__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"About to enqueue load for model de-id-66__isvc-60ac5fc3c4 with initial weight 123 units (~1MiB), with priority 1671170880884","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":56,"threadPriority":5}
2022-12-16T15:08:00+08:00 {"instant":{"epochSecond":1671174480,"nanoOfSecond":906271861},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Starting load for model de-id-66__isvc-60ac5fc3c4 type=rt:dep-host-11","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:00+08:00 {"instant":{"epochSecond":1671174480,"nanoOfSecond":911276807},"thread":"mm-task-thread-7","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Published new instance record: InstanceRecord [lruTime=1671169768301 (79 minutes ago), count=4, capacity=575816, used=409569 (71%), loc=10.250.94.161, zone=<none>, labels=[mt:custom, mt:custom:1, rt:dep-host-11], startTime=1671173325261 (19 minutes ago), vers=0, loadThreads=1, loadInProg=1, reqsPerMin=0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":46,"threadPriority":5}
2022-12-16T15:08:36+08:00 {"instant":{"epochSecond":1671174516,"nanoOfSecond":985824480},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"ERROR","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Load failed for model de-id-66__isvc-60ac5fc3c4 type=rt:dep-host-11 after 36078ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:37+08:00 {"instant":{"epochSecond":1671174516,"nanoOfSecond":986546840},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"ERROR","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Model load failed for de-id-66__isvc-60ac5fc3c4 type=rt:dep-host-11","thrown":{"commonElementCount":0,"localizedMessage":"UNAVAILABLE: Failed to load Model due to MLServer runtime error: rpc error: code = Unavailable desc = error reading from server: EOF","message":"UNAVAILABLE: Failed to load Model due to MLServer runtime error: rpc error: code = Unavailable desc = error reading from server: EOF","name":"io.grpc.StatusRuntimeException","extendedStackTrace":[{"class":"io.grpc.Status","method":"asRuntimeException","file":"Status.java","line":535,"exact":false,"location":"grpc-api-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.stub.ClientCalls$UnaryStreamToFuture","method":"onClose","file":"ClientCalls.java","line":542,"exact":false,"location":"grpc-stub-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ClientCallImpl","method":"closeObserver","file":"ClientCallImpl.java","line":562,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ClientCallImpl","method":"access$300","file":"ClientCallImpl.java","line":70,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed","method":"runInternal","file":"ClientCallImpl.java","line":743,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed","method":"runInContext","file":"ClientCallImpl.java","line":722,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ContextRunnable","method":"run","file":"ContextRunnable.java","line":37,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.SerializingExecutor","method":"run","file":"SerializingExecutor.java","line":133,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"java.util.concurrent.ThreadPoolExecutor","method":"runWorker","file":"ThreadPoolExecutor.java","line":1136,"exact":false,"location":"?","version":"?"},{"class":"java.util.concurrent.ThreadPoolExecutor$Worker","method":"run","file":"ThreadPoolExecutor.java","line":635,"exact":false,"location":"?","version":"?"},{"class":"java.lang.Thread","method":"run","file":"Thread.java","line":833,"exact":false,"location":"?","version":"?"}]},"endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:37+08:00 {"instant":{"epochSecond":1671174517,"nanoOfSecond":2424372},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Delaying unload of model de-id-66__isvc-60ac5fc3c4 for at least 1000ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:37+08:00 {"instant":{"epochSecond":1671174517,"nanoOfSecond":11694103},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"ModelRecord updated for de-id-66__isvc-60ac5fc3c4: locations={}, failures={458b8f-m2cgf=1671174516986}","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:38+08:00 {"instant":{"epochSecond":1671174518,"nanoOfSecond":3067017},"thread":"mm-task-thread-3","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Initiating unload of model de-id-66__isvc-60ac5fc3c4","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":36,"threadPriority":5}
2022-12-16T15:08:38+08:00 {"instant":{"epochSecond":1671174518,"nanoOfSecond":6060163},"thread":"grpc-default-executor-7","level":"WARN","loggerName":"com.ibm.watson.modelmesh.SidecarModelMesh","message":"Unload of model de-id-66__isvc-60ac5fc3c4 failed, queueing 90 more time(s) for retry: UNAVAILABLE: Failed to unload model from MLServer","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":81,"threadPriority":5}
2022-12-16T15:08:49+08:00 {"instant":{"epochSecond":1671174529,"nanoOfSecond":370139037},"thread":"grpc-default-executor-7","level":"INFO","loggerName":"com.ibm.watson.modelmesh.SidecarModelMesh","message":"Model de-id-66__isvc-60ac5fc3c4 unloaded successfully following retry","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":81,"threadPriority":5}
2022-12-16T15:08:49+08:00 {"instant":{"epochSecond":1671174529,"nanoOfSecond":370627156},"thread":"grpc-default-executor-7","level":"INFO","loggerName":"com.ibm.watson.modelmesh.SidecarModelMesh","message":"Model de-id-66__isvc-60ac5fc3c4 unloaded successfully following retry","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":81,"threadPriority":5}
2022-12-16T15:08:49+08:00 {"instant":{"epochSecond":1671174529,"nanoOfSecond":370910525},"thread":"grpc-default-executor-7","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Unload of de-id-66__isvc-60ac5fc3c4 completed in 11367ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":81,"threadPriority":5}
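
The sizes reported in the two logs above can be cross-checked with a little arithmetic. The sketch below uses only values that appear in the logs. The 8 KiB unit size and the round-up step are assumptions inferred from the logged "136482 units, ~1066MiB"; the logs do not state them directly.

```python
import math

# Assumption: one cache "unit" is 8 KiB. This is inferred from the mm log line
# "size = 136482 units, ~1066MiB" and is not stated anywhere in the logs.
UNIT_BYTES = 8 * 1024

# Values copied from the mlserver-adapter log.
disk_size = 894445588
multiplier = 1.25
size_in_bytes = int(disk_size * multiplier)

# Assumption: the size is rounded up to a whole number of units.
units = math.ceil(size_in_bytes / UNIT_BYTES)
mib = units * UNIT_BYTES / 2**20

# Values copied from the mm log (InstanceRecord after the third model loaded).
capacity_units = 575816
used_units = 409446

# What the accounting would show if a fourth model of the same size were added.
projected_used = used_units + units
projected_fraction = projected_used / capacity_units

print(f"SizeInBytes = {size_in_bytes}")
print(f"model size  = {units} units, ~{int(mib)}MiB")
print(f"projected   = {projected_used} of {capacity_units} units "
      f"({projected_fraction:.0%})")
```

Under these assumptions, the cache accounting for the instance in the mm log would still show room for a fourth model of the same size (roughly 95% of capacity), even though the observed memory usage of Pod 1 was already 4GiB with three models. That gap between estimated and observed size may be part of why the load was attempted there.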

Version 0.9.0

Lize Cai · Answer 1 · Fri Dec 16 2022 14:42:55 GMT+0800 (China Standard Time)

Hi @njhill, could you help take a look at this issue? We are evaluating the cache management behaviour of ModelMesh.
We expect the models to be distributed among the replicas. If memory is insufficient, we expect an old model to be unloaded to release enough memory for the new model.

Nick Hill · Answer 2 · Sat Feb 18 2023 09:27:01 GMT+0800 (China Standard Time)

@tjandy98 @lizzzcai apologies for taking so long to respond to this.

What you're seeing looks a little unexpected and I would have to dig a bit more to see what's going on, but in general the dispersion of models is not strictly coordinated and intentionally has some randomness. It was designed for when the model size is pretty small relative to the capacity of each pod, so it evens out statistically. If you can only fit a small number of models per pod then things won't necessarily be precisely balanced and you could hit situations like the one you're observing.

All of the runtimes continually publish their stats (i.e. related to their cache and request load), and each uses this combined list of stats to determine where to place models. A shortlist is determined based on a "distance metric" from the best candidate (typically the one with the most space, but there are other factors too), and then a pod is chosen randomly from this list. Of course, if there are only two pods, the shortlist will contain either one or two of them.
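
As a rough illustration of the shortlist-then-random-pick idea described above, here is a minimal sketch. It is not ModelMesh's actual implementation: the names `PodStats`, `shortlist`, `choose_pod` and `max_distance` are hypothetical, the "distance metric" is reduced to free cache units only, and the stats for `pod-2` are made up for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class PodStats:
    """Hypothetical, simplified view of the stats a pod publishes."""
    name: str
    capacity: int  # cache capacity in units
    used: int      # cache units currently in use

    @property
    def free(self) -> int:
        return self.capacity - self.used


def shortlist(pods, max_distance):
    """Keep pods whose free space is within max_distance units of the best pod."""
    best = max(pods, key=lambda p: p.free)
    return [p for p in pods if best.free - p.free <= max_distance]


def choose_pod(pods, max_distance, rng=random):
    """Pick a pod at random from the shortlist."""
    return rng.choice(shortlist(pods, max_distance))


pods = [
    PodStats("pod-1", capacity=575816, used=409446),  # figures from the mm log
    PodStats("pod-2", capacity=575816, used=136482),  # hypothetical: one model
]

# With a tight threshold only the emptiest pod is shortlisted.
print([p.name for p in shortlist(pods, max_distance=0)])
# With a loose threshold both pods are shortlisted and the pick is random.
print([p.name for p in shortlist(pods, max_distance=500000)])
print(choose_pod(pods, max_distance=500000).name)
```

With only two pods, the outcome depends entirely on whether the second pod falls inside the threshold; if it does, each placement is a coin flip rather than a guaranteed choice of the emptier pod.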

I hope that makes sense.

Christian Kadner · Answer 3 · Sat Jan 20 2024 11:37:07 GMT+0800 (China Standard Time)

@tjandy98 @lizzzcai I am closing this issue as stale. Feel free to reopen with more (recent) information.