awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference

MMS server gets stuck while registering a model and logs "worker pid is not available yet"

suchith-sixsense opened this issue

Hello,

I am facing a weird issue while running multi-model-server in a Docker container. The setup for inference is as follows:

I have two processes: one is the multi-model-server and the other listens for messages from a queue. Whenever a user triggers an inference, the message is received by the queue process, which then downloads the image, registers the model in multi-model-server, and runs the prediction. Communication between the queue-based service and the MMS service goes through 0.0.0.0. The whole thing runs in one Docker container.

I hope the setup is clear. The problem I am facing is that registering models works for the first 2. If I increase the number of models to 3, the POST API (used for registration) does not respond at all.
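For context, the registration call the queue process makes looks essentially like this (a minimal sketch using requests; the model name and client wrapper are illustrative, but the endpoint and payload match the logs below):

import requests

# Management API endpoint from config.properties (management_address, port 8081)
MANAGEMENT_API = "http://0.0.0.0:8081/models/"

# Registration payload as it appears in the queue-process logs below
payload = {
    "url": "test_v1.33",          # folder under the model_store
    "model_name": "test_v1.33",
    "initial_workers": 1,
    "response_timeout": 120,
    "synchronous": True,
}

# The MMS management API accepts these as query parameters; the actual client
# code may pass them differently. For the 3rd model this request never gets a response.
response = requests.post(MANAGEMENT_API, params=payload)
print(response.status_code, response.reason)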

My queue-based process logs look like this (nothing gets logged after the last line):

2023-08-17 19:59:33.188 | INFO     | manage:put:92 - test_v1.33 not registeted. Registering model = test_v1.33 with initial workers = 1

2023-08-17 19:59:33.188 | INFO     | manage:put:94 - Current cache = 2, max capacity = 5

2023-08-17 19:59:33.188 | INFO     | utils.api_handler:post:124 - Doing post request at api url = http://0.0.0.0:8081/models/ with payload = {'url': 'test_v1.33', 'model_name': 'test_v1.33', 'initial_workers': 1, 'response_timeout': 120, 'synchronous': True}

If the registration is successful, the usual logs look like this:

2023-08-17 19:58:02.536 | INFO     | manage:put:92 - Test_detection_12 not registeted. Registering model = Test_detection_12 with initial workers = 1

2023-08-17 19:58:02.536 | INFO     | manage:put:94 - Current cache = 1, max capacity = 5

2023-08-17 19:58:02.536 | INFO     | utils.api_handler:post:124 - Doing post request at api url = http://0.0.0.0:8081/models/ with payload = {'url': 'Test_detection_12', 'model_name': 'Test_detection_12', 'initial_workers': 1, 'response_timeout': 120, 'synchronous': True}

2023-08-17 19:58:04.666 | INFO     | utils.api_handler:post:126 - Received status code = 200, reason = OK

As you can see, when registration is successful, the last line with status_code = 200 is logged. To understand more, I checked mms_log.log; the relevant entries are given below.

2023-08-17T19:59:33,195 [WARN ] W-9002-test_v1.33 com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9002-test_v1.33
2023-08-17T19:59:33,195 [WARN ] W-9002-test_v1.33 com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9002-test_v1.33
2023-08-17T19:59:33,265 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /tmp/.mms.sock.9002 --handler /home/model-server/code/model_handler.py:handle --model-path /home/model-server/code/localstorage/models/test_v1.33 --model-name test_v1.33 --preload-model false --tmp-dir /tmp
2023-08-17T19:59:33,265 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /tmp/.mms.sock.9002 --handler /home/model-server/code/model_handler.py:handle --model-path /home/model-server/code/localstorage/models/test_v1.33 --model-name test_v1.33 --preload-model false --tmp-dir /tmp
2023-08-17T19:59:33,267 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /tmp/.mms.sock.9002
2023-08-17T19:59:33,267 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /tmp/.mms.sock.9002
2023-08-17T19:59:33,274 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 125
2023-08-17T19:59:33,274 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 125
2023-08-17T19:59:33,274 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2023-08-17T19:59:33,274 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2023-08-17T19:59:33,274 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.8.0
2023-08-17T19:59:33,274 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.8.0
2023-08-17T19:59:33,274 [DEBUG] W-9002-test_v1.33 com.amazonaws.ml.mms.wlm.WorkerThread - W-9002-test_v1.33 State change null -> WORKER_MODEL_LOADED
2023-08-17T19:59:33,274 [INFO ] epollEventLoopGroup-3-2 com.amazonaws.ml.mms.wlm.ModelManager - Model test_v1.33 loaded.
2023-08-17T19:59:33,274 [DEBUG] W-9002-test_v1.33 com.amazonaws.ml.mms.wlm.WorkerThread - W-9002-test_v1.33 State change null -> WORKER_MODEL_LOADED
2023-08-17T19:59:33,274 [INFO ] epollEventLoopGroup-3-2 com.amazonaws.ml.mms.wlm.ModelManager - Model test_v1.33 loaded.
2023-08-17T19:59:33,274 [DEBUG] epollEventLoopGroup-3-2 com.amazonaws.ml.mms.wlm.ModelManager - updateModel: test_v1.33, count: 1
2023-08-17T19:59:33,274 [DEBUG] epollEventLoopGroup-3-2 com.amazonaws.ml.mms.wlm.ModelManager - updateModel: test_v1.33, count: 1
2023-08-17T19:59:33,274 [INFO ] W-9002-test_v1.33 MMS_METRICS - W-9002-test_v1.33.ms:82|#Level:Host|#hostname:4690944d4521,timestamp:1692302373
2023-08-17T19:59:33,275 [DEBUG] W-9002-test_v1.33 com.amazonaws.ml.mms.wlm.WorkerThread - W-9002-test_v1.33 State change null -> WORKER_STARTED
2023-08-17T19:59:33,275 [DEBUG] W-9002-test_v1.33 com.amazonaws.ml.mms.wlm.WorkerThread - W-9002-test_v1.33 State change null -> WORKER_STARTED
2023-08-17T19:59:33,276 [INFO ] W-9002-test_v1.33 com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /tmp/.mms.sock.9002
2023-08-17T19:59:33,276 [INFO ] W-9002-test_v1.33 com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /tmp/.mms.sock.9002
2023-08-17T19:59:33,277 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /tmp/.mms.sock.9002.
2023-08-17T19:59:33,277 [INFO ] W-9002-test_v1.33-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /tmp/.mms.sock.9002.
2023-08-17T19:59:44,347 [INFO ] W-9002-test_v1.33 com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 11064
2023-08-17T19:59:47,192 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.

As you can see, Backend response time: 11064 means the model was registered, but it is followed by the message worker pid is not available yet. After this, nothing more is logged by multi-model-server and MMS appears to be stuck. I feel this is the issue, and I wanted to understand:

  • Why is this message being logged?
  • What does it mean, and how can it be resolved?
  • Why was this message not logged for the first 2 registrations, which were successful? (It only appears after 2 models have already been registered.)

My config.properties file is given below

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
preload_model=false
default_service_handler=/home/model-server/code/model_handler.py:handle
model_store=/home/model-server/code/localstorage/models

I also went inside the Docker container using docker exec and ran the following command, but I get no response and it just hangs:

curl http://0.0.0.0:8080/ping
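The same check can be scripted with an explicit timeout so the hang shows up as an exception instead of blocking indefinitely (a minimal sketch, assuming the requests package is available inside the container):

import requests

# Inference port from config.properties (inference_address, port 8080)
try:
    r = requests.get("http://0.0.0.0:8080/ping", timeout=10)
    print(r.status_code, r.text)
except requests.exceptions.Timeout:
    print("ping timed out: the server is not responding")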

Do let me know if you need more information.

Thanks.