lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.


Merged Model from Huggingface runs fine with fastchat CLI but not when using service worker

heli-sdsu opened this issue

I am running FastChat on Kubernetes. I have a worker for the controller, one for the FastChat API server, and a (GPU) worker for each of the models. I pulled this model from Hugging Face (downloaded with huggingface-cli): https://huggingface.co/Rmote6603/MedPrescription-FineTuning. When I run the FastChat CLI command and type in my prompt, it works perfectly fine, as expected:
python3.9 -m fastchat.serve.cli --model-path MedPrescription-FineTuning

[Screenshot: CLI chat session responding normally]

However, when I serve the model with fastchat.serve.model_worker, the chat completion API does not work at all and returns an error, even though the v1/models API works, as shown in the photo below:
python3.9 -m fastchat.serve.model_worker --model-path MedPrescription-FineTuning --worker-address http://localhost:21002 --port 21002

[Screenshot: v1/models response listing the model]
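For reference, the controller and the OpenAI-compatible API server are started with the standard FastChat commands, roughly as follows (the ports shown here are the FastChat defaults; the exact hosts in my cluster differ):

python3.9 -m fastchat.serve.controller --host 0.0.0.0 --port 21001
python3.9 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000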

When I run this POST request,
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization:Bearer API-TOKEN" -d '{ "model": "MedPrescription-FineTuning", "messages": [{"role": "user", "content": "Hello! What is your name?"}] }'
It first times out:

[Screenshot: the request timing out]

Then it gives me this network error:

{"object":"error","message":"**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**\n\n(probability tensor contains either inf, nan or element < 0)","code":50001}

I was wondering if anyone else has run into this issue before. Does it have anything to do with Hugging Face, the model weights, or some limitation in FastChat? I am only having issues with this merged Mistral model.

Update: when I use the web UI to host the model, this is what I get. I suppose the gateway time-out response is due to the model not knowing when to stop generating.

[Screenshot: web UI response]
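If the stopping theory is right, one thing worth checking (a sketch; it assumes the merged repo ships its own tokenizer and, optionally, a generation_config.json) is whether the merged model still has a sensible EOS token configured:

from transformers import AutoTokenizer, GenerationConfig

model_path = "MedPrescription-FineTuning"  # local path to the merged model

tokenizer = AutoTokenizer.from_pretrained(model_path)
# For a Mistral-based model this is normally "</s>" with id 2.
print("tokenizer eos_token:", tokenizer.eos_token, tokenizer.eos_token_id)

try:
    gen_config = GenerationConfig.from_pretrained(model_path)
    print("generation_config eos_token_id:", gen_config.eos_token_id)
except OSError:
    print("no generation_config.json in the merged repo")

If the EOS id is missing or wrong there, generation would keep going until the token limit, which would line up with the gateway timeout.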