Merged Model from Huggingface runs fine with fastchat CLI but not when using service worker
heli-sdsu opened this issue · comments
I am running FastChat on Kubernetes. I have a worker for the controller, one for the FastChat API server, and a (GPU) worker for each model. I downloaded this model from Hugging Face (using huggingface-cli): https://huggingface.co/Rmote6603/MedPrescription-FineTuning. When I run the FastChat CLI command and type in my prompt, it works perfectly, as expected:
python3.9 -m fastchat.serve.cli --model-path MedPrescription-FineTuning
However, when I serve the same model through fastchat.serve.model_worker, the chat completion API fails with an error, even though the v1/models endpoint works, as shown in the photo below:
python3.9 -m fastchat.serve.model_worker --model-path MedPrescription-FineTuning --worker-address http://localhost:21002 --port 21002
When I run this POST request:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization:Bearer API-TOKEN" -d '{ "model": "MedPrescription-FineTuning", "messages": [{"role": "user", "content": "Hello! What is your name?"}] }'
It first times out, then returns a network error:

{"object":"error","message":"**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**\n\n(probability tensor contains either inf, nan or element < 0)","code":50001}
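For context, the inner complaint ("probability tensor contains either inf, nan or element < 0") is the check PyTorch's sampling raises when a non-finite logit reaches the softmax; with merged/half-precision models this is often caused by an overflowing logit, though I can't say that's the cause here. A minimal, framework-free sketch (pure Python, not FastChat's actual code) of how a single bad logit poisons the whole probability distribution:

```python
import math

def softmax(logits):
    """Naive softmax; mirrors what happens to logits before token sampling."""
    m = max(logits)                        # max-subtraction for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Healthy logits -> a valid probability distribution
good = softmax([2.0, 1.0, 0.1])
print(all(math.isfinite(p) and p >= 0 for p in good))  # True

# One NaN logit (e.g. produced by an overflow somewhere in the model)
# turns every probability into NaN -- exactly the condition the
# "probability tensor contains either inf, nan ..." check rejects.
bad = softmax([2.0, float("nan"), 0.1])
print(any(math.isnan(p) for p in bad))  # True
```

So the worker error likely means the model produced non-finite logits at generation time, not that the HTTP layer itself failed.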
I was wondering if anyone else has run into this issue before. Does it have anything to do with Hugging Face, the model weights, or some FastChat limitation? I am only having issues with this merged Mistral model.