lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.


Merged Model from Huggingface runs fine with fastchat CLI but not when using service worker

heli-sdsu opened this issue

I am running FastChat on Kubernetes. I have a worker for the controller, one for the FastChat API server, and a (GPU) worker for each of the models. I pulled this model from Hugging Face (downloaded with huggingface-cli): https://huggingface.co/Rmote6603/MedPrescription-FineTuning. When I run the FastChat CLI command and type in my prompt, it works perfectly fine, as expected:
python3.9 -m fastchat.serve.cli --model-path MedPrescription-FineTuning

[Screenshot: CLI chat session responding normally]

However, when I serve the model with fastchat.serve.model_worker, the chat completion API does not work at all and returns an error, even though the v1/models API works, as shown in the photo below:
python3.9 -m fastchat.serve.model_worker --model-path MedPrescription-FineTuning --worker-address http://localhost:21002 --port 21002

[Screenshot: v1/models response listing the model]
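For reference, the controller and the OpenAI-compatible API server are started with the standard FastChat commands, roughly as follows (the ports shown here are the FastChat defaults; the exact hosts in my cluster differ):

python3.9 -m fastchat.serve.controller --host 0.0.0.0 --port 21001
python3.9 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000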

When I run this POST request,
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization:Bearer API-TOKEN" -d '{ "model": "MedPrescription-FineTuning", "messages": [{"role": "user", "content": "Hello! What is your name?"}] }'
It first times out:

[Screenshot: the request timing out]

Then it gives me this network error:

{"object":"error","message":"**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**\n\n(probability tensor contains either inf, nan or element < 0)","code":50001}

I was wondering if anyone else has run into this issue before. Does it have anything to do with Hugging Face, the model weights, or some limitation in FastChat? I am only having issues with this merged Mistral model.

Update: when I use the web UI to host the model, this is what I get. I suppose the gateway time-out response is due to the model not knowing when to stop generating.

[Screenshot: web UI response]
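If the stopping theory is right, one thing worth checking (a sketch; it assumes the merged repo ships its own tokenizer and, optionally, a generation_config.json) is whether the merged model still has a sensible EOS token configured:

from transformers import AutoTokenizer, GenerationConfig

model_path = "MedPrescription-FineTuning"  # local path to the merged model

tokenizer = AutoTokenizer.from_pretrained(model_path)
# For a Mistral-based model this is normally "</s>" with id 2.
print("tokenizer eos_token:", tokenizer.eos_token, tokenizer.eos_token_id)

try:
    gen_config = GenerationConfig.from_pretrained(model_path)
    print("generation_config eos_token_id:", gen_config.eos_token_id)
except OSError:
    print("no generation_config.json in the merged repo")

If the EOS id is missing or wrong there, generation would keep going until the token limit, which would line up with the gateway timeout.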