itlackey / ipex-arc-fastchat

Access HF models locally

Rachneet opened this issue

Hi,

Could you give an example of how to use locally stored models with the container?

Here are the docs for using the FastChat API:

https://github.com/lm-sys/FastChat#api

Once you have the container running, you can follow those examples to connect to the API.

If you need the HF API, you can enable it in startup.sh for now. I will add a flag for it soon.
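
For reference, here is a minimal sketch of querying the OpenAI-compatible endpoint that FastChat exposes. It assumes the API server is published on port 8000 (as in the compose file further down) and that the worker serves vicuna-7b-v1.3 (the model that shows up in the logs later in this thread); adjust the base URL and model name for your setup.

# Minimal sketch, not from this repo: query FastChat's OpenAI-compatible API.
# Assumes the API server is reachable on localhost:8000 and serves vicuna-7b-v1.3.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "vicuna-7b-v1.3",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])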

Thanks a lot for the quick reply. I have been looking at their docs. I have another issue: when I run the container interactively, I get the following error:

2023-09-25 19:11:35 | INFO | stdout | INFO:     127.0.0.1:50200 - "POST /worker_get_status HTTP/1.1" 404 Not Found
2023-09-25 19:11:35 | ERROR | controller | Get status fails: http://localhost:21002, <Response [404]>
2023-09-25 19:11:35 | INFO | controller | Remove stale worker: http://localhost:21002

And then it stays in the waiting state. Do you have any idea why this may happen?

This is my docker-compose file. I am using CUDA here since that is what I need:

version: "3.9"

services:
  llm_chat:
    build:
      context: .
    container_name: llm_chat
    volumes:
      - /home/abc/hf_models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "7860:7860"
      - "8000:8000"


This container is specifically built for Intel GPUs, so you shouldn't need it to get FastChat running on your system. You should be able to follow FastChat's standard install instructions and it should work with your Nvidia GPU.

Yes, you're correct, but this error does not relate to that. Do you have any insight into why the model worker gets registered but then does not work?

I am not seeing much detail in the logs you've provided, but it's very likely that the worker is failing to start because of the hardware mismatch.
As much as I appreciate you using this container, it's really not what you need for your system.
You can run FastChat without Docker very easily since you have a CUDA device.

Thanks for the insight. I see this particular error:

llm_chat  | 2023-09-26 07:23:08 | INFO | stdout | INFO:     127.0.0.1:38810 - "POST /refresh_all_workers HTTP/1.1" 200 OK
llm_chat  | 2023-09-26 07:23:08 | INFO | stdout | INFO:     127.0.0.1:38820 - "GET /list_models HTTP/1.1" 200 OK
llm_chat  | 2023-09-26 07:23:08 | INFO | stdout | INFO:     127.0.0.1:38834 - "POST /get_worker_address HTTP/1.1" 200 OK
llm_chat  | Waiting for model...
llm_chat  | '(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /lmsys/vicuna-7b-v1.3/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fe341026910>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 71b19d48-1d06-43e0-8a5f-be411bc85bc9)')' thrown while requesting HEAD https://huggingface.co/lmsys/vicuna-7b-v1.3/resolve/main/tokenizer_config.json
llm_chat  | 2023-09-26 07:23:09 | WARNING | huggingface_hub.utils._http | '(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /lmsys/vicuna-7b-v1.3/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fe341026910>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 71b19d48-1d06-43e0-8a5f-be411bc85bc9)')' thrown while requesting HEAD https://huggingface.co/lmsys/vicuna-7b-v1.3/resolve/main/tokenizer_config.json
llm_chat  | You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
llm_chat  | 2023-09-26 07:23:13 | INFO | stdout | INFO:     127.0.0.1:55518 - "POST /refresh_all_workers HTTP/1.1" 200 OK
llm_chat  | 2023-09-26 07:23:13 | INFO | stdout | INFO:     127.0.0.1:55532 - "GET /list_models HTTP/1.1" 200 OK
llm_chat  | 2023-09-26 07:23:13 | INFO | stdout | INFO:     127.0.0.1:55538 - "POST /get_worker_address HTTP/1.1" 200 OK
llm_chat  | Waiting for model...

which suggests that the model cannot be downloaded because the connection to huggingface.co times out. Any idea why this may be happening?

Solved by using the models saved locally.
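
In case it helps others: one way to do this is to pre-populate the mounted Hugging Face cache on a machine that can reach huggingface.co, so the worker loads the model from disk instead of downloading it. Below is a minimal sketch assuming the volume mount from the compose file above (/home/abc/hf_models mapped to /root/.cache/huggingface); the exact paths and model are examples, not confirmed from this thread.

# Minimal sketch: download the model snapshot into the host directory that is
# mounted as /root/.cache/huggingface inside the container. Paths are assumptions.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmsys/vicuna-7b-v1.3",
    cache_dir="/home/abc/hf_models/hub",  # becomes /root/.cache/huggingface/hub in the container
)

With the cache populated, setting HF_HUB_OFFLINE=1 in the container environment also stops the hub client from trying to reach huggingface.co at startup.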

Good job working through that. I did not see the timeout issue before.