Due to the Python backend, FauxPilot returns an inference error.
leemgs opened this issue · comments
Issue Description
Due to the Python backend, FauxPilot returns an inference error as follows.
Expected Result
Generated source code should be produced as the result of the inference task, e.g.:
int hello( ){ return "Hello"; }
However, the Python-based FauxPilot Server backend does not produce the correct result.
How to Reproduce
- git clone the "fauxpilot/fauxpilot" repository
- ./setup.sh
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ ./setup.sh
.env already exists, do you want to delete .env and recreate it? [y/n] y
Deleting .env
Checking for curl ...
/usr/bin/curl
Checking for zstd ...
/home/invain/anaconda3/envs/deepspeed/bin/zstd
Checking for docker ...
/usr/bin/docker
Enter number of GPUs [1]: 1
External port for the API [5000]:
Address for Triton [triton]:
Port of Triton host [8001]:
Where do you want to save your models [/work/qtlab/fauxpilot/models]?
Choose your backend:
[1] FasterTransformer backend (faster, but limited models)
[2] Python backend (slower, but more models, and allows loading with int8)
Enter your choice [1]: 2
Models available:
[1] codegen-350M-mono (1GB total VRAM required; Python-only)
[2] codegen-350M-multi (1GB total VRAM required; multi-language)
[3] codegen-2B-mono (4GB total VRAM required; Python-only)
[4] codegen-2B-multi (4GB total VRAM required; multi-language)
Enter your choice [4]: 2
Do you want to share your huggingface cache between host and docker container? y/n [n]: 2
Do you want to use int8? y/n [y]: y
Config written to /work/qtlab/fauxpilot/models/py-Salesforce-codegen-350M-multi/py-model/config.pbtxt
docker: 'compose' is not a docker command.
See 'docker --help'
Config complete, do you want to run FauxPilot? [y/n] y
- ./launch.sh
......................... (output omitted) .........................
Attaching to fauxpilot-copilot_proxy-1, fauxpilot-triton-1
fauxpilot-triton-1 |
fauxpilot-triton-1 | =============================
fauxpilot-triton-1 | == Triton Inference Server ==
fauxpilot-triton-1 | =============================
fauxpilot-triton-1 |
fauxpilot-triton-1 | NVIDIA Release 22.06 (build 39726160)
fauxpilot-triton-1 | Triton Server Version 2.23.0
fauxpilot-triton-1 |
fauxpilot-triton-1 | Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
fauxpilot-triton-1 |
fauxpilot-triton-1 | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
fauxpilot-triton-1 |
fauxpilot-triton-1 | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
fauxpilot-triton-1 | By pulling and using the container, you accept the terms and conditions of this license:
fauxpilot-triton-1 | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
fauxpilot-copilot_proxy-1 | INFO: Started server process [1]
fauxpilot-copilot_proxy-1 | INFO: Waiting for application startup.
fauxpilot-copilot_proxy-1 | INFO: Application startup complete.
fauxpilot-copilot_proxy-1 | INFO: Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:07.342335 88 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fc820000000' with size 268435456
fauxpilot-triton-1 | I0102 07:53:07.342492 88 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
fauxpilot-triton-1 | I0102 07:53:07.343283 88 model_repository_manager.cc:1191] loading: py-model:1
fauxpilot-triton-1 | I0102 07:53:07.550677 88 python_be.cc:1774] TRITONBACKEND_ModelInstanceInitialize: py-model_0 (CPU device 0)
fauxpilot-triton-1 | Cuda available? True
fauxpilot-triton-1 | is_half: True, int8: True, auto_device_map: True
fauxpilot-triton-1 | Model Salesforce/codegen-350M-multi Loaded. Footprint: 545652736
fauxpilot-triton-1 | I0102 07:53:26.800350 88 model_repository_manager.cc:1345] successfully loaded 'py-model' version 1
fauxpilot-triton-1 | I0102 07:53:26.812968 88 server.cc:556]
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 | | Repository Agent | Path |
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:26.813087 88 server.cc:583]
fauxpilot-triton-1 | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | Backend | Path | Config |
fauxpilot-triton-1 | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
fauxpilot-triton-1 | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:26.818532 88 server.cc:626]
fauxpilot-triton-1 | +----------+---------+--------+
fauxpilot-triton-1 | | Model | Version | Status |
fauxpilot-triton-1 | +----------+---------+--------+
fauxpilot-triton-1 | | py-model | 1 | READY |
fauxpilot-triton-1 | +----------+---------+--------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:26.954685 88 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA TITAN Xp
fauxpilot-triton-1 | I0102 07:53:26.954864 88 tritonserver.cc:2159]
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | Option | Value |
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | server_id | triton |
fauxpilot-triton-1 | | server_version | 2.23.0 |
fauxpilot-triton-1 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
fauxpilot-triton-1 | | model_repository_path[0] | /model |
fauxpilot-triton-1 | | model_control_mode | MODE_NONE |
fauxpilot-triton-1 | | strict_model_config | 1 |
fauxpilot-triton-1 | | rate_limit | OFF |
fauxpilot-triton-1 | | pinned_memory_pool_byte_size | 268435456 |
fauxpilot-triton-1 | | cuda_memory_pool_byte_size{0} | 67108864 |
fauxpilot-triton-1 | | response_cache_byte_size | 0 |
fauxpilot-triton-1 | | min_supported_compute_capability | 6.0 |
fauxpilot-triton-1 | | strict_readiness | 1 |
fauxpilot-triton-1 | | exit_timeout | 30 |
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:26.963493 88 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
fauxpilot-triton-1 | I0102 07:53:26.982220 88 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
fauxpilot-triton-1 | I0102 07:53:27.026420 88 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
fauxpilot-copilot_proxy-1 | [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found
fauxpilot-copilot_proxy-1 | Returned completion in 19.145488739013672 ms
fauxpilot-copilot_proxy-1 | INFO: 2023-01-02 07:53:47,940 :: 172.18.0.1:60666 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
fauxpilot-copilot_proxy-1 | [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found
fauxpilot-copilot_proxy-1 | Returned completion in 39.86477851867676 ms
fauxpilot-copilot_proxy-1 | INFO: 2023-01-02 07:58:39,684 :: 172.18.0.1:41450 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
fauxpilot-copilot_proxy-1 | [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found
fauxpilot-copilot_proxy-1 | Returned completion in 1.708984375 ms
fauxpilot-copilot_proxy-1 | INFO: 2023-01-02 07:58:47,537 :: 172.18.0.1:50756 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
- Run a client
curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
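For reference, the same request can be built from Python. This is a minimal sketch using only the standard library; the endpoint and payload mirror the curl command above (a running FauxPilot server at localhost:5000 is assumed, and the helper name `build_completion_request` is illustrative, not part of FauxPilot).

```python
import json

def build_completion_request(prompt, max_tokens=50, temperature=0.1, stop=("\n\n",)):
    """Build the URL, headers, and JSON body for a FauxPilot completion request.

    Mirrors the curl command above; it does not perform the HTTP call itself.
    """
    url = "http://localhost:5000/v1/engines/codegen/completions"
    headers = {"Accept": "application/json", "Content-type": "application/json"}
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": list(stop),
    })
    return url, headers, body

# To actually send it (server must be running):
#   import urllib.request
#   url, headers, body = build_completion_request("int hello(){")
#   req = urllib.request.Request(url, data=body.encode(), headers=headers)
#   print(urllib.request.urlopen(req).read().decode())
```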
Further Information
- I used the "Python backend" while running the "setup.sh" file.
I think you're sending the request for fastertransformer backend. That's what the error message says.
In the request body, set "model" to "py-model". I guess I should document this somewhere.
Do you mean this?
- model_name: str = "py-model"
- name: "py-model"
- backend: "python"
However, the execution result with the REST API is as follows. :(
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model_name":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-DpyzbjCEdBOptUzfmvlgHdMtxG4D7", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model-name":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-uDQI1w42r9rX4R6g5WTuZPfuaJbEo", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-d9YNLmAPJbVI5ZRczJmlrRIBpqza0", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
The main issue here is that the key should be "model" and not "model_name". Here, the backend was setting "model" to the default value of 'fastertransformer'. (The server logs will show that.)
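The failure mode described above can be sketched with a minimal stand-in for the proxy's request parsing. The field name "model" and the default value 'fastertransformer' come from this discussion; the function itself is illustrative, not FauxPilot's actual code.

```python
def resolve_model(request_body: dict) -> str:
    # The proxy reads the "model" key; if the client sends "model_name"
    # instead, that key is simply ignored and the default wins.
    return request_body.get("model", "fastertransformer")

# The client's original payload used the wrong key, so the request was
# routed to the (non-existent) fastertransformer model:
assert resolve_model({"model_name": "py-model"}) == "fastertransformer"
# With the correct key, the Python-backend model is selected:
assert resolve_model({"model": "py-model"}) == "py-model"
```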
#135 tries to address this (currently by showing a warning in the backend logs if user tries to access a non-existent model).
Additionally, the current code sets logprobs to 1 if not supplied (I feel this shouldn't be the case) and python backend doesn't yet support logprobs. #135 also addresses this.
Till the PR gets merged, the following curl request will work:
curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"], "logprobs": 0}' http://localhost:5000/v1/engines/codegen/completions
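Equivalently, the extra "logprobs": 0 field can be added to the JSON payload programmatically. A minimal sketch, with the field names taken from the curl command above:

```python
import json

# Request body for the py-model workaround: "model" selects the
# Python-backend model explicitly, and "logprobs": 0 avoids the
# unsupported-logprobs path until the PR is merged.
payload = {
    "model": "py-model",
    "prompt": "int hello(){",
    "max_tokens": 50,
    "temperature": 0.1,
    "stop": ["\n\n"],
    "logprobs": 0,
}
body = json.dumps(payload)
```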
That's odd. Could you provide us with the output of the command?
Even with your suggested change (i.e., "logprobs": 0), I still get an empty result:
- Log messages:
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"], "logprobs": 0}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-gdYgnWiPT4nsAOX4DXrLnHgDpTEAa", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ tree models/py-Salesforce-codegen-350M-multi/
models/py-Salesforce-codegen-350M-multi/
└── py-model
├── 1
│ ├── __pycache__
│ │ └── model.cpython-38.pyc
│ └── model.py
└── config.pbtxt
3 directories, 3 files
As stated when I posted this issue, I'm still looking for a reproducible way to get correct inference results with the Python backend.
Yepp. Thanks.