Due to the Python backend, FauxPilot returns an inference error.
leemgs opened this issue · comments
Issue Description
Due to the Python backend, FauxPilot returns an inference error as follows.
Expected Result
Generated source code should be produced as the result of the inference task, e.g.:
int hello( ){ return "Hello"; }
However, the Python-based FauxPilot Server backend does not produce the correct result.
How to Reproduce
- git clone the "fauxpilot/fauxpilot" repository
- ./setup.sh
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ ./setup.sh
.env already exists, do you want to delete .env and recreate it? [y/n] y
Deleting .env
Checking for curl ...
/usr/bin/curl
Checking for zstd ...
/home/invain/anaconda3/envs/deepspeed/bin/zstd
Checking for docker ...
/usr/bin/docker
Enter number of GPUs [1]: 1
External port for the API [5000]:
Address for Triton [triton]:
Port of Triton host [8001]:
Where do you want to save your models [/work/qtlab/fauxpilot/models]?
Choose your backend:
[1] FasterTransformer backend (faster, but limited models)
[2] Python backend (slower, but more models, and allows loading with int8)
Enter your choice [1]: 2
Models available:
[1] codegen-350M-mono (1GB total VRAM required; Python-only)
[2] codegen-350M-multi (1GB total VRAM required; multi-language)
[3] codegen-2B-mono (4GB total VRAM required; Python-only)
[4] codegen-2B-multi (4GB total VRAM required; multi-language)
Enter your choice [4]: 2
Do you want to share your huggingface cache between host and docker container? y/n [n]: 2
Do you want to use int8? y/n [y]: y
Config written to /work/qtlab/fauxpilot/models/py-Salesforce-codegen-350M-multi/py-model/config.pbtxt
docker: 'compose' is not a docker command.
See 'docker --help'
Config complete, do you want to run FauxPilot? [y/n] y
- ./launch.sh
......................... (output omitted) .........................
Attaching to fauxpilot-copilot_proxy-1, fauxpilot-triton-1
fauxpilot-triton-1 |
fauxpilot-triton-1 | =============================
fauxpilot-triton-1 | == Triton Inference Server ==
fauxpilot-triton-1 | =============================
fauxpilot-triton-1 |
fauxpilot-triton-1 | NVIDIA Release 22.06 (build 39726160)
fauxpilot-triton-1 | Triton Server Version 2.23.0
fauxpilot-triton-1 |
fauxpilot-triton-1 | Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
fauxpilot-triton-1 |
fauxpilot-triton-1 | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
fauxpilot-triton-1 |
fauxpilot-triton-1 | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
fauxpilot-triton-1 | By pulling and using the container, you accept the terms and conditions of this license:
fauxpilot-triton-1 | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
fauxpilot-copilot_proxy-1 | INFO: Started server process [1]
fauxpilot-copilot_proxy-1 | INFO: Waiting for application startup.
fauxpilot-copilot_proxy-1 | INFO: Application startup complete.
fauxpilot-copilot_proxy-1 | INFO: Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:07.342335 88 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fc820000000' with size 268435456
fauxpilot-triton-1 | I0102 07:53:07.342492 88 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
fauxpilot-triton-1 | I0102 07:53:07.343283 88 model_repository_manager.cc:1191] loading: py-model:1
fauxpilot-triton-1 | I0102 07:53:07.550677 88 python_be.cc:1774] TRITONBACKEND_ModelInstanceInitialize: py-model_0 (CPU device 0)
fauxpilot-triton-1 | Cuda available? True
fauxpilot-triton-1 | is_half: True, int8: True, auto_device_map: True
fauxpilot-triton-1 | Model Salesforce/codegen-350M-multi Loaded. Footprint: 545652736
fauxpilot-triton-1 | I0102 07:53:26.800350 88 model_repository_manager.cc:1345] successfully loaded 'py-model' version 1
fauxpilot-triton-1 | I0102 07:53:26.812968 88 server.cc:556]
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 | | Repository Agent | Path |
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:26.813087 88 server.cc:583]
fauxpilot-triton-1 | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | Backend | Path | Config |
fauxpilot-triton-1 | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
fauxpilot-triton-1 | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:26.818532 88 server.cc:626]
fauxpilot-triton-1 | +----------+---------+--------+
fauxpilot-triton-1 | | Model | Version | Status |
fauxpilot-triton-1 | +----------+---------+--------+
fauxpilot-triton-1 | | py-model | 1 | READY |
fauxpilot-triton-1 | +----------+---------+--------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:26.954685 88 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA TITAN Xp
fauxpilot-triton-1 | I0102 07:53:26.954864 88 tritonserver.cc:2159]
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | Option | Value |
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | server_id | triton |
fauxpilot-triton-1 | | server_version | 2.23.0 |
fauxpilot-triton-1 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
fauxpilot-triton-1 | | model_repository_path[0] | /model |
fauxpilot-triton-1 | | model_control_mode | MODE_NONE |
fauxpilot-triton-1 | | strict_model_config | 1 |
fauxpilot-triton-1 | | rate_limit | OFF |
fauxpilot-triton-1 | | pinned_memory_pool_byte_size | 268435456 |
fauxpilot-triton-1 | | cuda_memory_pool_byte_size{0} | 67108864 |
fauxpilot-triton-1 | | response_cache_byte_size | 0 |
fauxpilot-triton-1 | | min_supported_compute_capability | 6.0 |
fauxpilot-triton-1 | | strict_readiness | 1 |
fauxpilot-triton-1 | | exit_timeout | 30 |
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0102 07:53:26.963493 88 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
fauxpilot-triton-1 | I0102 07:53:26.982220 88 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
fauxpilot-triton-1 | I0102 07:53:27.026420 88 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
fauxpilot-copilot_proxy-1 | [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found
fauxpilot-copilot_proxy-1 | Returned completion in 19.145488739013672 ms
fauxpilot-copilot_proxy-1 | INFO: 2023-01-02 07:53:47,940 :: 172.18.0.1:60666 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
fauxpilot-copilot_proxy-1 | [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found
fauxpilot-copilot_proxy-1 | Returned completion in 39.86477851867676 ms
fauxpilot-copilot_proxy-1 | INFO: 2023-01-02 07:58:39,684 :: 172.18.0.1:41450 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
fauxpilot-copilot_proxy-1 | [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found
fauxpilot-copilot_proxy-1 | Returned completion in 1.708984375 ms
fauxpilot-copilot_proxy-1 | INFO: 2023-01-02 07:58:47,537 :: 172.18.0.1:50756 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
- Run a client
curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
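For reference, the same request can be built from Python. This is a minimal sketch using only the standard library; the endpoint and payload mirror the curl command above (a running FauxPilot server at localhost:5000 is assumed, and the helper name `build_completion_request` is illustrative, not part of FauxPilot).

```python
import json

def build_completion_request(prompt, max_tokens=50, temperature=0.1, stop=("\n\n",)):
    """Build the URL, headers, and JSON body for a FauxPilot completion request.

    Mirrors the curl command above; it does not perform the HTTP call itself.
    """
    url = "http://localhost:5000/v1/engines/codegen/completions"
    headers = {"Accept": "application/json", "Content-type": "application/json"}
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": list(stop),
    })
    return url, headers, body

# To actually send it (server must be running):
#   import urllib.request
#   url, headers, body = build_completion_request("int hello(){")
#   req = urllib.request.Request(url, data=body.encode(), headers=headers)
#   print(urllib.request.urlopen(req).read().decode())
```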
Further Information
- I used the "Python backend" while running the "setup.sh" file.
I think you're sending the request for fastertransformer backend. That's what the error message says.
In the request body, set "model" to "py-model". I guess I should document this somewhere.
Do you mean this?
- model_name: str = "py-model"
- name: "py-model"
- backend: "python"
However, the execution result with the REST API is as follows. :(
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model_name":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-DpyzbjCEdBOptUzfmvlgHdMtxG4D7", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model-name":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-uDQI1w42r9rX4R6g5WTuZPfuaJbEo", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-d9YNLmAPJbVI5ZRczJmlrRIBpqza0", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
The main issue here is that the key should be "model" and not "model_name". Here, the backend was setting "model" to the default value of 'fastertransformer'. (The server logs will show that.)
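The failure mode described above can be sketched with a minimal stand-in for the proxy's request parsing. The field name "model" and the default value 'fastertransformer' come from this discussion; the function itself is illustrative, not FauxPilot's actual code.

```python
def resolve_model(request_body: dict) -> str:
    # The proxy reads the "model" key; if the client sends "model_name"
    # instead, that key is simply ignored and the default wins.
    return request_body.get("model", "fastertransformer")

# The client's original payload used the wrong key, so the request was
# routed to the (non-existent) fastertransformer model:
assert resolve_model({"model_name": "py-model"}) == "fastertransformer"
# With the correct key, the Python-backend model is selected:
assert resolve_model({"model": "py-model"}) == "py-model"
```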
#135 tries to address this (currently by showing a warning in the backend logs if user tries to access a non-existent model).
Additionally, the current code sets logprobs to 1 if not supplied (I feel this shouldn't be the case) and python backend doesn't yet support logprobs. #135 also addresses this.
Till the PR gets merged, the following curl request will work:
curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"], "logprobs": 0}' http://localhost:5000/v1/engines/codegen/completions
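Equivalently, the extra "logprobs": 0 field can be added to the JSON payload programmatically. A minimal sketch, with the field names taken from the curl command above:

```python
import json

# Request body for the py-model workaround: "model" selects the
# Python-backend model explicitly, and "logprobs": 0 avoids the
# unsupported-logprobs path until the PR is merged.
payload = {
    "model": "py-model",
    "prompt": "int hello(){",
    "max_tokens": 50,
    "temperature": 0.1,
    "stop": ["\n\n"],
    "logprobs": 0,
}
body = json.dumps(payload)
```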
That's odd. Could you provide us with the output of the command?
Even with your suggested change (i.e., "logprobs": 0), I still get an empty result:
- Log messages:
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"], "logprobs": 0}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-gdYgnWiPT4nsAOX4DXrLnHgDpTEAa", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ tree models/py-Salesforce-codegen-350M-multi/
models/py-Salesforce-codegen-350M-multi/
└── py-model
├── 1
│ ├── __pycache__
│ │ └── model.cpython-38.pyc
│ └── model.py
└── config.pbtxt
3 directories, 3 files
As stated when I posted this issue, I'm still looking for a reproducible way to get correct inference results with the Python backend.
Yepp. Thanks.