fauxpilot / fauxpilot

FauxPilot - an open-source alternative to GitHub Copilot server

When using the Python backend, FauxPilot returns an inference error.

leemgs opened this issue · comments

Issue Description

When using the Python backend, FauxPilot returns an inference error, as shown below.

Expected Result

Source code should be generated as the result of the inference task, for example:

int hello( ){ return "Hello"; }

However, the FauxPilot server's Python backend does not produce the correct result.

How to Reproduce

  1. git clone the "fauxpilot/fauxpilot" repository
  2. ./setup.sh
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ ./setup.sh
.env already exists, do you want to delete .env and recreate it? [y/n] y
Deleting .env
Checking for curl ...
/usr/bin/curl
Checking for zstd ...
/home/invain/anaconda3/envs/deepspeed/bin/zstd
Checking for docker ...
/usr/bin/docker
Enter number of GPUs [1]: 1
External port for the API [5000]:
Address for Triton [triton]:
Port of Triton host [8001]:
Where do you want to save your models [/work/qtlab/fauxpilot/models]?
Choose your backend:
[1] FasterTransformer backend (faster, but limited models)
[2] Python backend (slower, but more models, and allows loading with int8)
Enter your choice [1]: 2
Models available:
[1] codegen-350M-mono (1GB total VRAM required; Python-only)
[2] codegen-350M-multi (1GB total VRAM required; multi-language)
[3] codegen-2B-mono (4GB total VRAM required; Python-only)
[4] codegen-2B-multi (4GB total VRAM required; multi-language)
Enter your choice [4]: 2
Do you want to share your huggingface cache between host and docker container? y/n [n]: 2
Do you want to use int8? y/n [y]: y
Config written to /work/qtlab/fauxpilot/models/py-Salesforce-codegen-350M-multi/py-model/config.pbtxt
docker: 'compose' is not a docker command.
See 'docker --help'
Config complete, do you want to run FauxPilot? [y/n] y
  3. ./launch.sh
......................... (output omitted) ......................
Attaching to fauxpilot-copilot_proxy-1, fauxpilot-triton-1
fauxpilot-triton-1         |
fauxpilot-triton-1         | =============================
fauxpilot-triton-1         | == Triton Inference Server ==
fauxpilot-triton-1         | =============================
fauxpilot-triton-1         |
fauxpilot-triton-1         | NVIDIA Release 22.06 (build 39726160)
fauxpilot-triton-1         | Triton Server Version 2.23.0
fauxpilot-triton-1         |
fauxpilot-triton-1         | Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
fauxpilot-triton-1         |
fauxpilot-triton-1         | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
fauxpilot-triton-1         |
fauxpilot-triton-1         | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
fauxpilot-triton-1         | By pulling and using the container, you accept the terms and conditions of this license:
fauxpilot-triton-1         | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
fauxpilot-copilot_proxy-1  | INFO:     Started server process [1]
fauxpilot-copilot_proxy-1  | INFO:     Waiting for application startup.
fauxpilot-copilot_proxy-1  | INFO:     Application startup complete.
fauxpilot-copilot_proxy-1  | INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
fauxpilot-triton-1         |
fauxpilot-triton-1         | I0102 07:53:07.342335 88 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fc820000000' with size 268435456
fauxpilot-triton-1         | I0102 07:53:07.342492 88 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
fauxpilot-triton-1         | I0102 07:53:07.343283 88 model_repository_manager.cc:1191] loading: py-model:1
fauxpilot-triton-1         | I0102 07:53:07.550677 88 python_be.cc:1774] TRITONBACKEND_ModelInstanceInitialize: py-model_0 (CPU device 0)
fauxpilot-triton-1         | Cuda available? True
fauxpilot-triton-1         | is_half: True, int8: True, auto_device_map: True
fauxpilot-triton-1         | Model Salesforce/codegen-350M-multi Loaded. Footprint: 545652736
fauxpilot-triton-1         | I0102 07:53:26.800350 88 model_repository_manager.cc:1345] successfully loaded 'py-model' version 1
fauxpilot-triton-1         | I0102 07:53:26.812968 88 server.cc:556]
fauxpilot-triton-1         | +------------------+------+
fauxpilot-triton-1         | | Repository Agent | Path |
fauxpilot-triton-1         | +------------------+------+
fauxpilot-triton-1         | +------------------+------+
fauxpilot-triton-1         |
fauxpilot-triton-1         | I0102 07:53:26.813087 88 server.cc:583]
fauxpilot-triton-1         | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1         | | Backend | Path                                                  | Config                                                                                                                                                         |
fauxpilot-triton-1         | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1         | | python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
fauxpilot-triton-1         | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1         |
fauxpilot-triton-1         | I0102 07:53:26.818532 88 server.cc:626]
fauxpilot-triton-1         | +----------+---------+--------+
fauxpilot-triton-1         | | Model    | Version | Status |
fauxpilot-triton-1         | +----------+---------+--------+
fauxpilot-triton-1         | | py-model | 1       | READY  |
fauxpilot-triton-1         | +----------+---------+--------+
fauxpilot-triton-1         |
fauxpilot-triton-1         | I0102 07:53:26.954685 88 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA TITAN Xp
fauxpilot-triton-1         | I0102 07:53:26.954864 88 tritonserver.cc:2159]
fauxpilot-triton-1         | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1         | | Option                           | Value                                                                                                                                                                                        |
fauxpilot-triton-1         | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1         | | server_id                        | triton                                                                                                                                                                                       |
fauxpilot-triton-1         | | server_version                   | 2.23.0                                                                                                                                                                                       |
fauxpilot-triton-1         | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
fauxpilot-triton-1         | | model_repository_path[0]         | /model                                                                                                                                                                                       |
fauxpilot-triton-1         | | model_control_mode               | MODE_NONE                                                                                                                                                                                    |
fauxpilot-triton-1         | | strict_model_config              | 1                                                                                                                                                                                            |
fauxpilot-triton-1         | | rate_limit                       | OFF                                                                                                                                                                                          |
fauxpilot-triton-1         | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
fauxpilot-triton-1         | | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
fauxpilot-triton-1         | | response_cache_byte_size         | 0                                                                                                                                                                                            |
fauxpilot-triton-1         | | min_supported_compute_capability | 6.0                                                                                                                                                                                          |
fauxpilot-triton-1         | | strict_readiness                 | 1                                                                                                                                                                                            |
fauxpilot-triton-1         | | exit_timeout                     | 30                                                                                                                                                                                           |
fauxpilot-triton-1         | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1         |
fauxpilot-triton-1         | I0102 07:53:26.963493 88 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
fauxpilot-triton-1         | I0102 07:53:26.982220 88 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
fauxpilot-triton-1         | I0102 07:53:27.026420 88 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
fauxpilot-copilot_proxy-1  | [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found
fauxpilot-copilot_proxy-1  | Returned completion in 19.145488739013672 ms
fauxpilot-copilot_proxy-1  | INFO:     2023-01-02 07:53:47,940 :: 172.18.0.1:60666 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
fauxpilot-copilot_proxy-1  | [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found
fauxpilot-copilot_proxy-1  | Returned completion in 39.86477851867676 ms
fauxpilot-copilot_proxy-1  | INFO:     2023-01-02 07:58:39,684 :: 172.18.0.1:41450 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
fauxpilot-copilot_proxy-1  | [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found
fauxpilot-copilot_proxy-1  | Returned completion in 1.708984375 ms
fauxpilot-copilot_proxy-1  | INFO:     2023-01-02 07:58:47,537 :: 172.18.0.1:50756 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
  4. Run a client
curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
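
For reference, an equivalent request in Python (a sketch that assumes the requests library and the default localhost:5000 port configured by setup.sh):

import requests

resp = requests.post(
    "http://localhost:5000/v1/engines/codegen/completions",
    json={
        "prompt": "int hello(){",
        "max_tokens": 50,
        "temperature": 0.1,
        "stop": ["\n\n"],
    },
)
print(resp.json())  # same payload shape as the curl response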

Further Information

  • I used the "Python backend" while running the "setup.sh" file.

I think you're sending the request to the FasterTransformer backend. That's what the error message says.

In the request body, set model to py-model. I guess I should document this somewhere.

Do you mean this?

However, the execution result with the REST API is as follows. :(

(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model_name":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-DpyzbjCEdBOptUzfmvlgHdMtxG4D7", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model-name":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-uDQI1w42r9rX4R6g5WTuZPfuaJbEo", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-d9YNLmAPJbVI5ZRczJmlrRIBpqza0", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$

The main issue here is that the key should be model, not model_name. Without it, the backend was falling back to the default model value of 'fastertransformer' (the server logs show this).
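
To illustrate the failure mode, here is a minimal sketch (hypothetical field names; the real copilot_proxy request schema may differ) of how an unknown key plus a silent default produces exactly this behavior:

# Sketch only: a pydantic request model with a silent default.
# An unknown key such as "model_name" is ignored, so "model" quietly
# falls back to 'fastertransformer' and Triton rejects the request.
from pydantic import BaseModel

class CompletionRequest(BaseModel):
    prompt: str
    model: str = "fastertransformer"  # default used when "model" is absent
    max_tokens: int = 16
    temperature: float = 1.0

req = CompletionRequest(**{"model_name": "py-model", "prompt": "int hello(){"})
print(req.model)  # prints 'fastertransformer', not 'py-model'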

#135 tries to address this (currently by showing a warning in the backend logs if a user tries to access a non-existent model).

Additionally, the current code sets logprobs to 1 if it is not supplied (I feel this shouldn't be the case), and the Python backend doesn't yet support logprobs. #135 also addresses this.
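
As a rough sketch of the kind of guard that could help (a hypothetical helper, not the actual code in #135), the proxy could strip the unsupported field before forwarding a request to the Python backend:

def sanitize_for_python_backend(body: dict) -> dict:
    # Sketch only: drop logprobs, which the Python backend does not
    # yet support, instead of injecting a default of 1.
    body = dict(body)  # copy so the caller's dict is untouched
    body.pop("logprobs", None)
    return body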

Until the PR gets merged, the following curl request will work:

curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"], "logprobs": 0}' http://localhost:5000/v1/engines/codegen/completions

That is odd. Could you provide us with the output of the command?
I encountered the following error with your suggested command (e.g., logprobs: 0):

  • Log messages:
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"], "logprobs": 0}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-gdYgnWiPT4nsAOX4DXrLnHgDpTEAa", "choices": []}(deepspeed) invain@mymate:/work/qtlab/fauxpilot$
(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ tree  models/py-Salesforce-codegen-350M-multi/
models/py-Salesforce-codegen-350M-multi/
└── py-model
    ├── 1
    │   ├── __pycache__
    │   │   └── model.cpython-38.pyc
    │   └── model.py
    └── config.pbtxt

3 directories, 3 files

As I posted in this issue, I'm looking for a reproducible way to get correct inference results with the Python backend.

@leemgs is this issue still occurring? I think #135 fixes it regardless of #137 (that one just changes the default for int8 during installation).

Closing for now, please reopen if it's still an issue.

Yepp. Thanks.