triton-inference-server / triton_cli

Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inference Server.


Support for "chat" variant of Llama-2-7b model

IAINATDBI opened this issue

Successfully ran inference with llama-2-7b. Can you confirm it's the llama2-7b-hf model that is pulled? From the logs it looks like it pulled that one from my cache.

Would the "chat" model not be better for a conversational inference experience? Can you configure what variant is pulled or is it "hard-coded" right now?

Cheers

Hi @IAINATDBI,

It's a bit messy right now and will be improved over time, but I'll try to answer based on how things work today.

Can you confirm it's the llama2-7b-hf model that is pulled?

The logs should show the HuggingFace ID being used where applicable; in this case it is specifically looking for meta-llama/Llama-2-7b-hf from HuggingFace. It will check your local HF cache to see whether that identifier has already been downloaded:

triton - INFO - Known model source found for 'llama-2-7b': 'hf:meta-llama/Llama-2-7b-hf'
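If you want to double-check what's already in your local cache yourself, huggingface_hub ships a small CLI for that (just an illustration, not what the Triton CLI runs internally):

huggingface-cli scan-cache

If the model was previously downloaded, meta-llama/Llama-2-7b-hf should show up in the listed repo IDs.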

Can you configure what variant is pulled or is it "hard-coded" right now?

The CLI tool itself is built with some of this extensibility in mind, exposing --source to let users specify custom or unofficially tested models that aren't in the current list of "known models".

To elaborate a bit:

# Short-hand for "known models":
triton import -m llama-2-7b --backend tensorrtllm

# This is the same internally as running:
triton import -m llama-2-7b --source hf:meta-llama/Llama-2-7b-hf --backend tensorrtllm

# If specifying a --source, the name of the -m/--model arg is arbitrary:
triton import -m my-model --source hf:meta-llama/Llama-2-7b-hf --backend tensorrtllm

However, building TRT-LLM models currently requires more special care than vLLM does. Because we need to know how to convert the model weights/checkpoints to a TRT-LLM compatible format, TRT-LLM support through the Triton CLI is restricted to a few well-known models for now.

For vLLM, you should generally be able to set up any model that vLLM supports. So for your chat example, this should work fine with vLLM:

triton import -m my-llama --source hf:meta-llama/Llama-2-7b-chat-hf --backend vllm
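Once it's imported, starting the server and sending a quick test prompt should look roughly like this (I'm writing the flags from memory, so double-check triton infer --help):

# In one terminal:
triton start

# In another terminal, once the server is up:
triton infer -m my-llama --prompt "What is machine learning?"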

For TRT-LLM, supporting a new model currently requires a code change. Since the chat variant should share most of its logic with the base llama model, I was able to quickly check whether it would work. For reference, this is the change:

diff --git a/src/triton_cli/repository.py b/src/triton_cli/repository.py
index bd120d0..bbf3902 100644
--- a/src/triton_cli/repository.py
+++ b/src/triton_cli/repository.py
@@ -79,6 +79,9 @@ SUPPORTED_TRT_LLM_BUILDERS = {
     "meta-llama/Llama-2-7b-hf": {
         "hf_allow_patterns": ["*.safetensors", "*.json"],
     },
+    "meta-llama/Llama-2-7b-chat-hf": {
+        "hf_allow_patterns": ["*.safetensors", "*.json"],
+    },
     "gpt2": {
         "hf_allow_patterns": ["*.safetensors", "*.json"],
         "hf_ignore_patterns": ["onnx/*"],
diff --git a/src/triton_cli/trt_llm/builder.py b/src/triton_cli/trt_llm/builder.py
index e01913a..074b236 100644
--- a/src/triton_cli/trt_llm/builder.py
+++ b/src/triton_cli/trt_llm/builder.py
@@ -3,6 +3,7 @@ import subprocess
 
 CHECKPOINT_MODULE_MAP = {
     "meta-llama/Llama-2-7b-hf": "llama",
+    "meta-llama/Llama-2-7b-chat-hf": "llama",
     "facebook/opt-125m": "opt",
 }

and then this worked for me:

triton -v import -m my-chat-model --source hf:meta-llama/Llama-2-7b-chat-hf --backend tensorrtllm

Overall, we're still figuring out some details around TRT-LLM: how to make it easy for users to bring custom models, and how to make it clear what code changes are needed to add support for new models. If you have any feedback, please let us know!

Hi @rmccorm4, @fpetrini15 - thank you for taking the time for the detailed discussion(s). I've been using Triton for a bit now and it's been performing well, so this CLI addition is really great to hear.

I'm keen to use the chat variant so that I can get responses that make sense. Looking forward to hearing about great things to come with Triton!

Cheers

Thanks @IAINATDBI, we've marked down a feature request (DLIS-6367) to expand the llama model support for TRT-LLM to be a bit more generic/flexible. I'll modify the title of the issue to better reflect that and keep it open.

Hi @rmccorm4 - I tried the vLLM approach described above and I'm getting an error during triton start. It looks like a CUDA memory issue, and it advises increasing gpu_memory_utilization:

Internal: ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (624)

I increased that to 95% in the model.json but it still fails with the same error. It mentions the KV cache size too.

I'm running on a dual RTX A6000 workstation (so 2 x 48GB), and we should be able to squeeze the 7B model in. I'm aware that you mentioned quantization during your talk, but I'm not sure what the parameter names etc. are for that option. There was a hint that this feature might be automatic? With quantization I can normally load the 70B variant without issues. Thank you for your help with this.

cheers

As a wild guess I tried "quantization":"gptq" in the model.json and it's now looking for a config file?

Cheers

Here's a shot of nvidia-smi just prior to the server shutting down.
[nvidia-smi screenshot]

Hey @IAINATDBI, can you open a separate issue to discuss the memory issues and quantization further if needed?

To summarize a few quick points:

  1. A 7B model should fit in GPU memory with no problem on a 48GB card, even without quantization, so I'll need more details on the model, config, etc. to reproduce this.
  2. For taking advantage of 2x GPUs, or doing multi-GPU inference in general, I believe this is fully controlled by the vLLM arguments provided in the config. Our model.json is just a representation of vLLM's AsyncEngineArgs, so you probably need to try setting something like tensor_parallel_size to 2 for 2 GPUs (see the sketch after this list). I haven't had a chance to test this myself yet. You can see how we initialize the vLLM engine from these args here. CC @oandreeva-nv for visibility.
  3. For the quantization question, this is again just a vLLM detail that we pass through to the vLLM APIs, so the documentation on the vLLM side should apply here. I believe there are also pre-quantized checkpoints hosted on HuggingFace for popular models like Llama 2 that may work directly as well. The quantization described in our GTC talk was based primarily around TensorRT-LLM.
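For example, a model.json along these lines might be worth trying. This is untested on my end; the field names simply mirror vLLM's AsyncEngineArgs, and the model ID is the chat variant from earlier in the thread:

{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.9,
    "tensor_parallel_size": 2
}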

Thanks @rmccorm4, I'll raise any further quantization questions separately.

However, for this thread: the only way I can get Triton to start without failing on CUDA memory issues (even with gpt2; see the nvidia-smi screenshot above, with the very memory-hungry stub process) is to launch the Docker container, install the Triton CLI, load the model, start Triton, and then docker exec into the container (per #50) to issue an infer command, roughly as sketched below. I've even tried it on a different machine.
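For reference, the sequence looks roughly like this (container image and CLI install step omitted, and the model name is just an example):

# Inside the running Triton container, after installing the CLI:
triton import -m gpt2 --backend vllm
triton start

# From the host, in a second terminal (per #50):
docker exec -ti <container-id> bash
triton infer -m gpt2 --prompt "Hello"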

Hope this helps.

Hi @IAINATDBI,

We added support for llama-2-7b-chat, as well as llama-3-8b and llama-3-8b-instruct, for both vLLM and TRT-LLM in the latest release associated with Triton 24.04. Please check it out: https://github.com/triton-inference-server/triton_cli/releases/tag/0.0.7.

If you're running into other problems, please raise a separate issue. Closing this one, since the "chat" variant support requested in the title has been added.