triton-inference-server / triton_cli

Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inference Server.

Error importing Llama3-8b

riosje opened this issue

Hello guys, I'm trying to import Llama3-8b using the CLI, but it is failing.
I would appreciate any advice on how to solve this issue.

Steps to reproduce

ENGINE_DEST_PATH=/models/models triton import -m llama-3-8b --backend tensorrtllm

ERROR

triton - INFO - Running 'python3 /usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py --model_dir /models/models/llama-3-8b/hf_download --output_dir /models/models/llama-3-8b/hf_download/converted_weights --dtype=float16'-00004.safetensors:  95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊         | 4.75G/5.00G [00:12<00:00, 408MB/s]
⠋ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B...[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
0.11.0.dev2024052100
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.80it/s]
⠹ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B...[05/21/2024-22:01:42] Some parameters are on the meta device device because they were offloaded to the cpu.
⠋ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B...Weights loaded. Total time: 00:00:02
⠴ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B...Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 447, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 125, in value
    v = self._regularize_value(v)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 154, in _regularize_value
    return torch_to_numpy(value)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 49, in torch_to_numpy
    return x.detach().cpu().numpy()
NotImplementedError: Cannot copy out of meta tensor; no data!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py", line 473, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py", line 465, in main
    convert_and_save_hf(args)
  File "/usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py", line 394, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py", line 419, in execute
    f(args, rank)
  File "/usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py", line 380, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 280, in from_hugging_face
    llama = convert.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1337, in from_hugging_face
    llama.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 449, in load
    raise RuntimeError(
RuntimeError: Encounter error 'Cannot copy out of meta tensor; no data!' for parameter 'transformer.layers.18.input_layernorm.weight'
triton - WARNING - TRT-LLM model creation failed: Command '['python3', '/usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py', '--model_dir', '/models/models/llama-3-8b/hf_download', '--output_dir', '/models/models/llama-3-8b/hf_download/converted_weights', '--dtype=float16']' returned non-zero exit status 1.. Cleaning up...
triton - INFO - Removing model llama-3-8b at /root/models/llama-3-8b...
triton - INFO - Removing model preprocessing at /root/models/preprocessing...
triton - INFO - Removing model tensorrt_llm at /root/models/tensorrt_llm...
triton - INFO - Removing model postprocessing at /root/models/postprocessing...
triton - ERROR - Command '['python3', '/usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py', '--model_dir', '/models/models/llama-3-8b/hf_download', '--output_dir', '/models/models/llama-3-8b/hf_download/converted_weights', '--dtype=float16']' returned non-zero exit status 1.

Hi @riosje,

Thanks for filing an issue! Can you share more details about your system, such as the amount of RAM it has, and the output of nvidia-smi prior to running this command?

I believe this error may occur when your GPU runs OOM during the import/build process.

Hi @rmccorm4, thanks for taking a look at this issue.

RAM

root@NVIDIA-A10:~/models# free -h
               total        used        free      shared  buff/cache   available
Mem:           216Gi       3.0Gi        77Gi        28Mi       135Gi       211Gi
Swap:             0B          0B          0B

GPU

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-12Q                 On  | 00000002:00:00.0 Off |                    0 |
| N/A   N/A    P8              N/A /  N/A |      0MiB / 12288MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

CPU

root@NVIDIA-A10:~/models# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  18
  On-line CPU(s) list:   0-17
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 74F3 24-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  9
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            6388.08
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclm
                         ulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid r
                         dseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm
Virtualization features:
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):
  L1d:                   288 KiB (9 instances)
  L1i:                   288 KiB (9 instances)
  L2:                    4.5 MiB (9 instances)
  L3:                    32 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-17
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Vulnerable: Safe RET, no microcode
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Hi @riosje, thanks for sharing the info.

Most likely, the 12GB your GPU has is not enough. As a ballpark figure, an FP16/BF16 8B model needs roughly 2 bytes × 8B parameters = 16GB of memory just for the weights; any additional memory then goes towards things like the KV cache. I believe the general recommendation for Llama3-8B is a GPU with at least 24GB of memory, as even a 16GB GPU (ex: V100 16GB) may run into this error.
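As a rough sketch of that arithmetic (weights only; the KV cache, activations, and runtime overhead come on top of this):

```python
# Rough weight-memory estimate: bytes_per_param * num_params.
# Ignores KV cache, activations, CUDA context, and TensorRT-LLM build overhead,
# so treat these numbers as lower bounds.
def weight_memory_gb(num_params_billions: float, bytes_per_param: float) -> float:
    return num_params_billions * bytes_per_param  # 1B params * 1 byte ~= 1 GB

for precision, nbytes in [("fp16/bf16", 2), ("int8/fp8", 1), ("int4", 0.5)]:
    print(f"{precision:>9}: ~{weight_memory_gb(8, nbytes):.0f} GB for an 8B model")

# fp16/bf16: ~16 GB -> does not fit a 12 GB A10
#  int8/fp8: ~ 8 GB -> weights alone could fit, with little headroom
#      int4: ~ 4 GB -> roughly what 4-bit quantized runtimes load
```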

Technically it may be possible to build an engine with 8-bit weights (INT8 for Ampere, FP8 for newer generations like Ada/Hopper) for an 8B parameter model with 12GB or 16GB of GPU memory, but I think that would require some work with the TensorRT-LLM team to make sure it works or is supported.

CC @whoisj @matthewkotila who ran into similar issues.

If you're willing to experiment (as I don't have a 12/16GB GPU handy to test right now), you may try adding --load_model_on_cpu when calling convert_checkpoint.py here, to see if that helps, since you have ample CPU memory on your system (a sketch of the modified call is at the end of this comment).

You can find the reference for that script/arg here.

Note you would need to clone and install the CLI from source via pip install /path/to/triton_cli after making your change locally.

Here's a related issue: NVIDIA/TensorRT-LLM#1440
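For illustration, here's a minimal sketch of the modified conversion call with that flag appended; the paths mirror the log above, and where exactly to add the flag inside the CLI source may differ:

```python
# Sketch only: the checkpoint-conversion command the CLI runs (see the log above),
# with --load_model_on_cpu appended so the HF weights are loaded into system RAM
# instead of GPU memory during conversion. Adjust paths for your setup.
import subprocess

convert_cmd = [
    "python3",
    "/usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py",
    "--model_dir", "/models/models/llama-3-8b/hf_download",
    "--output_dir", "/models/models/llama-3-8b/hf_download/converted_weights",
    "--dtype=float16",
    "--load_model_on_cpu",  # keep HF weights on CPU while converting
]
subprocess.run(convert_cmd, check=True)
```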

Hi @rmccorm4, thanks for the detailed instructions. There is still an error, but I'm pretty sure it's related to capacity, as you mention; specifically, the error is RuntimeError: No CUDA GPUs are available, and I already faced a similar one using NeMo.

I will give it a try on an NVIDIA A100 with 80GB, but I'm still curious how projects like https://github.com/ollama/ollama manage to run these kinds of models with such a small amount of resources. I am able to run llama3-8b on this VM using ollama (I understand they use a different format), but I still don't understand what the main difference and the drawbacks are.

I will try to follow the issues on NVIDIA/TensorRT-LLM.

ERROR

root@NVIDIA-A10:~/models# ENGINE_DEST_PATH=/models/models triton import -m llama-3-8b --backend tensorrtllm
triton - INFO - Known model source found for 'llama-3-8b': 'hf:meta-llama/Meta-Llama-3-8B'
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 177/177 [00:00<00:00, 2.50MB/s]
original/params.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 211/211 [00:00<00:00, 2.67MB/s]
model.safetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23.9k/23.9k [00:00<00:00, 168MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 727kB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 654/654 [00:00<00:00, 7.62MB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50.6k/50.6k [00:00<00:00, 112MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 45.1MB/s]
model-00004-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.17G/1.17G [00:04<00:00, 290MB/s]
model-00003-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:12<00:00, 393MB/s]
model-00002-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [00:12<00:00, 404MB/s]
model-00001-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [00:15<00:00, 319MB/s]
Fetching 11 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:15<00:00,  1.42s/it]
triton - INFO - Running 'python3 /usr/local/lib/python3.10/dist-packages/triton_cli/trt_llm/checkpoint_scripts/llama/convert_checkpoint.py --model_dir /models/models/llama-3-8b/hf_download --output_dir /models/models/llama-3-8b/hf_download/converted_weights --dtype=float16 --load_model_on_cpu' 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 4.97G/5.00G [00:12<00:00, 470MB/s]
⠏ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B...[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
0.11.0.dev2024052100
⠏ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B.../usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:628: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 13.74it/s]
⠸ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B...Weights loaded. Total time: 00:00:03
⠼ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B...Total time of converting checkpoints: 00:00:17
⠋ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B...triton - INFO - Running 'trtllm-build --checkpoint_dir=/models/models/llama-3-8b/hf_download/converted_weights --output_dir=/models/models/llama-3-8b --gpt_attention_plugin=float16 --gemm_plugin=float16'
⠙ Building TRT-LLM engine for meta-llama/Meta-Llama-3-8B...[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:628: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
[05/22/2024-02:31:35] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set gemm_plugin to float16.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set lookup_plugin to None.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set lora_plugin to None.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set moe_plugin to float16.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set context_fmha to True.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set remove_input_padding to True.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set multi_block_mode to False.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set enable_xqa to True.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set tokens_per_block to 64.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set multiple_profiles to False.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set paged_state to True.
[05/22/2024-02:31:35] [TRT-LLM] [I] Set streamingllm to False.
[05/22/2024-02:31:35] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/22/2024-02:31:35] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 449, in main
    cluster_config = infer_cluster_config()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 531, in infer_cluster_config
    device_name = torch.cuda.get_device_name(torch.cuda.current_device())
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 787, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 302, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
triton - WARNING - TRT-LLM model creation failed: Command '['trtllm-build', '--checkpoint_dir=/models/models/llama-3-8b/hf_download/converted_weights', '--output_dir=/models/models/llama-3-8b', '--gpt_attention_plugin=float16', '--gemm_plugin=float16']' returned non-zero exit status 1.. Cleaning up...
triton - INFO - Removing model llama-3-8b at /root/models/llama-3-8b...
triton - INFO - Removing model preprocessing at /root/models/preprocessing...
triton - INFO - Removing model tensorrt_llm at /root/models/tensorrt_llm...
triton - INFO - Removing model postprocessing at /root/models/postprocessing...
triton - ERROR - Command '['trtllm-build', '--checkpoint_dir=/models/models/llama-3-8b/hf_download/converted_weights', '--output_dir=/models/models/llama-3-8b', '--gpt_attention_plugin=float16', '--gemm_plugin=float16']' returned non-zero exit status 1.

I was able to run it on an NVIDIA A100 with vLLM; it takes ~14GB just to load the weights.
Thanks @rmccorm4 for the help. I'll keep researching how to quantize the model so it can run on the A10 VM.

I0522 16:00:55.753261 655 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x795e86000000' with size 268435456
I0522 16:00:55.755619 655 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0522 16:00:55.760179 655 model_lifecycle.cc:469] loading: llama-3-8b:1
I0522 16:00:58.821263 655 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: llama-3-8b_0 (GPU device 0)
INFO 05-22 16:01:00 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='meta-llama/Meta-Llama-3-8B', tokenizer='meta-llama/Meta-Llama-3-8B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-22 16:01:01 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 05-22 16:01:01 selector.py:25] Using XFormers backend.
INFO 05-22 16:01:02 weight_utils.py:177] Using model weights format ['*.safetensors']
INFO 05-22 16:01:04 model_runner.py:104] Loading model weights took 14.9595 GB
INFO 05-22 16:01:05 gpu_executor.py:94] # GPU blocks: 11672, # CPU blocks: 2048
INFO 05-22 16:01:07 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 05-22 16:01:07 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 05-22 16:01:10 model_runner.py:867] Graph capturing finished in 3 secs.
I0522 16:01:10.421177 655 model_lifecycle.cc:835] successfully loaded 'llama-3-8b'
I0522 16:01:10.421347 655 server.cc:607]

No problem! If you find anything useful to unblock you, please share it here for future readers 🙏
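If you do experiment with weight-only quantization on the A10, here's a rough, unverified sketch of the kind of conversion call to try. The flag names come from the TensorRT-LLM llama convert_checkpoint.py example and may differ between TRT-LLM versions, and the output directory is just a placeholder:

```python
# Unverified sketch: INT8 weight-only conversion, which roughly halves weight
# memory (~8 GB for an 8B model). Flag names follow the TensorRT-LLM llama
# example script and may change between versions; output_dir is hypothetical.
import subprocess

subprocess.run([
    "python3", "convert_checkpoint.py",
    "--model_dir", "/models/models/llama-3-8b/hf_download",
    "--output_dir", "/models/models/llama-3-8b/int8_weights",  # hypothetical path
    "--dtype=float16",
    "--use_weight_only",                # weight-only quantization
    "--weight_only_precision", "int8",  # INT8 weights (well-suited to Ampere)
], check=True)
```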