huggingface / text-generation-inference

Large Language Model Text Generation Inference

Home Page: http://hf.co/docs/text-generation-inference

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

muhammadbaqir1327 opened this issue

System Info

OS version: Ubuntu 22.04.3 LTS
Model: codellama/CodeLlama-13b-Instruct-hf

nvidia-smi output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:21:00.0 Off |                  Off |
|  0%   24C    P8              13W / 450W |     16MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:41:00.0 Off |                  Off |
|  0%   19C    P8               4W / 450W |    562MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=codellama/CodeLlama-13b-Instruct-hf
volume=$PWD/data

docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --quantize eetq

Expected behavior

TGI should run, since I am just copying the official script and running it.

Logs

2024-06-14T10:50:09.534843Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-06-14T10:50:11.548549Z INFO text_generation_launcher: Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383.
2024-06-14T10:50:11.548570Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-06-14T10:50:11.548575Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-06-14T10:50:11.548579Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-06-14T10:50:11.548582Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-14T10:50:11.548733Z INFO download: text_generation_launcher: Starting download process.
2024-06-14T10:50:14.670106Z INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-06-14T10:50:15.057638Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-14T10:50:15.058027Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-14T10:50:17.213782Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)

2024-06-14T10:50:17.861867Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
Traceback (most recent call last):

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/paged_attention.py", line 10, in
from vllm._C import cache_ops

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 71, in serve
from text_generation_server import server

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 17, in
from text_generation_server.models.pali_gemma import PaliGemmaBatch

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/pali_gemma.py", line 5, in
from text_generation_server.models.vlm_causal_lm import (

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 14, in
from text_generation_server.models.flash_mistral import (

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 18, in
from text_generation_server.models.custom_modeling.flash_mistral_modeling import (

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in
from text_generation_server.utils import paged_attention, flash_attn

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/paged_attention.py", line 13, in
raise ImportError(

ImportError: Could not import vllm paged attention. Make sure your installation is correct. Complete error: libcuda.so.1: cannot open shared object file: No such file or directory
rank=0
Error: ShardCannotStart
2024-06-14T10:50:17.960026Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-14T10:50:17.960047Z INFO text_generation_launcher: Shutting down shards

Hey @muhammadbaqir1327, thanks for your report! Do you get the issue even with the newest container?

It would change your command as follows:

model=codellama/CodeLlama-13b-Instruct-hf
volume=$PWD/data

- docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --quantize eetq
+ docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model --quantize eetq

Yes, I have already tried it
@LysandreJik

Ok, thanks! Let's try to see what's going on. This seems like a setup/CUDA issue, and I don't see you passing the GPUs to the docker image. Could you try adding a --gpus all flag?

So this command:

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model --quantize eetq 

Yes, I have also tried that command. It did not work for me either.

I also think it is related to the CUDA installation. I tried to run nvcc --version, but it said the command is not available.

You can see a similar issue here: https://forums.developer.nvidia.com/t/nvcc-command-not-found-and-unable-to-install-nvidia-cuda-toolkit-in-the-jetpack-6/275486

If nvcc --version doesn't work, I'm not sure the problem lies in TGI, unfortunately; it seems to be linked to an issue with your setup 😕
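
A quick way to check whether the NVIDIA Container Toolkit is exposing the host driver (and therefore libcuda.so.1) to containers at all is to run nvidia-smi inside a throwaway container. This is only a sanity check, and the nvidia/cuda image tag below is an assumption; any CUDA base image you can pull should work:

# If this fails, the NVIDIA Container Toolkit / driver setup on the host is the problem, not TGI.
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi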

Ok, thanks! Let's try to see what's going on. This seems like a setup/CUDA issue, and I don't see you passing the GPUs to the docker image. Could you try adding a --gpus all flag?

So this command:

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model --quantize eetq 

Running this command produced the following logs:

2024-06-14T15:27:13.274448Z  INFO text_generation_launcher: Args {
    model_id: "codellama/CodeLlama-13b-Instruct-hf",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: Some(
        Eetq,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "c9facdfbc83e",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-06-14T15:27:13.274521Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
2024-06-14T15:27:14.789216Z  INFO text_generation_launcher: Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383`.
2024-06-14T15:27:14.789235Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-06-14T15:27:14.789240Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-06-14T15:27:14.789244Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-06-14T15:27:14.789248Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-14T15:27:14.789408Z  INFO download: text_generation_launcher: Starting download process.
2024-06-14T15:27:17.653212Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-06-14T15:27:18.095517Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-14T15:27:18.095844Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-14T15:27:23.081981Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 560, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 79, in __init__
    weights = Weights(filenames, device, dtype, process_group=self.process_group)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 24, in __init__
    with safe_open(filename, framework="pytorch") as f:
safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization

2024-06-14T15:27:23.607129Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
  warnings.warn(
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 560, in get_model
    return FlashLlama(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 79, in __init__
    weights = Weights(filenames, device, dtype, process_group=self.process_group)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 24, in __init__
    with safe_open(filename, framework="pytorch") as f:

safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization
 rank=0
2024-06-14T15:27:23.706513Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-14T15:27:23.706536Z  INFO text_generation_launcher: Shutting down shards

One point I forgot to mention: while running docker run, I was getting an hf_transfer error. After searching for solutions, I fixed it by adding the HF_HUB_ENABLE_HF_TRANSFER=0 variable to the command, like this:

docker run --env HF_HUB_ENABLE_HF_TRANSFER=0 ...
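
For reference, combining that environment variable with the --gpus all flag suggested above, the full command would look roughly like this (a sketch assembled from the commands already in this thread, not a separately verified invocation):

model=codellama/CodeLlama-13b-Instruct-hf
volume=$PWD/data

# --gpus all exposes the host driver (libcuda.so.1) to the container;
# HF_HUB_ENABLE_HF_TRANSFER=0 disables the hf_transfer download backend.
docker run --gpus all --shm-size 1g -p 8080:80 --env HF_HUB_ENABLE_HF_TRANSFER=0 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model --quantize eetq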

@LysandreJik

Hmmm these are different problems. The safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization points to an error with the model you have loaded.

Is it possible for you to load it in Python directly using the safetensors library?

from huggingface_hub import hf_hub_download
from safetensors import safe_open

# The checkpoint from this thread; for sharded checkpoints the filename below
# may need to be one of the model-0000x-of-0000y.safetensors shards instead.
model_id = "codellama/CodeLlama-13b-Instruct-hf"
model = hf_hub_download(repo_id=model_id, filename="model.safetensors")

tensors = {}
with safe_open(model, framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

This issue:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
was due to not passing the --gpus flag in the docker command.
As discussed here: https://stackoverflow.com/questions/54249577/importerror-libcuda-so-1-cannot-open-shared-object-file#comment136144707_68587460

@LysandreJik

Glad you could get it resolved!