P40 with USE_FLASH_ATTENTION=False
ltm920716 opened this issue · comments
System Info
Linux k8s-node2 6.5.0-41-generic #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
NVIDIA-SMI 535.171.04 Driver Version: 535.171.04
P40
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
sudo nerdctl run -it --gpus '"device=1"' -e USE_FLASH_ATTENTION=False -p 8080:80 -v /data/models:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id /data/Meta-Llama-3-8B-Instruct
Log:
2024-06-20T10:13:43.100144Z INFO text_generation_launcher: Args {
model_id: "/data/Meta-Llama-3-8B-Instruct",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "0.0.0.0",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
}
2024-06-20T10:13:43.100237Z INFO text_generation_launcher: Model supports up to 8192 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=8242 --max-total-tokens=8192 --max-input-tokens=8191`.
2024-06-20T10:13:43.100246Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-06-20T10:13:43.100250Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-06-20T10:13:43.100252Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-06-20T10:13:43.100257Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-20T10:13:43.100359Z INFO download: text_generation_launcher: Starting download process.
2024-06-20T10:13:46.674563Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-20T10:13:47.304041Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-20T10:13:47.304397Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-20T10:13:51.055838Z WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.
2024-06-20T10:13:52.009743Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 71, in serve
from text_generation_server import server
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 17, in <module>
from text_generation_server.models.pali_gemma import PaliGemmaBatch
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/pali_gemma.py", line 5, in <module>
from text_generation_server.models.vlm_causal_lm import (
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
from text_generation_server.models.flash_mistral import (
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 18, in <module>
from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
from text_generation_server.utils import paged_attention, flash_attn
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/flash_attn.py", line 13, in <module>
raise ImportError("`USE_FLASH_ATTENTION` is false.")
ImportError: `USE_FLASH_ATTENTION` is false.
rank=0
2024-06-20T10:13:52.107484Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-20T10:13:52.107511Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
ImportError: `USE_FLASH_ATTENTION` is false.
Expected behavior
The server should run normally on the P40 with flash attention disabled.
Help please, thanks!
Hey @ltm920716, thanks for the report!
I believe this was fixed by PR #1986 from @Narsil.
With that fix, you'll get this warning instead of a crash:
WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.
I believe the fix hasn't shipped in a tagged release yet, but you can use the Docker image built from that PR's commit:
ghcr.io/huggingface/text-generation-inference:sha-06edde9
The resulting command would be:
sudo nerdctl run -it --gpus '"device=1"' -e USE_FLASH_ATTENTION=False -p 8080:80 -v /data/models:/data ghcr.io/huggingface/text-generation-inference:sha-06edde9 --model-id /data/Meta-Llama-3-8B-Instruct
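For context, the crash above happens because the import chain (`server.py` → `pali_gemma.py` → … → `flash_attn.py`) pulls in the flash-attention modules unconditionally, so the `ImportError` raised when `USE_FLASH_ATTENTION` is false takes down the whole shard instead of triggering a fallback. Below is a minimal sketch of the guarded-import pattern that turns this into a warning; the function names are hypothetical stand-ins, not TGI's actual code:

```python
import os
import warnings

def load_flash_attention():
    """Mimic flash_attn.py: refuse to load when explicitly disabled."""
    if os.environ.get("USE_FLASH_ATTENTION", "true").lower() == "false":
        raise ImportError("`USE_FLASH_ATTENTION` is false.")
    # Stand-in for the real flash-attention kernel import.
    return "flash-attention kernels"

def load_attention_backend():
    """Guarded import: warn and fall back instead of crashing the shard."""
    try:
        return load_flash_attention()
    except ImportError as err:
        warnings.warn(
            f"Could not import Flash Attention enabled models: {err}"
        )
        # Stand-in for the non-flash (CausalLM) model classes.
        return "non-flash fallback"

if __name__ == "__main__":
    os.environ["USE_FLASH_ATTENTION"] = "False"
    print(load_attention_backend())  # falls back instead of raising
```

The point of the pattern is that the `ImportError` is caught at the one place that decides which backend to use, so disabling flash attention on pre-Ampere cards like the P40 degrades gracefully.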
Thanks a lot, I'll give it a try!