OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

New Flash Attention Error

BBC-Esq opened this issue

Trying to test Microsoft's Orca-2 model (based on Llama 2), running in "int8", but I received an error I've never seen before:

https://huggingface.co/microsoft/Orca-2-7b

    generator = ctranslate2.Generator(model_dir, device="cuda", compute_type="int8", flash_attention=True, intra_threads=intra_threads)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: FlashAttention only support fp16 and bf16 data type

Never encountered this error before despite running multiple other models converted to "int8"...

As mentioned in the error message, flash attention only supports float16 or bfloat16 models. Orca-2-7b has type float32, which is not supported.

I thought that when I converted it from float32 into "int8" it changed the data type, though...I'm not understanding the distinction...

Did you try compute_type=int8_float16?

During conversion, you have to explicitly set the quantization type to int8_float16.
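
For reference, a minimal sketch of both steps using the Python converter API, assuming a CTranslate2 version that exposes the flash_attention option; the output directory name is just a placeholder:

    import ctranslate2
    from ctranslate2.converters import TransformersConverter

    # Convert the Hugging Face checkpoint: weights quantized to int8,
    # non-quantized tensors and activations kept in float16.
    output_dir = "orca-2-7b-ct2-int8_float16"  # placeholder path
    TransformersConverter("microsoft/Orca-2-7b").convert(
        output_dir, quantization="int8_float16"
    )

    # Load with a matching compute type; flash attention needs fp16/bf16 inputs.
    generator = ctranslate2.Generator(
        output_dir,
        device="cuda",
        compute_type="int8_float16",
        flash_attention=True,
    )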

I will definitely try that...I specifically converted it to "int8," which I thought addressed the float32 thing...will try and test...

In fact, for a quantized model, the quantization only reduces the type used when computing the linear layers (weight int8, scale float32, input float32 -> output float32). At the flash attention layer, however, the queries/keys/values passed in (the outputs of those linear layers) have to be float16 or bfloat16. That requirement has no relationship with the quantization mentioned above.
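
In other words (just a rough illustration in plain NumPy, not actual CTranslate2 code): an int8-quantized linear layer stores its weights as int8 with a float32 scale, but its inputs and outputs stay float32, so the queries/keys/values reaching flash attention are still float32 unless the model was converted with float16 activations.

    import numpy as np

    # Toy int8 weight-only quantization of one linear layer.
    w = np.random.randn(8, 8).astype(np.float32)     # original float32 weights
    scale = np.abs(w).max() / 127.0                  # float32 scale factor
    w_int8 = np.round(w / scale).astype(np.int8)     # stored weights: int8

    x = np.random.randn(2, 8).astype(np.float32)     # input activations: float32
    y = x @ (w_int8.astype(np.float32) * scale).T    # dequantize + matmul
    print(y.dtype)  # float32 -- the attention kernel would still see float32 Q/K/V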

I see, so there's no point in trying int8_float16 since, if I understand correctly, the "quantization" into ctranslate2 format wouldn't affect the layers pertinent to flash attention...those, apparently, remain in float32, which is incompatible with flash attention.

BTW, I'd love to learn about model architectures so I could actually understand the words you just said...did you get my message with links where I asked about a good starting point?

Asked and answered...closing this issue, but please let me know about resources for getting started with LLM architectures so I can create a converter.