OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

New Flash Attention Error

BBC-Esq opened this issue

Trying to test Microsoft's Orca-2 model (based on Llama 2), running in "int8", but I received an error I've never seen before:

https://huggingface.co/microsoft/Orca-2-7b

    generator = ctranslate2.Generator(model_dir, device="cuda", compute_type="int8", flash_attention=True, intra_threads=intra_threads)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: FlashAttention only support fp16 and bf16 data type

Never encountered this error before despite running multiple other models converted to "int8"...

As mentioned in the error message, flash attention only supports float16 or bfloat16 models. Orca-2-7b has type float32, which is not supported.

I thought that when I converted it from float32 into "int8" it changed the data type, though...I'm not understanding the distinction...

Did you try compute_type=int8_float16?

During conversion, you have to explicitly set the quantization type to int8_float16.
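
For reference, a minimal sketch of both steps using the Python converter API, assuming a CTranslate2 version that exposes the flash_attention option; the output directory name is just a placeholder:

    import ctranslate2
    from ctranslate2.converters import TransformersConverter

    # Convert the Hugging Face checkpoint: weights quantized to int8,
    # non-quantized tensors and activations kept in float16.
    output_dir = "orca-2-7b-ct2-int8_float16"  # placeholder path
    TransformersConverter("microsoft/Orca-2-7b").convert(
        output_dir, quantization="int8_float16"
    )

    # Load with a matching compute type; flash attention needs fp16/bf16 inputs.
    generator = ctranslate2.Generator(
        output_dir,
        device="cuda",
        compute_type="int8_float16",
        flash_attention=True,
    )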

I will definitely try that...I specifically converted it to "int8," which I thought addressed the float32 thing...will try and test...

In fact, for a quantized model, the quantization only reduces the type used when computing the linear layers (weight int8, scale float32, input float32 -> output float32). At the flash attention layer, however, the queries/keys/values passed in (the outputs of those linear layers) have to be float16 or bfloat16. That requirement has no relationship with the quantization mentioned above.
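
In other words (just a rough illustration in plain NumPy, not actual CTranslate2 code): an int8-quantized linear layer stores its weights as int8 with a float32 scale, but its inputs and outputs stay float32, so the queries/keys/values reaching flash attention are still float32 unless the model was converted with float16 activations.

    import numpy as np

    # Toy int8 weight-only quantization of one linear layer.
    w = np.random.randn(8, 8).astype(np.float32)     # original float32 weights
    scale = np.abs(w).max() / 127.0                  # float32 scale factor
    w_int8 = np.round(w / scale).astype(np.int8)     # stored weights: int8

    x = np.random.randn(2, 8).astype(np.float32)     # input activations: float32
    y = x @ (w_int8.astype(np.float32) * scale).T    # dequantize + matmul
    print(y.dtype)  # float32 -- the attention kernel would still see float32 Q/K/V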

I see, so there's no point in trying int8_float16 since, if I understand correctly, the "quantization" into ctranslate2 format wouldn't affect the layers pertinent to flash attention...those, apparently, remain in float32, which is incompatible with flash attention.

BTW, I'd love to learn about model architectures so I could actually understand the words you just said...did you get my message with links where I asked about a good starting point?

Asked and answered...closing this issue, but please let me know about resources for getting started with LLM architectures so I can create a converter.