OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

How to run FA

MrigankRaman opened this issue · comments

Thanks for supporting FlashAttention (FA)! I was wondering where to find the code changes needed to use FA.

You just set the flash_attention parameter when creating the Generator:

generator = ctranslate2.Generator(model_dir, device="cuda", flash_attention=True)
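For context, here is a slightly fuller sketch of how that one-liner fits into a generation call. This is a hypothetical example: the `model_dir` path, the prompt tokens, and the `max_length` value are placeholders, and it assumes a converted CTranslate2 model and a CUDA build with FlashAttention support.

```python
# Hypothetical sketch: enabling FlashAttention in CTranslate2.
# Assumes a converted model at `model_dir` and a CUDA device;
# the model path and prompt tokens are placeholders.

def generate_with_flash_attention(model_dir, prompt_tokens):
    """Run batch generation with FlashAttention kernels enabled."""
    import ctranslate2  # imported lazily so the sketch loads without the package

    generator = ctranslate2.Generator(
        model_dir,
        device="cuda",          # FlashAttention requires a CUDA device
        flash_attention=True,   # enable FlashAttention kernels
    )
    # generate_batch expects a list of already-tokenized prompts
    results = generator.generate_batch([prompt_tokens], max_length=64)
    return results[0].sequences[0]
```

The only change relative to a default setup is the `flash_attention=True` argument; the rest of the generation API is used as usual.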

You can also search the documentation for "flash"; it will point you to how to use it with the Generator as well as other parts of CTranslate2.