OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

How to run FA

MrigankRaman opened this issue · comments

Thanks for supporting FlashAttention (FA)! I was wondering where to find the code changes needed to use FA.

You just set the flash_attention parameter when creating the Generator:

generator = ctranslate2.Generator(model_dir, device="cuda", flash_attention=True)
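For context, here is a slightly fuller sketch of how that one-liner fits into a generation call. This is a hypothetical example: the `model_dir` path, the prompt tokens, and the `max_length` value are placeholders, and it assumes a converted CTranslate2 model and a CUDA build with FlashAttention support.

```python
# Hypothetical sketch: enabling FlashAttention in CTranslate2.
# Assumes a converted model at `model_dir` and a CUDA device;
# the model path and prompt tokens are placeholders.

def generate_with_flash_attention(model_dir, prompt_tokens):
    """Run batch generation with FlashAttention kernels enabled."""
    import ctranslate2  # imported lazily so the sketch loads without the package

    generator = ctranslate2.Generator(
        model_dir,
        device="cuda",          # FlashAttention requires a CUDA device
        flash_attention=True,   # enable FlashAttention kernels
    )
    # generate_batch expects a list of already-tokenized prompts
    results = generator.generate_batch([prompt_tokens], max_length=64)
    return results[0].sequences[0]
```

The only change relative to a default setup is the `flash_attention=True` argument; the rest of the generation API is used as usual.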

You can also search the documentation for "flash"; it will point you to how to use it with the Generator as well as other parts of CTranslate2.