OpenNMT / CTranslate2

Fast inference engine for Transformer models

BENCHmarking new flash attention!

BBC-Esq opened this issue · comments

Congrats on Flash Attention in the latest version, or to be precise, on having your storage limit increased so you could upload the release from weeks ago. Here are some benchmarks for your edification!

Testing Setup

  • RTX 4090, Windows 10
  • Used the generate_batch method with various beam sizes. The model was unloaded and reloaded each time - i.e. no caching of static prompt. Tokens per second, however, is only calculated on compute time not model loading time.
  • Quantization int8 used. I DID NOT use float16/bfloat16 and then quant at runtime. Rather, I converted to int8 before runtime in order to remove that variable.
  • Tested 10 times using flash attention and not. The numbers are the averages of those 10 runs.
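
The setup above can be sketched roughly as follows. This is a minimal sketch, not the script used for these numbers: it assumes CTranslate2 4.2 or later (where the `flash_attention` constructor option is available), and `load_generator`, `ct2_generate`, `tokens_per_second`, and `benchmark` are hypothetical helper names. Unlike the actual test, it does not unload and reload the model between runs; it only keeps model loading outside the timed region.

```python
import statistics
import time


def load_generator(model_dir, use_flash_attention):
    """Load a pre-converted int8 model. Imported lazily so the helpers below run without CTranslate2."""
    import ctranslate2  # assumption: CTranslate2 >= 4.2 built with CUDA support

    return ctranslate2.Generator(
        model_dir,
        device="cuda",
        compute_type="int8",
        flash_attention=use_flash_attention,
    )


def ct2_generate(generator, max_length=512):
    """Wrap generate_batch so it returns the generated token ids of the best hypothesis."""
    def generate(prompt_tokens, beam_size):
        results = generator.generate_batch(
            [prompt_tokens],
            beam_size=beam_size,
            max_length=max_length,  # hypothetical cap, not taken from the original test
            include_prompt_in_result=False,
        )
        return results[0].sequences_ids[0]

    return generate


def tokens_per_second(generate, prompt_tokens, beam_size, clock=time.perf_counter):
    """Time a single generation call (compute only) and return generated tokens per second."""
    start = clock()
    output_ids = generate(prompt_tokens, beam_size)
    elapsed = clock() - start
    return len(output_ids) / elapsed


def benchmark(generate, prompt_tokens, beam_sizes=(1, 2, 3, 4, 5), runs=10,
              clock=time.perf_counter):
    """Average tokens per second over `runs` repetitions for each beam size."""
    return {
        beam: statistics.mean(
            tokens_per_second(generate, prompt_tokens, beam, clock) for _ in range(runs)
        )
        for beam in beam_sizes
    }
```

To compare the two modes, one would call `benchmark(ct2_generate(load_generator(path, True)), tokens)` and again with `False`, where `path` and `tokens` are your own converted model directory and tokenized prompt.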


  • Small but noticeable increase in tokens per second at beam size = 1. Disappointing at first...
  • HOWEVER, as beam size increases, using flash attention DRAMATICALLY mitigates the typical decrease in tokens per second associated with more beams.
  • MOREOVER, using flash attention DRAMATICALLY mitigates the increase in VRAM usage associated with more beams.


  • This identical SOLAR model in GGUF Q8_0 sits at 40.92 tokens per second using 14,541.75 MB of VRAM. It's been clear that GGUF always uses more VRAM than the comparable int8 model. Moreover, GGUF remains faster...SORT OF...
  • GGUF does not have a beam size parameter. Transformers + bitsandbytes does (but I'll leave that for another day).
  • At beam size 1 and using flash attention, the tokens per second increased by roughly 18% and the required VRAM decreased by approximately 4%. However, the benefits of flash attention are more pronounced as you increase beam size.
  • THEREFORE, you can get a higher degree of accuracy with a smaller model by increasing beam size.

For example:

In my RAG application I love the Zephyr 3b model. If I put beams at 5, it religiously produces the same quality of results as my other favorite for RAG: neural-chat 7B (using a single beam). However, using 5 beams with the Zephyr 3b model puts its VRAM usage to MORE than the neural-chat model (again, which uses 1 beam). Therefore, there is no point in using Zephyr 3b since I can simply use neural-chat 7b.

Now, however, I can use Zephyr 3B with 5 beams to achieve the same quality as neural-chat 7b (with 1 beam), but with less VRAM.

In conclusion, GGUF is "sort of" faster because it's only faster if you disregard CTranslate2's ability to boost the quality of smaller models one tier up by combining a higher beam size with flash attention. Moving down to the smaller model dramatically reduces VRAM usage and will inevitably increase tokens per second as well.

Thus, it's not appropriate to do a direct comparison of int8 and GGUF Q8_0 as of late. Here's a graph for you visual people. Feel free to contact me with any questions! Also, this is only a hobby so I'm always open to critiques.


Completed graphs below. A few highlights:

  1. For unknown reasons, neural chat without flash attention has an unexpectedly high tokens per second at beam size 4, although its VRAM usage still climbs as expected. I tested and re-tested this...

  2. Llama2 had - by far - the smallest difference in VRAM usage between using flash attention and not. I tested and re-tested...Additional testing of llama2 models is needed to determine if this is uniquely due to the llama2 architecture. If so, it might impact similar models like Phi-3.

  3. Flash attention actually makes llama2 7b produce FEWER tokens per second at beam size 2, but this is still within the margin of error in my experience.

  4. There was a token per second benefit for Mistral at beam size 2-4 but very little at 1 and 5...despite VRAM usage diverging as expected when using/not using flash attention. This could be a unique characteristic of Mistral and indicate ideal beam sizes for it...



Hello, what is the average seq_length in your benchmark? Flash attention only performs better with long prompts.

Great question! I don't know if by seq_length you mean my question to the LLM or the LLM's response. For purposes of my answer I'm assuming the LLM's response, but let me know.

First, I need to clarify my testing methodology. My specific use-case is RAG-related. For my test, I'm feeding the model the following question:

Only base your answer to the following question on the provided context/contexts accompanying this question. If you cannot answer based on the included context/contexts alone, please state so. || My question is: What is the deadline to hold a preliminary protective hearing in a dependency case? || And here are the relevant contexts to base your answer off of: || Context 1 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf | § 6:21. Time limits—Preliminary protective hearing, Ga. Juv. Prac. & Proc. § 6:21 © 2022 Thomson Reuters. No claim to original U.S. Government Works. 1 Ga. Juv. Prac. & Proc. § 6:21 Georgia Juvenile Practice and Procedure with Forms | August 2022 Update Mark H. Murphy Chapter 6. Dependency Proceedings § 6:21. Time limits—Preliminary protective hearing If a child alleged to be dependent is removed from her home and is not returned home, the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care.1 If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.2 || Context 2 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf | § 6:35. Preliminary protective hearing—Dependency petition..., Ga. Juv. Prac. & Proc.... © 2022 Thomson Reuters. No claim to original U.S. Government Works. 1 Ga. Juv. Prac. & Proc. § 6:35 Georgia Juvenile Practice and Procedure with Forms | August 2022 Update Mark H. Murphy Chapter 6. Dependency Proceedings § 6:35. Preliminary protective hearing—Dependency petition if child not returned home Under O.C.G.A. § 15-11-145(g), if the child is not released at the preliminary protective hearing, a petition for dependency shall be made and presented to the court within five days of such hearing. Westlaw. © 2022 Thomson Reuters. No Claim to Orig. U.S. Govt. Works. End of Document © 2022 Thomson Reuters. 
No claim to original U.S. Government Works. || Context 3 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf | § 6:29. Preliminary protective hearing, Ga. Juv. Prac. & Proc. § 6:29 © 2022 Thomson Reuters. No claim to original U.S. Government Works. 1 Ga. Juv. Prac. & Proc. § 6:29 Georgia Juvenile Practice and Procedure with Forms | August 2022 Update Mark H. Murphy Chapter 6. Dependency Proceedings § 6:29. Preliminary protective hearing The preliminary protective hearing is essentially a probable cause hearing designed to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship. It must be held promptly after a child is removed from the home and is intended to provide due process to the parties involved. The hearing must occur no later than 72 hours after.

The perfect answer to this question must mention two things (and two things only):
(1) the deadline to hold a preliminary protective hearing is within 72 hours and (2) if that ends on a weekend or holiday, it's the next day.

I consider all responses that address these (and only these) two points as perfect in quality even if different models answer slightly differently. I.e., I'm not judging on commas or syntax, just substance.

However, if a model's response includes extra information that does not directly answer my question, even if the extra information is accurate based on the contexts, there's a slight deduction for "quality." These kinds of responses are not "incorrect" (they give no false information), nor are they "incomplete" (for example, by mentioning 72 hours but not what happens if the deadline falls on a weekend or legal holiday). Obviously, there's a serious reduction if factually false info is included. If any models produce factually inaccurate information I remove them from my testing altogether. For example, if a GGUF Q_2 produces false information I'll remove it but still keep the Q8_0 quant as long as it doesn't...
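
The rubric above is applied by hand, but a rough first pass could be automated along these lines. This is only a keyword-based sketch: `grade_response` and its marker list are hypothetical, tied to this one test question, and it cannot detect factually false statements, which still need a human read.

```python
def grade_response(text):
    """Rough keyword check: both required points present, plus a flag for extra material."""
    lowered = text.lower()

    # Point 1: the 72-hour deadline.
    has_deadline = "72 hours" in lowered or "72-hour" in lowered
    # Point 2: rollover to the next day when the deadline lands on a weekend or holiday.
    has_rollover = "next day" in lowered and ("weekend" in lowered or "holiday" in lowered)

    # Hypothetical markers of accurate-but-unrequested detail drawn from the contexts.
    extra_markers = ("five days", "probable cause", "due process", "judicial oversight")
    extras = [marker for marker in extra_markers if marker in lowered]

    if not (has_deadline and has_rollover):
        return "incomplete"
    if extras:
        return "verbose"  # complete, but with a slight quality deduction
    return "perfect"
```

For example, a response that only states the 72-hour deadline grades as "incomplete", while one that also adds the five-day petition rule on top of both required points grades as "verbose".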

With that being said, and to ACTUALLY answer your question, "it depends."

I'll give you some sample responses and you tell me if they're long form or short form (and hence not fair to test FA with). NOTE: All of these were produced by GGUF models since I haven't updated my ctranslate2 script to also save all to a single .txt file, but ctranslate2's responses are basically the same. It's just easier to paste some GGUF responses for you right now...


The deadline for holding a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the court has already held such hearing on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday. The deadline for filing a petition for dependency in Georgia under O.C.G.A. § 15-11-145(g) is within five days of the preliminary protective hearing. This information is based on the provided context/contexts, and it outlines the time frame for holding a preliminary protective hearing in a dependency case in Georgia.

The deadline for holding a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the court has already held such hearing and the time period has expired on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday. The deadline for holding a preliminary protective hearing in a dependency case can vary depending on the specific provisions of the Georgia Juvenile Law Practice and Procedure with Forms document, as well as any relevant court rules or procedures. It's important to note that these deadlines are based on the specific requirements outlined in the provided context, which may differ from one jurisdiction to another.

The deadline for holding a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care or when a petition for dependency is filed. This time limit applies regardless of whether the child is released at the initial protective hearing or not. The purpose of the preliminary hearing is to provide judicial oversight and ensure that state intervention into the parent-child relationship is conducted in accordance with the constitutional protections provided by the court. If the child is not returned home, a petition for dependency may be filed within five days after the initial protective hearing. This process is designed to provide prompt judicial oversight of state intervention and protect the rights of the parties involved in the dependency case.


The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care. This time limit applies even if the 72-hour period expires on a weekend or legal holiday, and the court must hold the hearing on the next day which is not a weekend or legal holiday. The purpose of the preliminary protective hearing is to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship and ensure due process for all parties involved. If the child is not released at the preliminary protective hearing, a petition for dependency shall be made and presented to the court within five days of such hearing.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the time period expires on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday. The deadline to file a petition for dependency if the child is not returned home under O.C.G.A. § 15-11-145(g) is within five days of the preliminary protective hearing.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care. This time limit applies even if the 72-hour period expires on a weekend or legal holiday, and the court must hold the hearing on the next day that is not a weekend or legal holiday. If the child is not released at the preliminary protective hearing, a petition for dependency shall be made and presented to the court within five days of such hearing. The purpose of the preliminary protective hearing is to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship and ensure due process to the parties involved.


The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the court extends the time period on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the court extends the time period on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the court extends the time period on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.


The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the 72-hour time period expires on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care. This time limit applies even if the 72-hour period expires on a weekend or legal holiday, and the court must hold the hearing on the next day that is not a weekend or legal holiday. The purpose of the preliminary protective hearing is to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship and ensure due process for all parties involved. If the child is not released at the preliminary protective hearing, a petition for dependency must be made and presented within five days of such hearing.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the 72-hour time period expires on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.


The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the 72-hour time period expires on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care. This time limit applies even if the 72-hour period expires on a weekend or legal holiday, and the court must hold the hearing on the next day which is not a weekend or legal holiday. The purpose of the preliminary protective hearing is to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship and ensure due process for all parties involved.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the 72-hour time period expires on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.


The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care. This time limit applies even if the 72-hour period expires on a weekend or legal holiday, and the court must hold the hearing on the next day that is not a weekend or legal holiday. The purpose of the preliminary protective hearing is to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship and ensure due process for all parties involved.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care. This time limit applies even if the 72-hour period expires on a weekend or legal holiday, and the court must hold the hearing on the next day that is not a weekend or legal holiday. The purpose of the preliminary protective hearing is to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship and ensure due process for all parties involved.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care. This time limit applies even if the 72-hour period expires on a weekend or legal holiday, and the court must hold the hearing on the next day that is not a weekend or legal holiday. The purpose of the preliminary protective hearing is to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship and ensure due process for all parties involved.


The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care. This time limit applies even if the 72-hour period expires on a weekend or legal holiday, and the court must hold the hearing on the next day that is not a weekend or legal holiday. The purpose of the preliminary protective hearing is to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship and ensure due process for all parties involved. If the child is not released at the preliminary protective hearing, a petition for dependency must be made and presented within five days of such hearing.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the 72-hour time period expires on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the 72-hour time period expires on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.


The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care. This time limit applies even if the 72-hour period expires on a weekend or legal holiday, and the court must hold the hearing on the next day that is not a weekend or legal holiday. The purpose of the preliminary protective hearing is to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship and ensure due process for all parties involved.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the 72-hour time period expires on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.

The deadline to hold a preliminary protective hearing in a dependency case in Georgia is 72 hours after the child is placed in foster care, unless the 72-hour time period expires on a weekend or legal holiday. If the 72-hour time period expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.


What is the deadline to hold a preliminary protective hearing in a dependency case?

The deadline to hold a preliminary protective hearing in a dependency case is 72 hours after the child is removed from her home and returned home.

The deadline to hold a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.


The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.


The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.


The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.


The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.


The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.


The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.


The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

The deadline for holding a preliminary protective hearing in a dependency case is 72 hours after the child is placed in foster care.

 The deadline to hold a preliminary protective hearing in a dependency case in Georgia, according to the provided contexts, is within 72 hours after the child is placed in foster care. If this time period expires on a weekend or legal holiday, the court must hold the hearing on the next day that isn't a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case in Georgia, according to the provided contexts, is within 72 hours after the child is placed in foster care. If this time period expires on a weekend or legal holiday, the court must hold the hearing on the next day that isn't a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case in Georgia, according to the provided contexts, is within 72 hours after the child is placed in foster care. If this time period expires on a weekend or legal holiday, the court must hold the hearing on the next day that isn't a weekend or legal holiday.


 The deadline to hold a preliminary protective hearing in a dependency case is that it must be held promptly and not later than 72 hours after the child is placed in foster care. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case is that it must be held promptly and not later than 72 hours after the child is placed in foster care. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case is that it must be held promptly and not later than 72 hours after the child is placed in foster care. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.


 In a dependency case, the deadline to hold a preliminary protective hearing is that it must be held promptly and not later than 72 hours after the child is placed in foster care if they are not returned home. If the 72-hour time period expires on a weekend or legal holiday, then the court has to hold the hearing on the next day which isn't a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case, as mentioned in the contexts provided, is that it must be held promptly and not later than 72 hours after the child is placed in foster care if they are not returned home. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.

 In a dependency case, the deadline to hold a preliminary protective hearing is that it must be held promptly and not later than 72 hours after the child is placed in foster care. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.


 The deadline to hold a preliminary protective hearing in a dependency case, as per Georgia Juvenile Law Practice and Procedure - August 2022, is that it must be held promptly and not later than 72 hours after the child is placed in foster care. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case, as per Georgia Juvenile Law Practice and Procedure - August 2022, is that it must be held promptly and not later than 72 hours after the child is placed in foster care. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case, as per Georgia Juvenile Law Practice and Procedure - August 2022, is that it must be held promptly and not later than 72 hours after the child is placed in foster care. If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.


 The deadline to hold a preliminary protective hearing in a dependency case, according to the provided contexts, is within 72 hours after the child is placed in foster care, unless it expires on a weekend or legal holiday, then the court must hold the hearing on the next day which is not a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case, according to the provided contexts, is within 72 hours after the child is placed in foster care, unless it falls on a weekend or legal holiday, in which case the court must hold the hearing on the next day that isn't a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case, according to the provided contexts, is within 72 hours after the child is placed in foster care, unless it falls on a weekend or legal holiday, in which case it must be held on the next day that isn't a weekend or legal holiday.


 The deadline to hold a preliminary protective hearing in a dependency case in Georgia, according to the provided contexts, is within 72 hours after the child is placed in foster care, unless it falls on a weekend or legal holiday, then it must be held on the next day which isn't a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case, according to the provided contexts, is within 72 hours after the child is placed in foster care. If this time period expires on a weekend or legal holiday, the court must hold the hearing on the next day which is not a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case in Georgia, according to the provided contexts, is within 72 hours after the child is placed in foster care. If this time period expires on a weekend or legal holiday, the court must hold the hearing on the next day that isn't a weekend or legal holiday.


 In Georgia, for a dependency case, the preliminary protective hearing must be held within 72 hours after the child is placed in foster care. If this time frame expires on a weekend or legal holiday, the court is required to hold the hearing on the next day that isn't a weekend or legal holiday.

 The deadline to hold a preliminary protective hearing in a dependency case in Georgia, according to the provided contexts, is within 72 hours after the child is placed in foster care. If this time frame expires on a weekend or legal holiday, the court must hold the hearing on the next day that isn't a weekend or legal holiday.

 In Georgia, for a dependency case, the preliminary protective hearing must be held within 72 hours after the child is placed in foster care. If this time period expires on a weekend or legal holiday, the court will hold the hearing on the next day that isn't a weekend or legal holiday.

Again, you tell me if these are long form or short form...or if that actually refers to my question to the LLM...

Apart from RAG applications, it's virtually impossible to test for "quality" because it's so diverse and subjective. One short story written by neural-chat might be great but the next one crap because the sampling took a wrong turn somewhere...or you just like the writing style of one or the other...With RAG and definitive facts, you can assess whether the answer is complete or not...

I'm not interested in "long-form" testing if you're referring to "write me a short story" and then calculating the tokens per second. I apologize if I wasn't as clear about this up front, but my testing is specifically geared towards RAG APPLICATIONS.

In conclusion, I will say that there are always runtime variations and it's impossible to control all variables. Even if I shut down and restart my computer from scratch, one reboot will show VRAM usage as being .1 GB higher/lower...stuff like that.

Please anyone seeing my testing, take them with a grain of salt or as a rule of thumb regarding general trends, and MAKE SURE TO TEST YOURSELF!
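For anyone who wants to test on their own hardware, a minimal timing sketch might look like the following. This is not the script used for the numbers in this thread: `measure_tps` and `make_ct2_generate` are hypothetical helper names, the model directory is a placeholder, and the sketch assumes a CTranslate2 version whose `Generator` accepts a `flash_attention` argument. As in the testing setup described earlier, only generation time is counted, not model loading.

```python
import time
from statistics import mean


def measure_tps(generate_fn, runs=10):
    """Average tokens per second over several runs.

    generate_fn is any zero-argument callable that returns the list of
    generated tokens. Only time spent inside generate_fn is counted, so
    the model should already be loaded before calling this.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate_fn()
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return mean(rates)


def make_ct2_generate(model_dir, prompt_tokens, beam_size, flash_attention):
    """Build a generate_fn backed by CTranslate2 (assumed setup, adjust as needed)."""
    import ctranslate2  # imported lazily so measure_tps works without it

    generator = ctranslate2.Generator(
        model_dir, device="cuda", flash_attention=flash_attention
    )

    def generate_fn():
        results = generator.generate_batch(
            [prompt_tokens],
            beam_size=beam_size,
            max_length=512,
            include_prompt_in_result=False,
        )
        # Best hypothesis only; its length is the generated token count.
        return results[0].sequences[0]

    return generate_fn
```

Usage would be along the lines of `measure_tps(make_ct2_generate("path/to/model", tokens, beam_size=3, flash_attention=True))`, where `tokens` comes from the model's own tokenizer. Run it once with `flash_attention=True` and once with `False` to compare.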




P.S. I'd love to hear your expert opinion regarding how - long form or short form - I'm still seeing a huge benefit from using a higher beam size. That, to me, was the most interesting finding!

I tested all of the models yet again this morning, one right after the other, without opening or closing other programs, etc. Here are the results...They seem to confirm yet again the unique behavior of llama2-based models and the new flash attention. EVEN LLAMA3 did not exhibit this behavior...

| Model | Beam Size | Tokens/s (Flash Attention) | VRAM MB (Flash Attention) | Tokens/s (No Flash Attention) | VRAM MB (No Flash Attention) |
|---|---|---|---|---|---|
| SOLAR-10.7B-Instruct-v1.0-ct2-int8 | 1 | 25.60 | 14004.80 | 23.79 | 14414.86 |
| | 2 | 22.43 | 14037.98 | 19.23 | 15660.13 |
| | 3 | 21.14 | 14359.13 | 18.40 | 16636.70 |
| | 4 | 19.83 | 14649.78 | 14.97 | 17790.98 |
| | 5 | 21.65 | 14910.16 | 14.92 | 18824.00 |
| Llama-2-7b-chat-hf-ct2-int8 | 1 | 40.19 | 10402.61 | 37.99 | 10484.38 |
| | 2 | 29.86 | 11147.03 | 31.41 | 11156.53 |
| | 3 | 30.74 | 11986.32 | 26.77 | 12104.95 |
| | 4 | 29.19 | 12637.21 | 25.26 | 12919.88 |
| | 5 | 28.04 | 13406.88 | 22.40 | 13688.41 |
| neural-chat-7b-v3-3-ct2-int8 | 1 | 39.22 | 10415.16 | 34.63 | 10809.51 |
| | 2 | 34.69 | 10408.29 | 28.87 | 11542.85 |
| | 3 | 34.22 | 10501.13 | 27.47 | 12279.13 |
| | 4 | 33.29 | 10714.73 | 25.65 | 13071.67 |
| | 5 | 32.41 | 10958.45 | 23.43 | 13703.79 |
| Meta-Llama-3-8B-Instruct-ct2-int8 | 1 | 35.11 | 11131.16 | 32.78 | 11558.52 |
| | 2 | 31.20 | 11139.53 | 27.87 | 12277.64 |
| | 3 | 29.07 | 11197.97 | 25.22 | 12900.71 |
| | 4 | 27.31 | 11421.54 | 22.88 | 13541.64 |
| | 5 | 26.02 | 11549.96 | 21.00 | 14071.92 |
| Mistral-7B-Instruct-v0.2-ct2-int8 | 1 | 38.85 | 10423.97 | 35.98 | 10793.45 |
| | 2 | 33.74 | 10449.46 | 24.33 | 11667.12 |
| | 3 | 28.37 | 10610.54 | 25.03 | 12476.32 |
| | 4 | 26.03 | 10838.03 | 21.04 | 13130.21 |
| | 5 | 25.74 | 11028.20 | 21.01 | 13869.22 |
| gemma-1.1-2b-it-ct2-int8 | 1 | 60.94 | 5508.79 | 56.16 | 5632.02 |
| | 2 | 47.82 | 5579.34 | 33.41 | 5745.16 |
| | 3 | 39.54 | 5550.84 | 27.72 | 5955.45 |
| | 4 | 32.76 | 5533.56 | 28.71 | 6164.21 |
| | 5 | 29.49 | 5573.55 | 23.32 | 6320.77 |
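To turn the raw numbers above into relative gains, a short sketch like the one below can help. It uses only the SOLAR rows copied from the table; `relative_gains` is a hypothetical helper name, and the same function could be fed any other model's rows.

```python
# Per beam size: (tokens/s with FA, VRAM MB with FA,
#                 tokens/s without FA, VRAM MB without FA)
# Values copied from the SOLAR-10.7B-Instruct-v1.0-ct2-int8 rows above.
solar = {
    1: (25.60, 14004.80, 23.79, 14414.86),
    2: (22.43, 14037.98, 19.23, 15660.13),
    3: (21.14, 14359.13, 18.40, 16636.70),
    4: (19.83, 14649.78, 14.97, 17790.98),
    5: (21.65, 14910.16, 14.92, 18824.00),
}


def relative_gains(rows):
    """Return {beam: (speedup %, VRAM saved %)} for flash attention vs. none."""
    gains = {}
    for beam, (tps_fa, vram_fa, tps_no, vram_no) in rows.items():
        speedup_pct = (tps_fa / tps_no - 1) * 100
        vram_saved_pct = (1 - vram_fa / vram_no) * 100
        gains[beam] = (round(speedup_pct, 1), round(vram_saved_pct, 1))
    return gains


for beam, (speedup, saved) in relative_gains(solar).items():
    print(f"beam {beam}: {speedup:+.1f}% tokens/s, {saved:.1f}% less VRAM")
```

For SOLAR this works out to a speedup of roughly 8% at beam size 1 growing to roughly 45% at beam size 5, with the VRAM saving going from about 3% to about 21%.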

For further comparison, here is a chart for transformers + bitsandbytes running in 4-bit mode, which can use a beam_size parameter (GGUF cannot). Overall, you see the same behavior: fewer tokens per second and more VRAM as beams increase. Interestingly, however, TPS decreases much more slowly after beams=2 and is within the margin of error...while VRAM continues to increase exponentially. I have not been able to test transformers + bitsandbytes with Flash Attention 2 yet unless someone can help me set it up on Windows...I've tried repeatedly...


I mean the number of input tokens. It would be great to compare with and without FA2 with prompt sizes from 1000 to 3000 tokens. I think the prompt that you used is quite good, but I'm not sure it is long enough.

Good to know. If one is efficient at RAG then you shouldn't need more than 3 contexts in MOST use cases. However, in a substantial minority of cases it's sometimes beneficial to have much larger inputs to a model...e.g. 1-3k, and leave 1k for the model to respond. I just haven't had a chance to test that yet but it's on my "to do" list. ;-)

Running llama2-13b with flash attention on and off led to the same result as with llama2-7b...Only a minuscule advantage from using flash attention, nowhere near the advantages with mistral-based models and/or SOLAR (and Llama3 for that matter):

Flash Attention True:

| Model | Beam Size | Tokens per Second | VRAM Usage (MB) |
|---|---|---|---|
| dolphin-llama2-7b-ct2-int8 | 1 | 39.43 | 10040.58 |
| dolphin-llama2-7b-ct2-int8 | 2 | 34.01 | 10601.28 |
| dolphin-llama2-7b-ct2-int8 | 3 | 28.87 | 11357.01 |
| dolphin-llama2-7b-ct2-int8 | 4 | 28.21 | 12045.11 |
| dolphin-llama2-7b-ct2-int8 | 5 | 26.31 | 12727.59 |

Flash Attention False:

| Model | Beam Size | Tokens per Second | VRAM Usage (MB) |
|---|---|---|---|
| dolphin-llama2-7b-ct2-int8 | 1 | 36.55 | 10273.63 |
| dolphin-llama2-7b-ct2-int8 | 2 | 30.82 | 10911.19 |
| dolphin-llama2-7b-ct2-int8 | 3 | 27.89 | 11836.30 |
| dolphin-llama2-7b-ct2-int8 | 4 | 24.67 | 12541.38 |
| dolphin-llama2-7b-ct2-int8 | 5 | 22.36 | 13321.12 |

I benched

With Flash Attention:

| Model | Beam Size | Tokens per Second | VRAM Usage (MB) |
|---|---|---|---|
| dolphin-llama2-7b-ct2-int8 | 1 | 39.43 | 10040.58 |
| dolphin-llama2-7b-ct2-int8 | 2 | 34.01 | 10601.28 |
| dolphin-llama2-7b-ct2-int8 | 3 | 28.87 | 11357.01 |
| dolphin-llama2-7b-ct2-int8 | 4 | 28.21 | 12045.11 |
| dolphin-llama2-7b-ct2-int8 | 5 | 26.31 | 12727.59 |

Without Flash Attention:

| Model | Beam Size | Tokens per Second | VRAM Usage (MB) |
|---|---|---|---|
| dolphin-llama2-7b-ct2-int8 | 1 | 36.55 | 10273.63 |
| dolphin-llama2-7b-ct2-int8 | 2 | 30.82 | 10911.19 |
| dolphin-llama2-7b-ct2-int8 | 3 | 27.89 | 11836.30 |
| dolphin-llama2-7b-ct2-int8 | 4 | 24.67 | 12541.38 |
| dolphin-llama2-7b-ct2-int8 | 5 | 22.36 | 13321.12 |

Once again, a llama-2 based model does not reap the benefits of flash attention like a Mistral/SOLAR/llama3 model does...To be clear, there's some benefit (more than with vanilla llama2), but not exponentially like with those other model architectures...

More benchmarks...I wanted to see if flash attention was better utilized when running in bfloat16, the model's original format. It still doesn't benefit as much as mistral/solar/llama3...

With Flash Attention:

| Model | Beam Size | Tokens per Second | VRAM Usage (MB) |
|---|---|---|---|
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 1 | 63.03 | 11023.59 |
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 2 | 39.87 | 11634.92 |
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 3 | 33.28 | 12427.27 |
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 4 | 29.81 | 12955.23 |
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 5 | 29.18 | 13469.11 |

No Flash Attention:

| Model | Beam Size | Tokens per Second | VRAM Usage (MB) |
|---|---|---|---|
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 1 | 59.10 | 11318.62 |
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 2 | 35.51 | 11890.77 |
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 3 | 32.11 | 12638.66 |
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 4 | 27.03 | 13305.93 |
| Phi-3-mini-4k-instruct-ct2-bfloat16 | 5 | 27.04 | 14034.74 |
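Another way to read the Phi-3 numbers above is to ask how much each extra beam costs relative to beam size 1, with and without flash attention. The sketch below does that arithmetic on the values copied from the two Phi-3 tables; `beam_cost` is a hypothetical helper name.

```python
# Per beam size: (tokens per second, VRAM usage in MB)
# Values copied from the Phi-3-mini-4k-instruct-ct2-bfloat16 tables above.
phi3_fa = {
    1: (63.03, 11023.59),
    2: (39.87, 11634.92),
    3: (33.28, 12427.27),
    4: (29.81, 12955.23),
    5: (29.18, 13469.11),
}
phi3_no_fa = {
    1: (59.10, 11318.62),
    2: (35.51, 11890.77),
    3: (32.11, 12638.66),
    4: (27.03, 13305.93),
    5: (27.04, 14034.74),
}


def beam_cost(rows, base_beam=1):
    """Return {beam: (% of base tokens/s retained, extra VRAM in MB vs. base)}."""
    base_tps, base_vram = rows[base_beam]
    return {
        beam: (round(tps / base_tps * 100, 1), round(vram - base_vram, 2))
        for beam, (tps, vram) in rows.items()
    }


for label, rows in (("flash attention", phi3_fa), ("no flash attention", phi3_no_fa)):
    retained, extra_vram = beam_cost(rows)[5]
    print(f"{label}: beam 5 keeps {retained}% of beam-1 speed, +{extra_vram} MB VRAM")
```

With these numbers, both configurations keep roughly 46% of their beam-1 speed at beam size 5, and the extra VRAM differs by only a few hundred MB, which matches the observation that Phi-3 benefits far less than the Mistral/SOLAR/Llama3 models.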