NormXU / nougat-latex-ocr

Codebase for fine-tuning / evaluating nougat-based image2latex generation models

Home Page: https://arxiv.org/abs/2308.13418


Adding quantized models

ProfFan opened this issue · comments

Hi,

Thank you for this amazing model! I made a small tray utility that uses your model to convert screenshots to LaTeX: https://github.com/ProfFan/Snap2LaTeX

However, running it locally is not fast. It would be great if we could make quantized versions suitable for on-device inference :)
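A minimal sketch of one possible starting point, not a recommendation for this model specifically: PyTorch's dynamic quantization stores `nn.Linear` weights as int8 and runs on CPU, which matches the on-device use case. The hub id `Norm/nougat-latex-base` is assumed from this repo's README; the accuracy impact on LaTeX output would need to be measured.

```python
import torch
from transformers import VisionEncoderDecoderModel

# Assumed hub id (from this repo's README); adjust if it differs.
model = VisionEncoderDecoderModel.from_pretrained("Norm/nougat-latex-base")
model.eval()

# Dynamic quantization: nn.Linear weights become int8, activations are
# quantized on the fly. CPU-only, so it fits on-device inference; the
# effect on output quality should be evaluated before shipping.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```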

@ProfFan I'm glad this model is useful to you. Snap2LaTeX is indeed impressive; thank you for your efforts in making such a cool tool. While I'm not very familiar with quantization, I believe I could develop a smaller Nougat-LaTeX model based on nougat-small, which has only 4 decoder layers. According to my evaluation, it can achieve ~40 tokens/s on an A100 with flash-attn2 in fp16.
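For reference, a sketch of how the fp16 setting above could be reproduced with plain `transformers` (enabling flash-attn2 itself is a separate question, touched on further down). `Norm/nougat-latex-base` is the assumed hub id, and the random tensor only stands in for a real preprocessed image when timing throughput:

```python
import torch
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained(
    "Norm/nougat-latex-base",  # assumed hub id
    torch_dtype=torch.float16,
).to("cuda").eval()

# The donut-swin encoder config stores its input size; nougat-style
# checkpoints use a [height, width] list. A random tensor is enough
# for a throughput test.
height, width = model.config.encoder.image_size
pixel_values = torch.randn(
    1, 3, height, width, dtype=torch.float16, device="cuda"
)

with torch.inference_mode():
    output_ids = model.generate(pixel_values, max_new_tokens=512)
```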

For simple (non-multiline/array) equations (example):

[example image: a simple single-line equation]

even the larger model is pretty fast (using the MPS backend), averaging about 4 seconds after the first run (shader compilation, etc.). So the current model is pretty usable already :)

For bigger matrices and multi-line equations, the decoding time grows sharply, as expected: the output sequence is much longer and tokens are generated one at a time. Interestingly, converting the model to half precision does not help that much.

[example image: a multi-line matrix equation]
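A sketch of the MPS setup being described, again assuming the hub id `Norm/nougat-latex-base`. That fp16 helps so little here is consistent with autoregressive decoding being dominated by per-step overhead rather than raw matmul throughput, which a half-precision cast does not reduce:

```python
import torch
from transformers import VisionEncoderDecoderModel

# Prefer Apple's MPS backend when available, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = VisionEncoderDecoderModel.from_pretrained(
    "Norm/nougat-latex-base"  # assumed hub id
).to(device).eval()

# Optional half-precision cast; as noted above, the observed speedup on
# MPS is small, consistent with decoding being overhead-bound.
if device == "mps":
    model = model.half()
```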

How do you add flash-attn2 to Nougat? donut-swin doesn't seem to support flash-attn2.
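Right, the donut-swin implementation in `transformers` has no flash-attn2 code path, so requesting it for the whole `VisionEncoderDecoderModel` raises at load time. A hedged sketch of one workaround, with the caveat that which implementations each submodule accepts depends on the installed `transformers` version: try flash-attn2 first, then fall back to PyTorch's SDPA, which dispatches to fused/flash kernels on supported GPUs without needing per-model flash-attn code.

```python
import torch
from transformers import VisionEncoderDecoderModel

model_id = "Norm/nougat-latex-base"  # assumed hub id

try:
    # Raises if any submodule (here, the donut-swin encoder) lacks a
    # flash-attn2 implementation in the installed transformers version.
    model = VisionEncoderDecoderModel.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
    )
except (ValueError, ImportError):
    # Fallback: PyTorch's scaled-dot-product attention, which routes to
    # flash kernels where the hardware and shapes allow it.
    model = VisionEncoderDecoderModel.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        attn_implementation="sdpa",
    )

model = model.to("cuda").eval()
```

If neither implementation is accepted for donut-swin, the remaining option would be to patch only the decoder's attention, since the decoder dominates end-to-end latency during generation anyway.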