rustformers / llm

An ecosystem of Rust libraries for working with large language models

Home Page: https://docs.rs/llm/latest/llm/

Good ideas from llama.cpp

setzer22 opened this issue · comments

I've been tracking the llama.cpp repo. I'll use this issue to list any good ideas / things we should be aware of to keep up with in Rust land:

  • GPTQ quantization 👀 ggerganov/llama.cpp#9
  • Not sure how that is even possible (isn't the task I/O bound?), but people are claiming great speedups when loading the model in parallel. This should be pretty easy to implement using rayon. ggerganov/llama.cpp#85 (comment)
  • Seems there's an issue with the normalization function used. It should be RMSNorm. Would be good to keep an eye on this, and simply swap the ggml function once it's implemented on the C++ side (a standalone sketch of what RMSNorm computes follows this list) 👀 ggerganov/llama.cpp#173 (comment)
  • It looks like dropping to F16 for the memory_k and memory_v reduces memory usage. It is not known whether this hurts quality, but we should follow the C++ side and add a flag to drop to F16 for the memory. This would also make the cached prompts added as part of #14 take half the size on disk, which is a nice bonus: ggerganov/llama.cpp#154 (review)
  • Looks like the fix from #1 just landed upstream. We should make sure to fix it here too ggerganov/llama.cpp#161
  • The tokenizer used in llama.cpp has some issues. It would be better to use sentencepiece, which is the one that was used during the original LLaMA training. There seems to be a rust crate for sentencepiece. We should check if a drop-in replacement is possible ggerganov/llama.cpp#167
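
For the RMSNorm item above, here is a minimal standalone sketch of what the normalization computes. This is illustrative Rust, not the ggml kernel; the weight vector and epsilon value are assumptions.

```rust
/// RMSNorm: y_i = w_i * x_i / sqrt(mean(x^2) + eps)
/// Unlike LayerNorm, there is no mean subtraction and no bias term.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(weight)
        .map(|(v, w)| v * scale * w)
        .collect()
}

fn main() {
    let x = vec![1.0_f32, 2.0, 3.0, 4.0];
    let w = vec![1.0_f32; 4];
    println!("{:?}", rms_norm(&x, &w, 1e-6));
}
```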

Suggest pinning this issue :>

For the tokenizer item I suggest using https://github.com/huggingface/tokenizers/

Should work out of the box once converted. When this PR lands (huggingface/transformers#21955), it should become a simple let tokenizer = Tokenizer::from_file("filename"). Cheers!
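
For illustration, a minimal sketch of what loading and using a converted tokenizer with the tokenizers crate could look like. The file name is a placeholder, and this assumes the conversion PR above has landed.

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "tokenizer.json" is a placeholder for a converted LLaMA tokenizer file.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Encode a prompt; `true` adds special tokens if the tokenizer defines any.
    let encoding = tokenizer.encode("Hello, world!", true)?;
    println!("ids: {:?}", encoding.get_ids());
    println!("tokens: {:?}", encoding.get_tokens());
    Ok(())
}
```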

RMS norm landed, but they've reported regressions. Need to keep an eye on that.

@Narsil LlamaTokenizer needs a byte fallback option. 🥹

Good news everyone!

huggingface/tokenizers#1183

(If this goes, I'll try to make a release soon after)

Awesome! Looking forward to it :D

dnlmlr commented

A small comment on the parallel loading: it is definitely possible to improve I/O reads by parallelizing. This is much more effective on SSDs, but it still works on HDDs due to caching at different layers. However, this should be configurable, since performance can start to degrade beyond a certain degree of parallelism, depending on the storage medium and also on things like the kernel and buffer sizes.
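
As a rough illustration of the configurable parallelism described above, a sketch using rayon with a bounded thread pool. The file names and structure are hypothetical, not how the loader is actually organized.

```rust
use rayon::prelude::*;
use std::fs;
use std::io;

/// Read a set of files with a bounded level of parallelism.
/// The thread count should be tuned per storage medium, as noted above.
fn read_files_parallel(paths: &[&str], num_threads: usize) -> io::Result<Vec<Vec<u8>>> {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .build()
        .expect("failed to build rayon thread pool");

    // Each file is read on the pool; the first I/O error aborts the collect.
    pool.install(|| paths.par_iter().map(|&p| fs::read(p)).collect())
}

fn main() -> io::Result<()> {
    // Hypothetical shard names; the thread count is an arbitrary example.
    let data = read_files_parallel(&["part-0.bin", "part-1.bin"], 4)?;
    println!("read {} files", data.len());
    Ok(())
}
```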

@dnlmlr Do you have a benchmark to back that up? I didn't find that to be the case whenever I tried.

Memory-mapping was always consistently better than reading the file (provided you need the whole file), and it doesn't require parallelism (at user level, that is; no idea how the kernel handles it).
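
For comparison, a minimal sketch of memory-mapping a file with the memmap2 crate. The path is a placeholder; this is illustrative, not the loader code in this repo.

```rust
use memmap2::Mmap;
use std::fs::File;
use std::io;

fn main() -> io::Result<()> {
    // "model.bin" is a placeholder path.
    let file = File::open("model.bin")?;
    // Safety: the mapping is only valid while the file is not truncated
    // or modified by another process.
    let mmap = unsafe { Mmap::map(&file)? };
    println!("mapped {} bytes", mmap.len());
    Ok(())
}
```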

@setzer22 Are you okay with me closing this issue and splitting it into individual issues?

Yup, sounds good 👍

This issue has been superseded by #35, #62, #78, #79 and #80.