rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models

Home Page: https://docs.rs/llm/latest/llm/

Issues using llm with whisper-rs

jafioti opened this issue

Hi, I'm trying to use llm in the same project where I'm already using whisper-rs (https://github.com/tazz4843/whisper-rs), and the GGML builds for the two projects seem to be interfering with each other. Could it be that the crates look for the same files, and Cargo folds them into the same dependency?

For instance, when I load up a model in llm, I get this error: `thread 'main' panicked at 'called Result::unwrap() on an Err value: InvariantBroken { path: Some("./models/llama-2-7b-chat.ggmlv3.q4_0.bin"), invariant: "226001103 <= 2" }'`
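
For context, the load is essentially the example from the llm README, with the model path from the error above:

```rust
use std::path::Path;

fn main() {
    // Standard llm loading path, per the llm README.
    let llama = llm::load::<llm::models::Llama>(
        // path to the GGML model file
        Path::new("./models/llama-2-7b-chat.ggmlv3.q4_0.bin"),
        // use the tokenizer embedded in the model file
        llm::TokenizerSource::Embedded,
        // default llm::ModelParameters
        Default::default(),
        // print loading progress to stdout
        llm::load_progress_callback_stdout,
    )
    .unwrap(); // <- this is the unwrap that panics
    let _ = llama;
}
```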

When I remove whisper-rs from my project, it compiles and runs fine.

Any ideas on how to resolve this? I assumed I could just rename one of the sys crates, but that doesn't seem to help.

Yeah, that's unfortunately a little gnarly because both llm and whisper-rs use GGML - which is a C library with no function name mangling - so the linker has to pick one of the two conflicting implementations (and I believe whisper's is much older). Honestly, I'm surprised it compiled at all!
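
Roughly what's happening, as a simplified sketch (the real declarations are generated inside each -sys crate, and the return type here is simplified from `*mut ggml_context`):

```rust
// Both ggml-sys (pulled in by llm) and whisper-rs's sys crate bind the
// same unmangled C symbols, such as this one:
#[allow(non_camel_case_types)]
#[repr(C)]
pub struct ggml_init_params {
    pub mem_size: usize,
    pub mem_buffer: *mut core::ffi::c_void,
    pub no_alloc: bool,
}

extern "C" {
    // Exported by BOTH linked static libraries. The linker keeps exactly
    // one definition of `ggml_init` for the final binary, so llm can end
    // up calling into whisper.cpp's older GGML, which is why loading a
    // newer model file trips the InvariantBroken check.
    pub fn ggml_init(params: ggml_init_params) -> *mut core::ffi::c_void;
}
```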

I would quite like to see an implementation of whisper in Rust, but it would require someone with more free time than me to do it.

Depending on how badly you need it, you could fork whisper-rs and whisper.cpp and rename things so that there are no conflicts, but that's obviously not ideal. For a short-term hacky fix, I'd suggest just breaking out the whisper-rs code into a separate application or dynamic library to ensure that the linker doesn't see both GGML implementations 😦
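
If you go the separate-application route, the glue on the llm side can be as simple as shelling out to a small transcription binary. A sketch, where `transcribe-cli` and its `--input` flag are hypothetical stand-ins for whatever thin wrapper you build around whisper-rs:

```rust
use std::process::Command;

/// Runs transcription in a separate process, so this binary never links
/// whisper.cpp's GGML. `transcribe-cli` is a hypothetical helper binary.
fn transcribe(audio_path: &str) -> std::io::Result<String> {
    // Spawn the helper and wait for it to finish.
    let output = Command::new("transcribe-cli")
        .arg("--input")
        .arg(audio_path)
        .output()?;

    if !output.status.success() {
        // Surface the helper's stderr as an error.
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            String::from_utf8_lossy(&output.stderr).into_owned(),
        ));
    }
    // The helper writes the transcript to stdout.
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```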

Candle has a completely Rust-native whisper example, which runs relatively fast. It doesn't support GGML models yet, but that's currently being worked on.

Of course! Do you know if they have any plans to break out the examples into their own libraries?

Well, I actually don't know; I'm currently only focused on helping a bit with the quantization support. But I'd guess they wouldn't be unwilling to split the examples into libraries.

Happy to report that the candle whisper demo works great! Certainly slower than ggml, but still reasonably fast. I'll close this out since it's not really an issue with this crate in particular.

@jafioti Theoretically, candle has supported quantized GGML tensors since yesterday, meaning you could probably recreate whisper.cpp with candle as a backend and get basically the same performance. Currently only q4_0 is supported, but I'm planning to port most of the quantization formats over.

Is there an example of using the 4-bit quantization? I'm using candle's llama, but when I set the dtype to u8 I get not-implemented errors.

Take a look at the quantized llama example. Basically, only the matmul operation supports quantized tensors, and it always produces an f32/f16 output: your weights are stored in the quantized format, but during inference you can use all candle operations as normal. You can create these QTensors either from a GGML file or by quantizing normal f32 tensors.
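
A rough sketch of that flow (written against a recent candle API; exact names and signatures have shifted a bit between releases, so treat the calls as assumptions):

```rust
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;

    // Start from an ordinary f32 weight matrix...
    let weight = Tensor::randn(0f32, 1.0, (64, 64), &device)?;

    // ...and quantize it to q4_0. (Alternatively, QTensors can be read
    // straight out of a GGML file with candle's quantized file loader.)
    let qweight = QTensor::quantize(&weight, GgmlDType::Q4_0)?;

    // QMatMul is the one op that consumes quantized tensors directly.
    let qmatmul = QMatMul::from_qtensor(qweight)?;

    // Inputs and outputs stay in f32: the weights live in the quantized
    // format, the matmul dequantizes on the fly, and every other candle
    // op works on the float result as usual.
    let xs = Tensor::randn(0f32, 1.0, (1, 64), &device)?;
    let ys = qmatmul.forward(&xs)?;
    println!("output shape: {:?}", ys.shape());
    Ok(())
}
```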