microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.

Home Page: https://microsoft.github.io/kernel-memory


Make ITextEmbeddingGenerator.CountTokens and ITextGenerator.CountTokens ValueTask<int>

JohnGalt1717 opened this issue · comments

Right now these are synchronous, but if you're using an online service to implement them (e.g. the llama.cpp server), they need to be able to return responses asynchronously. Having them return ValueTask<int> would be greatly helpful.

Conversely, GenerateEmbeddingAsync could return ValueTask as well.
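For clarity, a minimal sketch of what the requested change might look like. This is not the actual kernel-memory definition; the real interfaces have additional members, and the member name `CountTokensAsync` is illustrative:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Sketch only: not the actual kernel-memory interface, just the
// proposed shape of the token-counting member.
public interface ITextGenerator
{
    // Current: int CountTokens(string text);
    // Proposed: ValueTask<int>, so service-backed tokenizers (e.g. a
    // llama.cpp server) can await an HTTP call, while local tokenizers
    // can still complete synchronously without allocating a Task.
    ValueTask<int> CountTokensAsync(
        string text,
        CancellationToken cancellationToken = default);
}
```

A local implementation would return `new ValueTask<int>(count)` and complete synchronously, so the common in-process path stays allocation-free; only service-backed implementations would pay the cost of a genuinely asynchronous call.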

We tried making it async during the initial implementation, but it would affect the speed and complexity of the text chunker, which would need quite a bit of rewriting, and it would raise questions about usage. Counting tokens is currently meant to be fast and free; e.g. we use CountTokens even for logging statements. If we change that assumption, we'll need to reassess each use case to avoid unnecessary calls and unforeseen expenses.

OK, well, what I'm doing is creating an implementation of the LLama API (native, not emulated). I can't find a good LLama token counter for C# that works without an API call.

Suggestions?

IIRC Llama uses SentencePiece. Is anything available in that direction?

Seems like you guys have one?

https://github.com/microsoft/BlingFire

Did you end up going with this? I'm facing this exact issue right now and haven't found a good solution. I'm still doing it async, but just using .GetAwaiter().GetResult(), which is terrible. BlingFire seems OK, but it also requires you to load and unload the model each time, which seems bad to me.
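For reference, the sync-over-async bridge described above looks roughly like this. The class name and the `/tokenize` endpoint are hypothetical, standing in for any service-backed tokenizer:

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Sketch only: "RemoteTokenCounter" and its endpoint are assumptions,
// not part of kernel-memory or llama.cpp.
public sealed class RemoteTokenCounter
{
    private readonly HttpClient _http = new();

    // Hypothetical remote call that returns a token count.
    private async Task<int> CountTokensRemoteAsync(string text)
    {
        var response = await _http.PostAsync(
            "http://localhost:8080/tokenize", new StringContent(text));
        var body = await response.Content.ReadAsStringAsync();
        return int.Parse(body); // assumes the endpoint returns a bare integer
    }

    // Sync-over-async bridge to satisfy the synchronous CountTokens
    // contract. It blocks the calling thread until the HTTP call
    // completes and can deadlock under a SynchronizationContext
    // (e.g. classic ASP.NET or UI threads), hence "terrible".
    public int CountTokens(string text)
        => CountTokensRemoteAsync(text).GetAwaiter().GetResult();
}
```

This is exactly the pattern an async-capable interface would make unnecessary: the blocking bridge exists only because the interface forces a synchronous return.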

I don't exactly see why this method can't be async.