ngxson / wllama

WebAssembly binding for llama.cpp - Enabling in-browser LLM inference

Home Page: https://ngxson.github.io/wllama/examples/basic/

How would you implement RAG / Document chat?

flatsiedatsie opened this issue

In your readme you mention:

Maybe doing a full RAG-in-browser example using tinyllama?

I've been looking into a way to allow users to 'chat with their documents'. A popular concept. Specifically, I was looking into 'Fully local PDF chatbot'. It seems... complicated.

So I was wondering: if one wanted to implement this feature using Wllama, what are the 'components' of such a solution?

Would it be something like...

  • Wllama's embedding feature turns text chunks into vector objects?
  • Those could then be stored in Voy?
  • magic
  • magic
  • The user gets an answer to their question, e.g. "The sun is 2948520 degrees, which I found on page 16"?

What would the steps actually be?

A classic RAG system consists of a vector database + a generative model. With wllama, this can be achieved with (see the sketch after this list):

  • An embedding model. However, I still haven't found a really good one.
  • For the database, we can use Voy as you mentioned, or HNSW, which has pure-JS implementations (we don't need too much performance for this part; our database is relatively small anyway).
  • A good generative model that does not hallucinate. This is very important and requires a specific model, for example Llama3-ChatQA-1.5-8B by NVIDIA. These models are generally "dumb", but they are safe because they don't make up information that isn't found in the retrieved context.
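
To make the "magic" steps concrete, here is a rough JavaScript sketch of that pipeline. Everything in it (embed, generate, splitIntoChunks, answerFromDocument) is a hypothetical helper named only for illustration, not wllama's API: you would wire embed() and generate() to whichever embedding and generative models you pick, and in a real app you would replace the plain array + linear scan with a vector store like Voy.

    // Hypothetical helpers (names are just for this sketch) -- wire them to whatever
    // embedding model and generative model you end up using.
    async function embed(text) {
      throw new Error('TODO: return a normalized embedding vector (array of numbers) for `text`');
    }

    async function generate(prompt) {
      throw new Error('TODO: return the generative model\'s answer for `prompt`');
    }

    // Naive chunking: split on blank lines, then cap each piece at ~1000 characters.
    function splitIntoChunks(text, maxLen = 1000) {
      const chunks = [];
      for (const para of text.split(/\n\s*\n/)) {
        for (let i = 0; i < para.length; i += maxLen) {
          chunks.push(para.slice(i, i + maxLen));
        }
      }
      return chunks.filter(c => c.trim().length > 0);
    }

    async function answerFromDocument(documentText, question) {
      // 1. Index: embed every chunk once (a plain array stands in for Voy/HNSW here).
      const index = [];
      for (const chunk of splitIntoChunks(documentText)) {
        index.push({ chunk, vector: await embed(chunk) });
      }

      // 2. Retrieve: embed the question and rank chunks by cosine similarity
      //    (a dot product is enough because the vectors are normalized).
      const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
      const questionVector = await embed(question);
      const topChunks = index
        .map(entry => ({ ...entry, score: dot(entry.vector, questionVector) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, 3);

      // 3. Generate: put only the retrieved chunks into the prompt and ask the model.
      const prompt = [
        'Answer the question using ONLY the context below.',
        'If the answer is not in the context, say that you do not know.',
        '',
        'Context:',
        ...topChunks.map(entry => '- ' + entry.chunk),
        '',
        'Question: ' + question,
        'Answer:',
      ].join('\n');
      return await generate(prompt);
    }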

Another idea that is only possible if your document is short and predefined, is to construct a session and reuse it later (via sessionSave and sessionLoad) - This is useful in my case for example, if the chatbot is purely to introduce a specific website, we don't even need to make a vector database or to have embeddings at all. The downside is that this is not practical for any other usages.

For a small embedding model good for this case, I can recommend this one:
sentence-transformers/multi-qa-MiniLM-L6-cos-v1 (GGUF)

Getting there...

(screenshot: 2024-05-21 at 14:01:51)

Currently using Transformers.js, because I could find easy-to-copy examples:

    import { pipeline } from '@xenova/transformers';

    // Load the embedding model (this runs inside a web worker, hence self.postMessage
    // for reporting download/initialization progress back to the main thread).
    const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
      quantized: false,
      progress_callback: data => {
        self.postMessage({ type: 'embedding_progress', data });
      },
    });

    // Turn the array of text chunks into mean-pooled, normalized embeddings.
    const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
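
A possible follow-up sketch for the retrieval part, reusing extractor and embeddings from above. Here question is assumed to be the user's question string, and .tolist() converts the output tensor into plain JS arrays; since the embeddings are normalized, cosine similarity is just a dot product.

    // Find which chunks to hand to the LLM.
    const chunkVectors = embeddings.tolist();   // [numChunks][hiddenSize] plain arrays
    const questionVector = (await extractor([question], { pooling: 'mean', normalize: true })).tolist()[0];

    // The vectors are normalized, so cosine similarity reduces to a dot product.
    const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

    const topChunks = texts
      .map((text, i) => ({ text, score: dot(chunkVectors[i], questionVector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, 3);   // pass these top chunks to the LLM for the final answer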

I've also seen mention of this model for embedding: nomic-ai/nomic-embed-text-v1. But for now... it works.

Next: get an LLM to summarize the chunks.

Ah nice. I tried nomic-embed-text before, but it doesn't work very well. Maybe that's because I used the Albert Einstein wiki page as the example, which is a very hard one.

Maybe you can give it a try?

Some questions that I tried but no success:

  • Does he play guitar?
  • Does he have a child?
  • How many wives does he have?

> Some questions that I tried but no success:
> Does he play guitar?

Did you let the LLM re-formulate the prompt first? In my project I just added a step that does that: it looks at the conversation history first and rewrites the user's prompt to be explicit. So "he" becomes "Albert Einstein". It seems to work.
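
For reference, that rewriting step can be a single extra LLM call before retrieval. A rough sketch (the prompt wording is illustrative only, and generate() stands for whatever call runs your LLM, not code from this project):

    // Rewrite a follow-up question into a standalone one before embedding it,
    // so that "Does he play guitar?" becomes "Does Albert Einstein play guitar?".
    // `generate(prompt)` stands for whatever call runs your LLM and returns text.
    async function rewriteQuestion(history, question) {
      const prompt = [
        'Rewrite the last user question so it can be understood without the conversation.',
        'Replace pronouns like "he", "she" or "it" with the names they refer to.',
        'Output only the rewritten question.',
        '',
        'Conversation so far:',
        ...history.map(turn => turn.role + ': ' + turn.content),
        '',
        'Last question: ' + question,
        'Rewritten question:',
      ].join('\n');
      return (await generate(prompt)).trim();
    }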

In fact, it's all working now, although the answer in this case seems almost too good to be based solely on the retrieved chunks...

(screenshot: 2024-05-27 at 08:40:37)