kiwix / overview


BERT embeddings search (Natural language question answering)

walking-octopus opened this issue · comments

I don't know if it's an appropriate place to submit such proposals, but I'd like to introduce the idea somewhere.

BERT is an encoder-only language model that can distill the meaning of words, sentences, passages, and documents into 768-dimensional vectors. This allows the creation of new models that map things like images into the same space, translators that map between languages, complex NLP classifiers, topic clustering, and, most importantly here, search.

By fine-tuning the model for search-like scenarios (preferably), embedding the search query, and finding the vectors closest to it in the index, we can get reliable search results that match concepts, not just text strings.
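For a concrete picture of that pipeline, here is a minimal sketch (assuming the sentence-transformers package and the all-MiniLM-L6-v2 model mentioned further down; the example documents and the brute-force scoring are placeholders for a real index):

```python
# Minimal semantic-search sketch (assumes sentence-transformers is installed and
# that the documents fit in memory; a real index would use hnswlib/faiss, see below).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

docs = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The Great Wall of China was built to protect against invasions.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)          # shape (3, 384)

query_vec = model.encode(["famous landmark in France"], normalize_embeddings=True)
scores = doc_vecs @ query_vec.T          # cosine similarity (vectors are normalized)
print(docs[int(np.argmax(scores))])      # matches the concept, not the literal words
```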

After playing around with codequestion, an offline StackExchange search tool, I found it to be quite accurate, small, and speedy. Its document DB weighs 1.5 GB, which either means they've done some filtering or used a more efficient format. Its vector embeddings index only weighs 260 MB, which is quite little. While embeddings may take some compute to generate, they're very efficient to search afterwards.

Plus, I guess topic clustering and classification may help with filtering out non-essential data in large archives, like spam or niche categories.
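As a rough illustration of that idea, one could cluster the article embeddings and drop clusters flagged as unwanted; this is only a sketch, assuming scikit-learn and an existing array of embeddings, with the cluster count and flagged ids being placeholders:

```python
# Hypothetical spam/niche filtering via topic clustering.
# Assumes an (N, d) numpy array `article_vecs` of embeddings and scikit-learn.
from sklearn.cluster import KMeans

k = 50                                   # number of topic clusters, tuned per archive
labels = KMeans(n_clusters=k, random_state=0).fit_predict(article_vecs)

# drop articles falling into clusters manually reviewed and flagged as unwanted
unwanted_clusters = {3, 17}              # purely illustrative cluster ids
keep_mask = [lab not in unwanted_clusters for lab in labels]
```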

commented

@walking-octopus Thank you for your ticket. You are at the right place. So far we haven't discussed the topic of embedding ML-based capacity within Kiwix. This is not a topic we are familiar with, but there is no taboo about that.

I haven't tested/analysed the link you have shared so far. Might be indeed the right time to experiment.

One strong engineering constraint I see is that all of this should be in native code, better in c++, to be merged within the libkiwix.

While most of the ML ecosystem is in Python, I suppose libtorch would be a giant unacceptable dependency.

Thankfully, someone recently re-implemented BERT on top of GGML (a minimalistic tensor library for C/C++), so now we have bert.cpp.

But it seems to have some limitations (at the time of writing):

  • Tokenizer doesn't correctly handle Asian writing (CJK, maybe others);
  • GGML does not yet support real batching, so it is slower than it could be;
  • Doesn't respect tokenizer, pooling or normalization settings from the model card.

Additionally, for vector similarity search, Facebook's faiss library may help.
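As a rough sketch of what that could look like (shown here via the faiss Python bindings purely for illustration, even though a libkiwix integration would use the C++ API; dimensions and data are placeholders):

```python
# Exact nearest-neighbour search with faiss (illustrative; assumes normalized
# float32 embeddings so inner product equals cosine similarity).
import numpy as np
import faiss

d = 384                                   # embedding dimension (model-dependent)
doc_vecs = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(doc_vecs)

index = faiss.IndexFlatIP(d)              # brute-force inner-product index
index.add(doc_vecs)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)      # top-5 most similar documents
```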

commented

Topic is trendy, @santhoshtr has made a POC using another tech stack, explained it and published the code: https://thottingal.in/blog/2023/07/21/wikiqa/. At this stage, I believe we should reuse this code to see how it scales with a full Wikipedia ZIM file. Should be doable easily using python-libzim.
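A minimal sketch of pulling article text out of a ZIM file with python-libzim for such an experiment (the file name and entry path are placeholders, and the exact path convention depends on how the ZIM was produced):

```python
# Reading an article out of a ZIM file with python-libzim (sketch).
from libzim.reader import Archive

zim = Archive("wikipedia_en_simple_all_nopic.zim")   # placeholder file name
print(zim.entry_count, "entries")

entry = zim.get_entry_by_path("A/Earth")             # path convention varies per ZIM
html = bytes(entry.get_item().content).decode("utf-8")
# ...strip the HTML, chunk the text, embed each chunk, and store the vectors elsewhere
```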

I have a concern regarding the storage of the vectors + metadata. SQLite might be good, but I'm not that eager to embed a SQLite engine within Kiwix. A simpler/dedicated flat-file storage would be better.

@walking-octopus Any thoughts?

@kelson42 I would like to give it a try, and I am currently learning about it on Google. Can you tell me more about what exactly we need to implement in Kiwix, and maybe you can please break it down?

commented

@amitpanwar789 This is not related to Kiwix directly, at least not yet. This is a research ticket. I have transferred it to the kiwix/overview repository, which is more appropriate.

This approach uses RAG to create snippets of every article. Having experimented with RAG search, I find it very hit-and-miss, as a lot depends on how you "cut" the articles into snippets. Data preparation and data cleaning are crucial. You need to have some overlap between the snippets, ideally not cutting them before section headings or titles. Using a fixed window of x tokens is likely to give poor results, or lots of repeated search results as we see in the demo.
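For illustration, a naive overlapping chunker along those lines might look like the sketch below (real data preparation would work on the parsed article structure rather than raw text, and the window/overlap sizes are arbitrary):

```python
# Naive overlapping chunker (sketch): split on blank lines first, then pack
# paragraphs into windows of ~max_chars, keeping some overlap between chunks.
def chunk_article(text: str, max_chars: int = 1500, overlap_paras: int = 1):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if size + len(para) > max_chars and current:
            chunks.append("\n\n".join(current))
            current = current[-overlap_paras:]        # overlap with the next chunk
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```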

Calculating vectors for RAG is quite processor-intensive (it took 10 minutes to calculate the vectors for a small selection of PDF articles on my reasonably powerful CPU).

While RAG would certainly be useful for smaller ZIMs, especially for those where we cannot be sure that the model has been trained on the data contained in the ZIM, it strikes me as somewhat redundant for full Wikipedia search with Kiwix. This is because we know that all recent models are already trained on the whole of Wikipedia (we'd need to check if that holds for all languages and all models). Llama 3 can already do reverse lookup with (at least) full English Wikipedia very competently: a carefully crafted prompt will generate search terms (the terms in square brackets in the image below). All that would be needed in that case would be to check that each search term yields results in our existing search, and link it to the relevant article if so.

[image: screenshot of Llama 3 output with the generated search terms in square brackets]
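The "check each generated term against the existing search" step could be as simple as the following sketch; the bracketed-term format is taken from the prompt described above, and `search_zim()` is a purely hypothetical stand-in for the existing full-text search:

```python
# Turn LLM output like "... the [Battle of Hastings] in [Normandy] ..." into links.
# `search_zim()` is a hypothetical wrapper around the existing keyword search.
import re

def link_generated_terms(llm_output: str, search_zim) -> dict:
    terms = re.findall(r"\[([^\[\]]+)\]", llm_output)    # terms in square brackets
    links = {}
    for term in terms:
        results = search_zim(term)                       # existing keyword search
        if results:                                      # only link if the archive has it
            links[term] = results[0]                     # best-matching article path
    return links
```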

I would like to outline some considerations on implementing this feature.

First of all, how are we going to store the embeddings? One vector costs 6145 bytes and there are 6,821,862 articles as of May 10th 2024. If we were to just attach a blob to every article, it would cost ~42 GB just to store the embeddings (discounting quantization), not to mention the memory required to query them all. We would certainly need to use product quantization and graph-based indexing.
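To put numbers on that, here is the back-of-the-envelope arithmetic plus a sketch of what product quantization with faiss could bring it down to (the PQ parameters are arbitrary placeholders):

```python
# Back-of-the-envelope storage cost, then a product-quantized faiss index sketch.
import faiss

n_articles = 6_821_862
raw_bytes_per_vec = 6144                  # ~6 KB/vector as quoted above
print(n_articles * raw_bytes_per_vec / 1e9, "GB raw")              # ~41.9 GB

d, m, nbits = 768, 64, 8                  # 64 sub-quantizers x 8 bits = 64 bytes/vector
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 4096, m, nbits)             # needs training on a sample
print(n_articles * m / 1e9, "GB of PQ codes (plus coarse index overhead)")  # ~0.44 GB
```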

Needless to say, I think this is a domain where, at this scale, it is best to offload such concerns to a dedicated library. The best choice may be either raw hnswlib, which provides indexing, querying, and efficient serialization while being just a header-only C++ library, or faiss, which includes its own HNSW implementation, seems to add more efficient search, and claims to support working with collections that do not fit in RAM, but depends on BLAS.
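For reference, the hnswlib workflow (shown here via its Python bindings, though the equivalent calls exist in the header-only C++ API; dimensions and data are placeholders) is roughly:

```python
# Approximate nearest-neighbour index with hnswlib: build, query, serialize.
import numpy as np
import hnswlib

d = 384
doc_vecs = np.float32(np.random.rand(10_000, d))

index = hnswlib.Index(space="cosine", dim=d)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(doc_vecs, np.arange(10_000))

index.set_ef(50)                                  # query-time accuracy/speed trade-off
labels, distances = index.knn_query(doc_vecs[:1], k=5)

index.save_index("embeddings.hnsw")               # compact on-disk serialization
# later: idx = hnswlib.Index(space="cosine", dim=d); idx.load_index("embeddings.hnsw")
```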

The only remaining concern may be the embedding process, which is heavily compute-intensive if we are talking about LLMs. FastText seems to be the most efficient option, sitting somewhere between a bag of words and transformer embeddings in terms of quality. Wikipedia2Vec is a project entirely dedicated to lightweight Wikipedia embeddings and seems to leverage additional metadata from links. As for the fancier transformer models, all-MiniLM-L6-v2 seems quite lightweight and may be suitable; Cohere and OpenAI have higher accuracy and a free pre-embedded Wikipedia index, but they would need an internet connection and the respective account to embed the user's query, so they are not an option.
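For scale, a FastText-based embedder is only a few lines; this sketch assumes the pretrained common-crawl English vectors, and the model file is whatever is actually chosen:

```python
# Lightweight sentence embeddings with FastText (sketch; the full pretrained
# cc.en.300.bin model is several GB, and a reduced custom model is also possible).
import fasttext
import fasttext.util

fasttext.util.download_model("en", if_exists="ignore")    # fetches cc.en.300.bin
model = fasttext.load_model("cc.en.300.bin")

vec = model.get_sentence_vector("capital city of France") # 300-dim, cheap to compute
print(vec.shape)                                           # (300,)
```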

And finally, what exactly is going to be embedded? All embeddings are just N-dimensional vectors, so they cost the same to store (depending on the model's N), but the encoding has its own time/memory complexity and can scale rapidly with a large dataset. On top of that, LLM-based models are pretrained on a certain context window and are simply incapable of understanding text beyond a certain token limit without some hacks. We would have to choose how much and what to embed, be it the title, the first paragraph, the abstract, the question and most upvoted answer, or whatever is appropriate for a given archive.
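As an illustration of that trade-off, here is a sketch of composing a per-article "embedding text" and truncating it to a transformer's token limit (the chosen fields and the 256-token budget are arbitrary assumptions):

```python
# Decide what gets embedded per article and respect the model's context window.
# (Sketch; assumes the Hugging Face tokenizer matching the chosen embedding model.)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embedding_text(title: str, first_paragraph: str, max_tokens: int = 256) -> str:
    text = f"{title}\n\n{first_paragraph}"
    ids = tokenizer(text, truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```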

I think that for now, it may be wise to not even consider using transformer-based embeddings. Large datasets and heavy indexing do not mix well, especially when you are not a VC-funded startup burning money just to say you've added "AI". Let's consider how an extra vector index can be added, the UX of embedding-based navigation (be it text search, some pretty semantic graph/neighborhood, or just a generated "see also" section) and non-interactive use-cases, and if that proves useful, make a proof-of-concept with something like a medium-sized StackExchange forum or SimpleWikipedia using FastText.

I am very excited for how massive of an improvement to navigation this can bring, but we'd need to think this through first.

Those considerations seem very apt to me @walking-octopus. Thank you for explaining them so clearly. Indeed experimentation should be done on small datasets to assess viability and how much of an increase in ZIM size the inclusion of embeddings would entail.

IMHO, for huge datasets like full English Wikipedia, it seems more sensible to leverage the fact that AIs such as Llama 3 have already been trained on the dataset we are targeting. The size of a very competent Llama 3 quant is ~4.7 GB, or 8 GB for the Q8 GGUF. Neither seems excessive in the context of a 100 GB Wikipedia ZIM. I can run the 8 GB model on CPU only at an acceptable speed.