belladoreai / llama-tokenizer-js

JS tokenizer for LLaMA 1 and 2

Home Page: https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/


Discussion: using this with sqlite-vss

linonetwo opened this issue

https://github.com/asg017/sqlite-vss is a Bring-your-own-vectors database; it is compatible with any embedding or vector data you have. Consider using OpenAI's Embeddings API, HuggingFace's Inference API, sentence-transformers, or any of these open source models. You can insert vectors into vss0 tables as JSON or raw bytes.
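From its docs, storing and querying vectors looks roughly like this (my sketch, assuming the better-sqlite3 driver and the sqlite-vss Node bindings; table and column names are made up):

```js
import Database from "better-sqlite3";
import * as sqlite_vss from "sqlite-vss";

const db = new Database("articles.db");
sqlite_vss.load(db); // registers the vss0 virtual table module

// A vss0 table of 4-dimensional vectors (real embeddings are much wider).
db.exec("create virtual table vss_chunks using vss0(embedding(4))");

// Vectors can be inserted as JSON text (or raw bytes).
db.prepare("insert into vss_chunks(rowid, embedding) values (?, ?)")
  .run(1, JSON.stringify([0.1, 0.2, 0.3, 0.4]));

// Nearest-neighbor query; older SQLite versions may need vss_search_params
// instead of a plain limit.
const rows = db
  .prepare("select rowid, distance from vss_chunks where vss_search(embedding, ?) limit 5")
  .all(JSON.stringify([0.1, 0.2, 0.3, 0.4]));
console.log(rows);
```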

So I just need to split a long article into 1000-word chunks, pass them to llama-tokenizer-js, and then store the results in sqlite-vss. Later I can pass query words to llama-tokenizer-js and use the output as a vector to search sqlite-vss?

(Can I regard the tokens as a vector that can be used in the vector db?)

Can I regard the tokens as a vector that can be used in the vector db?

The short answer is "no". An array of LLaMA tokenIds can be considered a vector, but not in any way that is useful in these vector embedding engines. For embeddings to be useful, you need to create vectors with an algorithm that produces similar vectors for semantically similar input texts.

For example, the text "bishop captures queen" is semantically similar to the text "pawn advances": both describe moves in a chess game. If you were to tokenize these texts with llama-tokenizer-js (or any other tokenizer), the resulting tokenIds would be completely different, with virtually no similarity between them. That's because tokenization algorithms were not designed to produce tokenIds in such a way that similar texts yield similar tokenIds; it's a completely different use case. What you need is an algorithm like Word2Vec.
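To make that concrete, here is a quick sketch using this library's encode API; the two id arrays it prints share no useful structure:

```js
import llamaTokenizer from "llama-tokenizer-js";

// Two semantically similar chess texts...
const a = llamaTokenizer.encode("bishop captures queen");
const b = llamaTokenizer.encode("pawn advances");

// ...become unrelated arrays of vocabulary indices (even different lengths).
// Any distance measured between these arrays says nothing about meaning,
// which is why tokenIds cannot serve as embeddings.
console.log(a);
console.log(b);
```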

Thanks, I understand it now. So I can only use this tokenizer with https://github.com/Atome-FE/llama-node?

And I will need to find a word2vecjs too.

I need both the LLM and vector search to work in TiddlyWiki...

So I can only use this tokenizer with https://github.com/Atome-FE/llama-node?

You can use this tokenizer in any setup where you need to transform text into LLaMA tokens. Yes, you can use it with llama-node, but that is not the only way to run LLaMA. For example, I run a LLaMA model with oobabooga's text-generation-webui; I am not using llama-node myself.
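For instance, a common runtime-agnostic use is counting tokens before sending a prompt; a minimal sketch:

```js
import llamaTokenizer from "llama-tokenizer-js";

const prompt = "Hello world!";
const ids = llamaTokenizer.encode(prompt);

console.log(ids.length);                 // token count, for context-window budgeting
console.log(llamaTokenizer.decode(ids)); // round-trips back to "Hello world!"
```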

And I will need to find a word2vecjs too.

If you need vector embeddings, then Word2Vec is one option, but I only named it as an example. If you search for what people are using, you will probably find several different algorithms for producing embeddings. Some of them will be better than others, but more importantly, some will be easier to set up than others. I'm not up to date on the "best" way to create embeddings at the moment.

Most options for embeddings will require a server that can run a Python stack. It might be difficult to find options that run purely inside the browser.
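If you do call a hosted service from a server, the client side is just an HTTP request. A rough sketch against OpenAI's Embeddings API, one of the options named in the sqlite-vss description (assumes the text-embedding-ada-002 model and an API key in the environment; error handling omitted):

```js
// Returns the embedding vector for a piece of text; this vector,
// not an array of tokenIds, is what belongs in a vss0 table.
async function embed(text) {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-ada-002", input: text }),
  });
  const json = await res.json();
  return json.data[0].embedding; // a float array, e.g. 1536 dimensions
}
```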