Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.

Home Page: https://llamafile.ai

Unexpected output from server.cpp `/embedding` endpoint

k8si opened this issue

What is the issue?

The embeddings produced by a model running in llamafile seem to be substantially different from those produced by llama.cpp.

llama.cpp embeddings are very close (~0.99 cosine similarity) to those produced by the same model via HuggingFace (which I'm treating as the 'reference embeddings'). On the other hand, llamafile embeddings only get ~0.6 cosine similarity to the HuggingFace embeddings. I tested this across multiple llamafile versions (see results below).

Tested with:

  • llamafile versions v0.7.1 through v0.8.1
  • llama.cpp commit: 6ecf3189
  • MacBook Pro with Apple M2 Pro (32 GB)
  • macOS 14.2.1
  • Only tested with one model: all-MiniLM-L6-v2 (BERT architecture)

How to replicate the issue

I put all the scripts/information to replicate this issue in this repo: https://github.com/k8si/replicate-llamafile-embeddings-issue

The short version:

To inspect the differences between embeddings produced by different backends, I embed the text "Alice has had it with computers." with the same(-ish) model running in HF, llama.cpp, and llamafile:

  1. HuggingFace - used sentence-transformers/all-MiniLM-L6-v2 pytorch weights directly
  2. llamafile - used full F32 GGUF version of the model from leliuga/all-MiniLM-L6-v2-GGUF
  3. llama.cpp - used full F32 GGUF version of the model from leliuga/all-MiniLM-L6-v2-GGUF

I use the F32 GGUF to remove any quantization effects and stay as equivalent to the HuggingFace reference model as possible.
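For concreteness, here is a rough sketch of how the three embeddings are produced. The exact scripts are in the repo linked above; the port numbers, server flags, and response shape below are assumptions for illustration, not copied from those scripts:

import requests
from sentence_transformers import SentenceTransformer

TEXT = "Alice has had it with computers."

# 1. HuggingFace reference embedding, straight from the pytorch weights.
hf_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb_hf = hf_model.encode(TEXT)

# 2./3. llamafile and llama.cpp both expose a llama.cpp-style HTTP server.
# Assumes each was started with the F32 GGUF and embeddings enabled, and that
# the JSON response carries the vector under an "embedding" key.
def server_embedding(base_url, text):
    resp = requests.post(f"{base_url}/embedding", json={"content": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

emb_llamafile = server_embedding("http://localhost:8080", TEXT)  # llamafile server
emb_llamacpp = server_embedding("http://localhost:8081", TEXT)   # llama.cpp server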

Then I compute the cosine similarity between the HF and llamafile embeddings and compare it to the cosine similarity between the HF and llama.cpp embeddings. I would expect the two scores to be essentially the same, but they are not, as the results below show.
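The comparison itself is plain cosine similarity over the raw embedding vectors; a minimal sketch, continuing from the variables in the sketch above (the linked repo has the actual script):

import numpy as np

def cosine_sim(a, b):
    # Works for both numpy arrays (HF) and plain lists (server responses).
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cosine-sim(emb_hf, emb_llamafile) = %f" % cosine_sim(emb_hf, emb_llamafile))
print("cosine-sim(emb_hf, emb_llamacpp) = %f" % cosine_sim(emb_hf, emb_llamacpp))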

Results

Results across the last 6 llamafile releases (v0.7.1 to v0.8.1):

$ cat results/results-* | grep -A 2 "RESULTS"

RESULTS (llamafile v0.7.1):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.2):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.3):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.4):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.8):
cosine-sim(emb_hf, emb_llamafile) = 0.605049
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.8.1):
cosine-sim(emb_hf, emb_llamafile) = 0.605049
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--

The test cannot be run on releases prior to v0.7.1, since BERT was not supported before that release and all-MiniLM-L6-v2 uses the BERT architecture.

It turns out the upstream project inserts CLS and SEP tokens around the input before passing it to llama_decode(). I've identified the key line in the server code that needs to change to make our embedding output consistent with llama.cpp in this case. With the change I'm about to push, the cosine similarity against llama.cpp will be 0.9999+.
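To illustrate why those special tokens matter, the snippet below uses the HuggingFace tokenizer purely for demonstration; it is not the server.cpp change itself:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text = "Alice has had it with computers."

with_special = tok(text)["input_ids"]                               # [CLS] ... [SEP]
without_special = tok(text, add_special_tokens=False)["input_ids"]  # no special tokens

print(tok.convert_ids_to_tokens(with_special))     # ['[CLS]', 'alice', ..., '[SEP]']
print(tok.convert_ids_to_tokens(without_special))  # ['alice', ...]

# A BERT-style embedding model is trained with [CLS]/[SEP] present, so pooling
# over a sequence that lacks them shifts the resulting embedding, which is
# consistent with the ~0.6 cosine similarity reported above.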

Please note we're no longer importing upstream changes on the server. The upstream implementation has diverged significantly since they removed LLaVA support. You will likely encounter other differences in behavior. If you do, feel free to file another issue and I'll pinpoint what needs to change.