langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications

Home Page: https://python.langchain.com

Chroma Vector Store Uses "L2" as the Similarity Measure Rather than Cosine

DragonMengLong opened this issue

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

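# Import path assumed from the langchain-chroma package listed under System Info.
from langchain_chroma import Chroma

# `persist_dir`, `embeddings`, and `user_query` are defined elsewhere.
vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings,
)
docs_and_scores = vectorstore.similarity_search_with_score(query=user_query)
for doc, score in docs_and_scores:
    print(score)  # a distance: lower means more similar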

Error Message and Stack Trace (if applicable)

No response

Description

The LangChain documentation says that Chroma uses cosine distance by default, but I found that it actually uses L2 distance. Stepping through the Chroma DB code in a debugger shows that the default distance_fn is l2.
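To make the difference concrete, here is a minimal pure-Python sketch of the two measures. It does not call Chroma. The helper names are hypothetical, and it assumes that Chroma's "l2" space is the squared L2 distance and that its "cosine" space is 1 minus the cosine similarity, which is my reading of the Chroma documentation.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance (assumed to match Chroma's "l2" space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity (assumed to match Chroma's "cosine" space)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


query = [3.0, 4.0]
doc = [6.0, 8.0]  # same direction as the query, twice the length

print(squared_l2(query, doc))       # 25.0
print(cosine_distance(query, doc))  # 0.0
```

The two vectors point the same way, so the cosine distance is 0, while the squared L2 distance is 25 because it is sensitive to magnitude. For unit-length embeddings the two are related by squared L2 = 2 × cosine distance, so rankings agree there even though the printed scores differ.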

System Info

langchain==0.1.17
langchain-chroma==0.1.0
langchain-community==0.0.37
langchain-core==0.1.52
langchain-text-splitters==0.0.1
chroma-hnswlib==0.7.3
chromadb==0.4.24

The distance function is determined by the collection metadata. If the collection already exists (for example, when it is loaded from disk), its metadata is the same as when it was saved. So to use cosine distance, you need to specify the metadata when the collection is created, like this:

db = Chroma.from_documents(
    documents,
    embeddings,
    persist_directory=persist_dir,
    collection_metadata={"hnsw:space": "cosine"},
)
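Because an existing collection keeps the metadata it was created with, it can help to check which space a metadata dict implies before trusting the scores. The helper below is a hypothetical sketch, not part of LangChain or Chroma. It assumes the "hnsw:space" key shown above, the l2 default described in this issue, and "l2", "ip", and "cosine" as the accepted values.

```python
VALID_SPACES = {"l2", "ip", "cosine"}  # assumed set of accepted "hnsw:space" values


def effective_space(collection_metadata):
    """Return the distance space implied by a collection's metadata.

    Falls back to "l2" when the key is missing, mirroring the default
    behavior described in this issue.
    """
    space = (collection_metadata or {}).get("hnsw:space", "l2")
    if space not in VALID_SPACES:
        raise ValueError(f"unexpected hnsw:space value: {space!r}")
    return space


print(effective_space(None))                        # l2
print(effective_space({"hnsw:space": "cosine"}))    # cosine
```

In practice the dict would come from the persisted collection itself (for example, the metadata attribute of the collection object that chromadb returns). If it reports l2 for a collection that should use cosine, the collection most likely needs to be recreated with the metadata above.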