Chroma VectorBase Uses "L2" as Similarity Measure Rather than Cosine
DragonMengLong opened this issue
Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
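from langchain_chroma import Chroma  # langchain-chroma is listed under System Info

# `embeddings`, `persist_dir`, and `user_query` are defined elsewhere.
vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings,
)
docs_and_scores = vectorstore.similarity_search_with_score(query=user_query)
for doc, score in docs_and_scores:
    print(score)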
Error Message and Stack Trace (if applicable)
No response
Description
The LangChain documentation says that Chroma uses cosine distance by default, but I found that it actually uses L2 distance. Stepping through the Chroma DB code in a debugger shows that the default distance_fn is l2.
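To make the difference in the printed scores concrete, here is a small pure-Python sketch comparing the two measures. It assumes that Chroma's "l2" space reports squared Euclidean distance and that its "cosine" space reports 1 minus cosine similarity, which is how I read the Chroma documentation; the helper names and the sample vectors are made up for illustration and are not part of either library.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance (assumed to match Chroma's "l2" space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity (assumed to match Chroma's "cosine" space)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Hypothetical embeddings, normalized to unit length.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

l2_score = squared_l2(a, b)
cos_score = cosine_distance(a, b)
print(l2_score, cos_score)

# For unit-length vectors: ||a - b||^2 = 2 - 2 * (a . b) = 2 * cosine distance.
assert math.isclose(l2_score, 2 * cos_score, rel_tol=1e-9, abs_tol=1e-12)
```

So if the embedding model returns unit-length vectors, the ranking of results should be the same under both measures and only the score scale differs. For unnormalized embeddings the two measures can rank documents differently.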
System Info
langchain==0.1.17
langchain-chroma==0.1.0
langchain-community==0.0.37
langchain-core==0.1.52
langchain-text-splitters==0.0.1
chroma-hnswlib==0.7.3
chromadb==0.4.24
The distance function is determined by the collection metadata. If the collection already exists (for example, when loading from disk), its metadata is the same as it was when the collection was saved to disk. So to use cosine distance, we need to specify the metadata when creating the collection, like this:
db = Chroma.from_documents(documents, embeddings, persist_directory=persist_dir, collection_metadata={"hnsw:space": "cosine"})
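To check which space an already persisted collection was created with, something like the following sketch might help. It assumes chromadb 0.4.x (where `chromadb.PersistentClient`, `get_collection`, and the collection's `metadata` attribute are available), that the collection uses LangChain's default name "langchain", and that a missing "hnsw:space" key means the l2 default described above. Both function names are hypothetical helpers, not part of either library.

```python
def effective_space(metadata):
    """Return the distance space implied by a collection's metadata.

    Assumption: when "hnsw:space" is not set, Chroma falls back to "l2".
    """
    if not metadata:
        return "l2"
    return metadata.get("hnsw:space", "l2")

def space_of_persisted_collection(persist_dir, collection_name="langchain"):
    """Look up the space of a collection stored on disk (needs chromadb installed)."""
    import chromadb  # imported lazily so effective_space works without chromadb

    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_collection(collection_name)
    return effective_space(collection.metadata)

# Pure-Python checks of the fallback logic:
print(effective_space(None))                       # l2
print(effective_space({"hnsw:space": "cosine"}))   # cosine
```

If this reports l2 for an existing collection, passing collection_metadata when loading it will most likely not change anything, since the metadata stored on disk is what gets used; the collection probably has to be rebuilt with {"hnsw:space": "cosine"}.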