philippgille / chromem-go

Embeddable vector database for Go with Chroma-like interface and zero third-party dependencies. In-memory with optional persistence.

chromem-go

Embeddable in-memory vector database for Go with Chroma-like interface and zero third-party dependencies.

It's not a library to connect to ChromaDB. It's an in-memory database on its own.

Being embeddable enables you to add retrieval augmented generation (RAG) and similar embeddings-based features to your Go app without having to run a separate database, much like using SQLite instead of PostgreSQL/MySQL/etc.

The focus is not scale or number of features, but simplicity.

Contents

  1. Use cases
  2. Interface
  3. Features
  4. Usage
  5. Motivation
  6. Related projects

Use cases

With a vector database you can do various things:

  • Retrieval augmented generation (RAG), question answering (Q&A)
  • Text and code search
  • Recommendation systems
  • Classification
  • Clustering

Let's look at the RAG use case in more detail:

RAG

The knowledge of large language models (LLMs), even those with 30 billion, 70 billion or more parameters, is limited. They don't know anything about what happened after their training ended, they don't know anything about data they were not trained on (like your company's intranet, Jira / bug tracker, wiki or other kinds of knowledge bases), and even the data they do know they often can't reproduce exactly, but start to hallucinate instead.

Fine-tuning an LLM can help a bit, but it's meant more to improve the LLM's reasoning about specific topics, or to reproduce the style of written text or code. Fine-tuning does not add knowledge 1:1 into the model: details are lost or mixed up. And the knowledge cutoff (for anything that happened after the fine-tuning) isn't solved either.

=> A vector database can act as the up-to-date, precise knowledge for LLMs:

  1. You store relevant documents that you want the LLM to know in the database.
  2. The database stores embeddings alongside the documents. You can provide the embeddings yourself, or have them created by an "embedding model" like OpenAI's text-embedding-3-small.
    • chromem-go can do this for you and supports multiple embedding providers and models out-of-the-box.
  3. Later, when you want to talk to the LLM, you first send the question to the vector DB to find similar/related content. This is called "nearest neighbor search".
  4. In the question to the LLM, you provide this content alongside your question.
  5. The LLM can take this up-to-date precise content into account when answering.
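
Steps 3 and 4 above can be sketched in plain Go. This is a minimal illustration of exact nearest neighbor search with cosine similarity, followed by prompt assembly. It is not chromem-go's implementation: the `doc` type and the `cosineSimilarity`, `nearest` and `buildPrompt` functions are hypothetical names, and the two-dimensional embeddings are made up for readability.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// doc is a hypothetical stored document with a precomputed embedding.
type doc struct {
	ID        string
	Content   string
	Embedding []float32
}

// cosineSimilarity returns the cosine of the angle between a and b.
// It returns 0 if the lengths differ or either vector has zero norm.
func cosineSimilarity(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// nearest returns up to n documents, most similar to query first (step 3).
// It compares the query against every document, so the result is exact.
func nearest(docs []doc, query []float32, n int) []doc {
	ranked := make([]doc, len(docs))
	copy(ranked, docs)
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosineSimilarity(ranked[i].Embedding, query) >
			cosineSimilarity(ranked[j].Embedding, query)
	})
	if n < 0 {
		n = 0
	}
	if n > len(ranked) {
		n = len(ranked)
	}
	return ranked[:n]
}

// buildPrompt places the retrieved content alongside the question (step 4).
func buildPrompt(question string, retrieved []doc) string {
	var sb strings.Builder
	sb.WriteString("Answer the question using only this context:\n")
	for _, d := range retrieved {
		sb.WriteString("- " + d.Content + "\n")
	}
	sb.WriteString("Question: " + question)
	return sb.String()
}

func main() {
	docs := []doc{
		{ID: "doc1", Content: "The sky is blue.", Embedding: []float32{1, 0}},
		{ID: "doc2", Content: "Grass is green.", Embedding: []float32{0, 1}},
		{ID: "doc3", Content: "The ocean looks blue.", Embedding: []float32{0.9, 0.1}},
	}
	// A real query embedding would come from the same model as the documents.
	query := []float32{1, 0.05}
	fmt.Println(buildPrompt("What color is the sky?", nearest(docs, query, 2)))
}
```

The sketch re-computes similarities inside the sort comparison to stay short; for larger collections you would compute each similarity once and sort the scores.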

Check out the example code to see it in action!

Interface

For the full interface see https://pkg.go.dev/github.com/philippgille/chromem-go.

Our inspiration was the Chroma interface, whose core API is the following (taken from their README):

import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()

# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on these!
    ids=["doc1", "doc2"], # unique for each doc
)

# Query/search 2 most similar results. You can also .get by id
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)

Our Go library exposes the same interface:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/philippgille/chromem-go"
)

func main() {
    ctx := context.Background()

    // Set up chromem-go in-memory, for easy prototyping. Persistence will be added in the future.
    // We call it DB instead of client because there's no client-server separation. The DB is embedded.
    db := chromem.NewDB()

    // Create collection. GetCollection, GetOrCreateCollection, DeleteCollection also available!
    collection := db.CreateCollection("all-my-documents", nil, nil)

    // Add docs to the collection. Update and delete will be added in the future.
    // Row-based API will be added when Chroma adds it!
    err := collection.Add(ctx,
        []string{"doc1", "doc2"}, // unique ID for each doc
        nil, // We handle embedding automatically. You can skip that and add your own embeddings as well.
        []map[string]string{{"source": "notion"}, {"source": "google-docs"}}, // Filter on these!
        []string{"This is document1", "This is document2"},
    )
    if err != nil {
        log.Fatal(err)
    }

    // Query/search 2 most similar results. Getting by ID will be added in the future.
    results, err := collection.Query(ctx,
        "This is a query document",
        2,
        map[string]string{"metadata_field": "is_equal_to_this"}, // optional filter
        map[string]string{"$contains": "search_string"},         // optional filter
    )
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(results)
}

chromem-go started with just these methods, and we have added more over time. We intentionally don't aim to cover 100% of Chroma's API surface, though.
Instead, we will add some alternative methods that are more Go-idiomatic.

See the Godoc for details: https://pkg.go.dev/github.com/philippgille/chromem-go

Features

  • Zero dependencies on third party libraries
  • Embeddable (like SQLite, i.e. no client-server model, no separate DB to maintain)
  • Multi-threaded processing (when adding and querying documents), making use of Go's native concurrency features
  • Embedding creators:
    • OpenAI (default)
    • Mistral
    • Jina
    • mixedbread.ai
    • LocalAI
    • Bring your own
    • You can also pass existing embeddings when adding documents to a collection.
    • ollama
      • (As of 2024-02-10 their OpenAI compatible API doesn't support embeddings yet, but they have a custom API which does)
  • Similarity search:
    • Exact nearest neighbor search using cosine similarity
    • Approximate nearest neighbor search with index
      • Hierarchical Navigable Small World (HNSW)
      • Inverted file flat (IVFFlat)
  • Filters:
    • Document filters: $contains, $not_contains
    • Metadata filters: Exact matches
    • Operators ($and, $or etc.)
  • Storage:
    • In-memory
    • Persistent (file)
    • Persistent (others (S3, PostgreSQL, ...))
  • Data types:
    • Documents (text)
    • Images
    • Videos
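
To make the filter semantics listed above concrete, here is a minimal sketch of how exact-match metadata filters and the `$contains` / `$not_contains` document filters could be evaluated. It is an illustration under assumptions, not chromem-go's code: the `document` type and the `matches` and `filter` functions are hypothetical, and rejecting unknown operators is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// document is a hypothetical stored document with metadata.
type document struct {
	ID       string
	Metadata map[string]string
	Content  string
}

// matches reports whether d passes both filters.
// A nil or empty filter map matches every document.
func matches(d document, where, whereDocument map[string]string) bool {
	// Metadata filter: every key must be present with exactly the given value.
	for key, want := range where {
		if got, ok := d.Metadata[key]; !ok || got != want {
			return false
		}
	}
	// Document filter: substring checks on the document content.
	for op, s := range whereDocument {
		switch op {
		case "$contains":
			if !strings.Contains(d.Content, s) {
				return false
			}
		case "$not_contains":
			if strings.Contains(d.Content, s) {
				return false
			}
		default:
			// Unknown operator: this sketch rejects the document.
			return false
		}
	}
	return true
}

// filter returns the documents that pass both filters, in their original order.
func filter(docs []document, where, whereDocument map[string]string) []document {
	var out []document
	for _, d := range docs {
		if matches(d, where, whereDocument) {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	docs := []document{
		{ID: "doc1", Metadata: map[string]string{"source": "notion"}, Content: "This is document1"},
		{ID: "doc2", Metadata: map[string]string{"source": "google-docs"}, Content: "This is document2"},
	}
	for _, d := range filter(docs, map[string]string{"source": "notion"}, nil) {
		fmt.Println(d.ID)
	}
}
```

Filtering before the similarity search keeps the candidate set small, so only documents that pass the filters need to be compared against the query embedding.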

Usage

For a full, working example that uses the vector database for retrieval augmented generation (RAG), see example/main.go.

Motivation

In December 2023, when I wanted to play around with retrieval augmented generation (RAG) in a Go program, I looked for a vector database that could be embedded in the Go program, just like you would embed SQLite in order to not require any separate DB setup and maintenance. I was surprised when I didn't find any, given the abundance of embedded key-value stores in the Go ecosystem.

At the time most of the popular vector databases like Pinecone, Qdrant, Milvus, Weaviate and others were not embeddable at all. ChromaDB was, but only in Python.

Then I found @eliben's blog post and example code which showed that with very little Go code you could create a very basic PoC of a vector database.

That's when I decided to build my own vector database, embeddable in Go, inspired by the ChromaDB interface. ChromaDB stood out for being embeddable (in Python), and by showing its core API in 4 commands on their README and on the landing page of their website.

Related projects

  • Shoutout to @eliben whose blog post and example code inspired me to start this project!
  • Chroma: Looking at Pinecone, Qdrant, Milvus, Weaviate and others, Chroma stood out by showing its core API in 4 commands on their README and on the landing page of their website. It was also the only one which could be embedded (in Python).
  • The big, full-fledged client-server-based vector databases for maximum scale and performance:
    • Pinecone: Closed source
    • Qdrant: Written in Rust
    • Milvus: Written in Go, but not embeddable as of December 2023
    • Weaviate: Written in Go, but not embeddable as of December 2023

License:GNU Affero General Public License v3.0


Languages

Language:Go 100.0%