wagtail / wagtail-ai

Get help with your Wagtail content using AI superpowers.

Home page: https://wagtail-ai.readthedocs.io/latest


Discussion: adding a vector database

tomdyson opened this issue

As discussed with @tomusher on Slack, a future goal for this project could be to support completions or chat-style responses using the context of existing content. Functionality like this typically requires a vector database for efficient similarity searching ('find the five fragments of existing content most similar to the following phrase, to provide context for the completion prompt').
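As a minimal sketch of that retrieval step — brute-force cosine similarity over pre-computed embeddings, with made-up placeholder data — the point of a vector database is to replace this linear scan with an efficient index:

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k_fragments(query_embedding, fragments, k=5):
    """Return the k stored fragments most similar to the query.

    `fragments` is a list of (text, embedding) pairs. A vector
    database does the same job, but with an approximate index
    instead of scoring every stored vector.
    """
    scored = [
        (cosine_similarity(query_embedding, emb), text)
        for text, emb in fragments
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]
```

The returned fragments would then be interpolated into the completion prompt as context.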

How should we approach this? Some options:

  • Use Postgres and pgvector (e.g. this supabase blog post). Pros: uses the database that most people are already using in production; should allow joins across Wagtail data. Cons: could be a hassle to install.
  • Use FAISS. Pros: can be pip-installed; fast; well documented; widely used. Cons: persistence requires pickling; hard to share between horizontally scaled apps; doesn't support joins across Wagtail data.
  • Use ChromaDB. Pros: pip-installable; has a client-server mode; has some (non-SQL) support for filtering. Cons: very new; lots of dependencies; currently lacks Python 3.11 support; needs at least 2GB of memory.
  • Use LangChain as an abstraction layer. Pros: supports many vector stores, including FAISS, ChromaDB, and Elasticsearch. Cons: wouldn't support joins across Wagtail data; creates a complicated decision for developers.

I'm leaning towards the first option, on the grounds that connecting relational Wagtail data with similarity searches could unlock some really interesting functionality, but installation hassles could rule it out (e.g. if it's not possible to use on Heroku, RDS, etc.).

Considering this is a rapidly evolving area, and as you mention there are many conflicting requirements for a potential deployment, I'd lean towards leveraging existing abstractions as much as possible so that we can swap out or support multiple options. Even if we don't integrate and document every possible option, we could at least focus on having:

  • A simple option that's easily deployable on a PaaS
  • A scalable, client-server option

LangChain and DocArray seem to be the main candidates for this; the latter (while in a state of development flux) supports Redis and, in a pinch, SQLite.
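To illustrate what "swap out/support multiple options" could mean in practice — this is a hypothetical interface, not an actual LangChain or DocArray API — each backend would implement the same small contract, with the in-memory version serving as the simple, PaaS-friendly option:

```python
from abc import ABC, abstractmethod
from math import sqrt


class VectorBackend(ABC):
    """Hypothetical interface each storage option would implement."""

    @abstractmethod
    def add(self, doc_id, embedding):
        """Store an embedding under a document id."""

    @abstractmethod
    def search(self, embedding, k):
        """Return the ids of the k most similar stored embeddings."""


class InMemoryBackend(VectorBackend):
    """The simple, easily deployable option: a plain linear scan."""

    def __init__(self):
        self._vectors = {}

    def add(self, doc_id, embedding):
        self._vectors[doc_id] = embedding

    def search(self, embedding, k):
        def score(item):
            # Cosine similarity between the query and a stored vector.
            other = item[1]
            dot = sum(x * y for x, y in zip(embedding, other))
            norm_q = sqrt(sum(x * x for x in embedding))
            norm_o = sqrt(sum(x * x for x in other))
            return dot / (norm_q * norm_o)

        ranked = sorted(self._vectors.items(), key=score, reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]
```

A FAISS- or Redis-backed class with the same two methods could then be selected via a setting, without touching calling code.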

I'm interested in hearing more about why we can't do joins across Wagtail data with anything other than pgvector. As I understand it, whatever we use we can construct indices across any set of data, regardless of the structure of the models in Wagtail. Is it tools for joining across indices that you'd be considering in pgvector?

Thanks, I hadn't seen DocArray. Using Redis as a vector store is very interesting, especially if vector similarity search is supported by hosted Redis services. I just found this helpful thread.

I'm interested in hearing more about why we can't do joins across Wagtail data with anything other than pgvector.

I didn't express this clearly, sorry. I meant that it would be useful to be able to build prompts based on queries like "return all pages of type BlogPost published in the last 6 months, which are similar to the following phrase". With pgvector we could do this in a single query. With separate vector stores like FAISS we'd have to return IDs of similar documents first then filter them through the Wagtail database.
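To make the contrast concrete, here is a sketch of what that single pgvector query might look like (the table and column names are illustrative, not actual Wagtail schema; `<->` is pgvector's distance operator, and the phrase's embedding would be bound as a query parameter):

```python
def blog_context_query(limit=5):
    """Build an illustrative SQL statement combining relational
    filters with a pgvector similarity ordering in one query.

    Assumes a hypothetical `blog_blogpost` table with an
    `embedding` vector column; `%(phrase_embedding)s` would be
    bound to the embedding of the query phrase at execution time.
    """
    return (
        "SELECT title, body "
        "FROM blog_blogpost "
        "WHERE live "
        "AND first_published_at > now() - interval '6 months' "
        "ORDER BY embedding <-> %(phrase_embedding)s "
        f"LIMIT {limit}"
    )
```

With a separate vector store like FAISS, the same result needs two round trips: fetch candidate IDs from the index, then filter them through a Wagtail/Django queryset.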

A note: pgvector is currently supported by a few of the major Postgres providers, though not AWS / GCP / Azure.

Note: LangChain added pgvector support yesterday (for the JS/TS version of LangChain; hopefully the Python version will follow soon). An abstraction which supports Postgres/pgvector as well as FAISS etc. feels like the best-case scenario.

LangChain just added RediSearch and pgvector support to the Python version.

Closing as this is now a feature of wagtail-vector-index, thanks for the suggestion @tomdyson!