asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!

Support memory-mapped on-disk Indices

asg017 opened this issue

The underlying Faiss indices are stored in SQLite shadow tables, which can't be memory-mapped with Faiss's IO_FLAG_MMAP flag.

One solution: introduce a new option to store a vss0 column's index on disk, allowing memory-mapped, larger-than-memory indices.

create virtual table articles using vss0(
  headline_embedding(1024) factory="..." on_disk=True,
  description_embedding(1024) factory="..." on_disk=True,
);

Then, your directory would look like:

$ tree .
.
├── my_data.db
├── my_data.db.vss0.articles.description_embedding.faissindex
└── my_data.db.vss0.articles.headline_embedding.faissindex

sqlite3_db_filename() would be useful here.
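
As a rough illustration of the naming scheme above, here is a minimal Python sketch. It uses `PRAGMA database_list` as a stand-in for the C-level sqlite3_db_filename() call, and the helper names `db_filename` and `index_path` are hypothetical, not part of sqlite-vss.

```python
import os
import sqlite3
import tempfile


def db_filename(conn: sqlite3.Connection, schema: str = "main") -> str:
    """Return the file backing `schema`, or "" for in-memory/temp databases."""
    for _seq, name, path in conn.execute("PRAGMA database_list"):
        if name == schema:
            return path or ""
    raise ValueError(f"unknown schema: {schema}")


def index_path(conn: sqlite3.Connection, table: str, column: str) -> str:
    """Build the proposed sidecar file name for one vss0 column (hypothetical helper)."""
    base = db_filename(conn)
    if not base:
        # An in-memory database has no file to place the index next to.
        raise ValueError("on-disk indices need a file-backed database")
    return f"{base}.vss0.{table}.{column}.faissindex"


workdir = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(workdir, "my_data.db"))

paths = [
    index_path(conn, "articles", column)
    for column in ("headline_embedding", "description_embedding")
]
for p in paths:
    print(os.path.basename(p))
```

The check for an empty filename matters because an in-memory database would otherwise produce a sidecar name with no directory or base name to anchor it.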

One problem: it's kind of nice to have all Faiss indices stored in one file, the SQLite database, and this config option would mean users have to move around multiple files instead of a single SQLite file. But since this is an "optimization" feature that's not enabled by default, I think it'll be OK.

I suppose that each new insertion into an indexed table makes the engine update the whole index BLOB, so database writes are done twice, which makes it slow.

And if the index files are not present in the folder, the code can recreate them from the content... (is it stored in 2 places?)

In this proposal, for memory-mapped on-disk indexes, it won't be stored twice. By default, the Faiss index is stored inside a "shadow table" in your SQLite DB, but this option would instead store it on disk as a separate file. It'll still work the same from a user's perspective (i.e. same SELECT and INSERT statements), but under the hood the storage of the actual index would be different.

Right now the "shadow table" indexes are slow because we rewrite the entire index at the end of every transaction that INSERTs into or DELETEs from a vss0 table. That involves exporting the index to an in-memory buffer, then rewriting the shadow table with the new contents, which isn't great. But if the Faiss index were its own memory-mapped file, updates wouldn't be as drastic.
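
To make the contrast concrete, here is a small Python sketch of an in-place update through a memory-mapped file. It is only an analogy: the flat float32 file layout is an assumption made for illustration and is not the Faiss index format, and the sketch does not show whether Faiss itself can update a mapped index in place.

```python
import mmap
import os
import struct
import tempfile

DIMS = 4
RECORD = struct.Struct(f"<{DIMS}f")  # one little-endian float32 vector
COUNT = 1000

path = os.path.join(tempfile.mkdtemp(), "demo.vectors")

# Pre-size a file holding COUNT zero vectors.
with open(path, "wb") as f:
    f.write(b"\x00" * RECORD.size * COUNT)

# Map the file and overwrite vector 42 in place. Only the bytes of that
# record are written; the rest of the file is never serialized or copied.
with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
    RECORD.pack_into(mm, 42 * RECORD.size, 1.0, 2.0, 3.0, 4.0)
    mm.flush()

# Read the record back through ordinary file I/O to confirm it reached disk.
with open(path, "rb") as f:
    f.seek(42 * RECORD.size)
    stored = RECORD.unpack(f.read(RECORD.size))

file_size = os.path.getsize(path)
print(stored, file_size)
```

By comparison, the shadow-table approach described above has to export the whole index and rewrite the BLOB, so its cost grows with the index size rather than with the size of the change.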

Thinking about this more: instead of an on_disk= argument, I think we should change it to storage_type=faiss_ondisk. The default would be storage_type=faiss_shadow.

This is so we can easily support future storage backends like #30
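
A minimal Python sketch of how a column definition with a storage_type option could be parsed, defaulting to faiss_shadow. The real parser is C++ inside sqlite-vss; the `ColumnDef` class, the `parse_column` function, and the factory string used in the usage example are all hypothetical and only illustrate the idea.

```python
import re
from dataclasses import dataclass, field

# name(dimensions) followed by optional key=value pairs.
COLUMN_RE = re.compile(r"^\s*(?P<name>\w+)\s*\(\s*(?P<dims>\d+)\s*\)(?P<rest>.*)$", re.S)
OPTION_RE = re.compile(r'(\w+)\s*=\s*("[^"]*"|\S+)')

# The two backends named in this thread; more could be added later.
STORAGE_TYPES = {"faiss_shadow", "faiss_ondisk"}


@dataclass
class ColumnDef:
    name: str
    dimensions: int
    storage_type: str = "faiss_shadow"
    options: dict = field(default_factory=dict)


def parse_column(text: str) -> ColumnDef:
    """Parse one vss0-style column definition (illustrative, not the real parser)."""
    m = COLUMN_RE.match(text)
    if m is None:
        raise ValueError(f"not a column definition: {text!r}")
    options = {key: value.strip('"') for key, value in OPTION_RE.findall(m["rest"])}
    storage_type = options.pop("storage_type", "faiss_shadow")
    if storage_type not in STORAGE_TYPES:
        raise ValueError(f"unknown storage_type: {storage_type}")
    return ColumnDef(m["name"], int(m["dims"]), storage_type, options)


on_disk = parse_column("a(2) storage_type=faiss_ondisk")
default = parse_column('headline_embedding(1024) factory="IVF4096,Flat"')
print(on_disk)
print(default)
```

Keeping the set of allowed values in one place is what makes a string-valued storage_type easier to extend than a boolean on_disk flag.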

@dleviminzi ok, I applied the new vss0 constructor parser to the main branch. You should be able to add a mmap=True flag inside parse_vss0_column_definition(), let me know if you run into any trouble with that.

I also changed a bit of the storage_type=faiss_ondisk logic. I'll also probably remove the schema from the generated file name, so for the following vss0 table on a database called my_database.db:

create virtual table vss_demo using vss0(
  a(2) storage_type=faiss_ondisk
)

It currently saves vectors to the file:

my_database.db.main.vss_demo.a.faiss_index

But once I remove the schema from the file name, it'll save to:

my_database.db.vss_demo.a.faiss_index

Mostly because I don't think the schema is required in the filename. In fact, I think it'll actually break SQL that queries vss0 tables in ATTACHed databases, since it won't know to look for the .main. file.
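
The ATTACH concern can be reproduced with a short Python sketch. It assumes the file name is built from the schema name as the current connection sees it; the helpers `with_schema` and `without_schema`, the `app.db` file, and the `archive` alias are hypothetical.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
main_path = os.path.join(workdir, "app.db")
other_path = os.path.join(workdir, "my_database.db")

# Open one database, then attach my_database.db under the alias "archive".
conn = sqlite3.connect(main_path)
conn.execute("ATTACH DATABASE ? AS archive", (other_path,))

# Map each schema name to the base name of the file backing it.
schemas = {
    name: os.path.basename(path)
    for _seq, name, path in conn.execute("PRAGMA database_list")
}


def with_schema(db_file: str, schema: str, table: str, column: str) -> str:
    return f"{db_file}.{schema}.{table}.{column}.faiss_index"


def without_schema(db_file: str, table: str, column: str) -> str:
    return f"{db_file}.{table}.{column}.faiss_index"


# Name written when my_database.db was opened directly (its schema was "main").
written = with_schema("my_database.db", "main", "vss_demo", "a")
# Name computed when the same file is reached through the "archive" alias.
looked_up = with_schema(schemas["archive"], "archive", "vss_demo", "a")
# The schema-less name does not depend on how the database was opened.
stable = without_schema(schemas["archive"], "vss_demo", "a")

print(written)
print(looked_up)
print(stable)
```

Under that assumption the two schema-qualified names disagree, while the schema-less name stays the same whether the file is opened directly or attached.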

You should be able to add a mmap=True flag inside parse_vss0_column_definition(), let me know if you run into any trouble with that.

I'll look through the changes and give it a go today.

I'll also probably remove the schema from the generated file name,

Yeah that makes sense.