surrealdb / surrealdb

A scalable, distributed, collaborative, document-graph database, for the realtime web

Home Page: https://surrealdb.com

Bug: Vector indexing becomes exponentially more time consuming

dustyatx opened this issue · comments

Describe the bug

Having set up a table with an MTREE index on my embeddings field, I find that as the record count grows, it takes much longer to index each record.

At the start of my load I was getting about 166 records per second; now, 20 hours later, I'm getting about 5 records per second, roughly 3% of the indexing speed when the job first started.

Steps to reproduce

I used Nomic's embedding model to create 64-dimension vectors. I'm indexing them with the Python module, using UPDATE.

My table only has the embeddings field defined.

DEFINE FIELD embeddings ON TABLE concept TYPE array;
DEFINE INDEX embeddingsIndex ON TABLE concept FIELDS embeddings MTREE DIMENSION 64;
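
The load loop looks roughly like this (a simplified sketch using the async surrealdb Python client; record IDs, credentials, and namespace names are placeholders):

from surrealdb import Surreal

async def load(vectors):
    # Connect to the local server and select a namespace/database.
    async with Surreal("ws://localhost:8000/rpc") as db:
        await db.signin({"user": "dbadmin", "pass": "..."})
        await db.use("mygraph", "mygraph")
        for i, vec in enumerate(vectors):
            # One UPDATE per record; the MTREE index entry is written
            # synchronously in the same transaction.
            await db.update(f"concept:{i}", {"embeddings": vec})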

Expected behaviour

I understand that indexing large numbers of vectors is a hard problem, but other databases seem to have solved it to some degree; I've indexed large numbers of documents into other DBs without this slowdown.

At the moment I am at around 630,000 records. I have 1.2M for this run and another 2M in my data store waiting to be processed; after that, the data will grow at a somewhat predictable daily rate. I'll need to keep indexing new records every day if I am going to take SurrealDB to production.

SurrealDB version

Running 1.4.2 for linux on x86_64

Contact Details

dustin@edgestep.com

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

@dustyatx Are you using surrealdb on disk or in memory?
If it is in memory, can you check if the server starts using virtual memory? That could also explain why it is slowing down.
Alternatively, for the initial setup, you can ingest the records first without any index, and create the index once all the embeddings are in the table. That should be much faster.
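
As a sketch from the Python client (assuming an already-connected and authenticated db handle; adapt to whichever client you are using):

async def bulk_load(db, vectors):
    # Phase 1: plain writes, with no index defined yet.
    for i, vec in enumerate(vectors):
        await db.update(f"concept:{i}", {"embeddings": vec})
    # Phase 2: define the index once, over the already-ingested data.
    await db.query(
        "DEFINE INDEX embeddingsIndex ON TABLE concept "
        "FIELDS embeddings MTREE DIMENSION 64;"
    )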

@emmanuel-keller I am running it in a Docker container and I'm not passing in any parameters that would change the storage; I believe it's using RocksDB. Here is my docker run command.

docker run --rm --pull always -p 8000:8000 -v /home/dusty/data/surrealdb/:/mydata surrealdb/surrealdb:latest start --auth --user dbadmin --pass TEMP_Password_v2 file:/mydata/mygraph.db

I tried creating the index after the data was loaded, but I didn't get the sense that it was any faster. The logging didn't give me much insight into what was happening while the index was being generated, and it seemed to hang after 24 hours: CPU usage was at 100% on two of my cores, but disk writes were very rare. I ended up stopping the job and restarting with UPDATE in order to get better visibility into what was going on; that's when I saw records getting incrementally slower.

The workstation I'm using is powerful: an Intel Core i9 with 24 cores and 32 threads, 128 GB of RAM, and a 3 TB NVMe drive, and SurrealDB barely uses a fraction of those resources.

Here are my print statements at 10,000-record increments. As you can see, initially I was ingesting 10k records in about 40 seconds; now it takes about 31 minutes.

2024-04-24 13:57:30.273261 - 10000
2024-04-24 13:58:09.078238 - 20000
2024-04-24 13:59:07.639177 - 30000
2024-04-24 14:00:43.998013 - 40000
2024-04-24 14:02:45.179446 - 50000
2024-04-24 14:05:14.992048 - 60000
2024-04-24 14:08:34.039337 - 70000
2024-04-24 14:12:17.087255 - 80000
2024-04-24 14:16:33.725646 - 90000
2024-04-24 14:21:20.126168 - 100000
2024-04-24 14:26:46.882572 - 110000
2024-04-24 14:32:23.712622 - 120000
2024-04-24 14:38:22.135483 - 130000
2024-04-24 14:45:32.485350 - 140000
2024-04-24 14:52:37.347228 - 150000
2024-04-24 15:00:09.342006 - 160000
2024-04-24 15:08:48.239127 - 170000
2024-04-24 15:17:22.392656 - 180000
2024-04-24 15:26:03.096367 - 190000
2024-04-24 15:36:00.042586 - 200000
2024-04-24 15:46:33.113409 - 210000
2024-04-24 15:57:06.067327 - 220000
2024-04-24 16:08:33.301760 - 230000
2024-04-24 16:20:33.705462 - 240000
2024-04-24 16:32:40.535635 - 250000
2024-04-24 16:45:21.863231 - 260000
2024-04-24 16:58:53.158460 - 270000
2024-04-24 17:12:42.266389 - 280000
2024-04-24 17:27:19.248467 - 290000
2024-04-24 17:42:47.452241 - 300000
2024-04-24 17:59:27.655567 - 310000
2024-04-24 18:17:17.746319 - 320000
2024-04-24 18:33:53.909339 - 330000
2024-04-24 18:52:57.680928 - 340000
2024-04-24 19:14:39.289043 - 350000
2024-04-24 19:35:08.270277 - 360000
2024-04-24 19:54:43.017115 - 370000
2024-04-24 20:13:42.961264 - 380000
2024-04-24 20:33:20.347295 - 390000
2024-04-24 20:56:59.894966 - 400000
2024-04-24 21:17:48.101210 - 410000
2024-04-24 21:38:56.231222 - 420000
2024-04-24 22:00:55.000092 - 430000
2024-04-24 22:23:35.493170 - 440000
2024-04-24 22:46:04.863907 - 450000
2024-04-24 23:09:47.498423 - 460000
2024-04-24 23:33:52.817692 - 470000
2024-04-24 23:58:12.698891 - 480000
2024-04-25 00:22:59.267369 - 490000
2024-04-25 00:47:50.491335 - 500000
2024-04-25 01:12:33.041361 - 510000
2024-04-25 01:38:09.916026 - 520000
2024-04-25 02:04:50.250077 - 530000
2024-04-25 02:31:41.127904 - 540000
2024-04-25 02:58:48.615762 - 550000
2024-04-25 03:27:25.706567 - 560000
2024-04-25 03:56:15.218333 - 570000
2024-04-25 04:25:21.620937 - 580000
2024-04-25 04:55:01.354357 - 590000
2024-04-25 05:26:04.941622 - 600000
2024-04-25 05:57:04.440315 - 610000
2024-04-25 06:28:57.347920 - 620000
2024-04-25 07:00:32.660273 - 630000
2024-04-25 07:31:09.560997 - 640000
2024-04-25 08:02:39.673085 - 650000
2024-04-25 08:35:36.235938 - 660000
2024-04-25 09:08:33.912291 - 670000
2024-04-25 09:42:15.802778 - 680000
2024-04-25 10:17:10.817673 - 690000
2024-04-25 10:52:20.507817 - 700000
2024-04-25 11:27:45.918646 - 710000
2024-04-25 12:04:25.350337 - 720000

The fact that the ingestion process slows down as the index grows is expected. However, it is crucial that the initial indexing can be completed within a reasonable amount of time. Here are a few things you can do to speed up the initial indexing.

We are currently testing our MTREE implementation with the ANN-BENCHMARK, and we have already identified a few improvements:

Insert by batch

Ensure you are inserting the embeddings in batches. With a dimension of 64, it should be possible to insert around 300 records per transaction. SurrealDB is a transactional database, which comes with the benefits of ACID properties but also bears a cost for each request. During the initial indexing, you can group your records so that each request creates a batch of them, and you can avoid SurrealDB returning the created values by using RETURN NONE. Each request should look something like this:

CREATE concept:1 SET embeddings=[1.0, 1.1, 1.2, ...] RETURN NONE;
CREATE concept:2 SET embeddings=[1.0, 1.1, 1.2, ...] RETURN NONE;
CREATE concept:3 SET embeddings=[1.0, 1.1, 1.2, ...] RETURN NONE;
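
From the Python client, such a batch can be sent as a single query string (a sketch; the batch size and ID scheme are illustrative, and since embeddings are plain floats there is nothing to escape):

async def insert_batch(db, batch, start_id):
    # One request = one transaction; RETURN NONE skips sending
    # the created records back over the wire.
    stmts = []
    for i, vec in enumerate(batch, start=start_id):
        values = ", ".join(str(float(x)) for x in vec)
        stmts.append(f"CREATE concept:{i} SET embeddings = [{values}] RETURN NONE;")
    await db.query("\n".join(stmts))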

MTREE optimisation

The MTREE index has a few parameters that will impact read and write operations differently. The index is a tree composed of nodes. By default, the size of a node is 40. You may try to increase or decrease this value to see how it impacts the ingestion process (in a range between 20 and 1000).

DEFINE INDEX embeddingsIndex ON TABLE concept FIELDS embeddings MTREE DIMENSION 64 CAPACITY 80;

Vector type

This is not yet available in version 1.4.2; it will probably be released in the 1.5 beta.

Depending on your data, you may want to use the most appropriate type. By default we use F64. Usually single precision with F32 is more than enough (F64 is only rarely required), and it typically halves memory usage.

DEFINE INDEX embeddingsIndex ON TABLE concept FIELDS embeddings MTREE TYPE F32 DIMENSION 64 CAPACITY 80;

--

I am also currently conducting these tests and will provide you with some feedback later today.

I just removed the previous suggestions on cache parameters. I am reassessing these options, as my initial partial results seem to be quite counterintuitive. I will provide more feedback about this once my benchmarking is complete.

Thank you for all the great information, very useful insights.

For clarification, I did try both creating the index after the records were loaded and batching. The only reason I used UPDATE was to get more visibility into what was going on. I do run into parsing issues when batching records; my best guess is that there are characters in the text that break JSON, but I haven't had time to troubleshoot, so it's only a guess. I didn't try batching in just the embeddings without the other fields, though; I can try that as well.

The MTREE node size seems like the most likely cause, since the records ingest very quickly without the index on. I'm guessing the issue I have now is due to node sizes being too small, so as the tree grows it takes more and more time to traverse. I'll see what happens when I increase the size. Please let me know how your benchmarking goes. I can't spend a lot of time on this, as I'm already behind on this project, so any insight that gets me past this quickly will be most helpful.

Leave it to me; I will come up with solutions. I am able to reproduce the issue and have identified some caveats. Two questions:

  • How do you ingest records into the database? Which client (Python, Node, Rust?) and which API are you using?
  • Do you need exact k-NN, or would you be satisfied with approximate k-NN? We are releasing HNSW in a few days, and since the index resides in memory, its performance level is much faster than MTree.

How do you ingest records into the database? Which client (Python, Node, Rust?) and which API are you using?

I use the Python module.

I'm not sure what you mean by API, but I'm going to guess you mean embeddings; if so, I'm using an open model from Nomic. They've optimized the model for easily truncating dimensions, which lets me use between 64 and 768 dimensions. Ideally I'll be able to use a mix of larger and smaller embeddings, so longer text gets the benefit of 768 dimensions while smaller items like classifications (I have 1.7M) use 64.

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
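
For the 64-dimension case, the truncation looks roughly like this (a simplified sketch; Nomic's model card recommends a task prefix and applies a layer normalization before truncation, so treat the exact steps as approximate):

import numpy as np

def embed_64(text: str) -> list[float]:
    # nomic-embed-text-v1.5 expects a task prefix on each input.
    vec = embedding_model.encode(f"search_document: {text}")
    vec = vec[:64]                   # keep the first 64 dimensions
    vec = vec / np.linalg.norm(vec)  # re-normalize after truncation
    return vec.tolist()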

Do you need exact k-NN, or would you be satisfied with approximate k-NN? We are releasing HNSW in a few days, and since the index resides in memory, its performance level is much faster than MTree.

I have been using HNSW with another DB and I think it's doing a great job. I haven't done any real testing against different indexes, so I don't have a point of reference. At this point, I'd be happy with anything that unblocks me from testing SurrealDB. I think it's a next-generation graph database with huge potential, and I'm excited to see if it's a good fit for our product.

This issue highlights the fact that our current cache implementation works only on reads. On writes, the cache is flushed on every transaction, so the index tree has to be partially reconstructed (and deserialized from the KV store) on each request.

I have opened a PR that implements this missing part. It mainly keeps the cache in memory between transactions. In my current tests (ANN benchmark, 100K embeddings with a vector dimension of 100), I observe close to a 3x improvement in write performance. Hopefully this PR will be merged within a day or two, and you will then be able to try our nightly version.

#3954

At the same time, we are in the final process of merging HNSW. That should land on nightly too this week. This is probably better aligned with your goal. HNSW is much faster. On my MacBook Pro (M2 12 cores, 64GB) I have been able to index SIFT (vector dimension 128, 1 million embeddings) in 24 minutes.

2024-04-29T11:52:01.333605Z  INFO surreal::env: Running 1.5.0+20240426.e64648ef for macos on aarch64
2024-04-29T11:52:01.333685Z  WARN surreal::dbs: ❌🔒 IMPORTANT: Authentication is disabled. This is not recommended for production use. 🔒❌
2024-04-29T11:52:01.333757Z  INFO surrealdb_core::kvs::ds: Starting kvs store in memory
2024-04-29T11:52:01.333799Z  INFO surrealdb_core::kvs::ds: Started kvs store in memory
2024-04-29T11:52:01.334329Z  INFO surrealdb_core::kvs::ds: Credentials were provided, and no root users were found. The root user 'ann' will be created
2024-04-29T11:52:01.358205Z  INFO surrealdb::net: Started web server on 127.0.0.1:8000
Got a train set of size (1000000 * 128)
Got 10000 queries
Ingesting vectors...
1000000 vectors ingested
Index construction done
Built index in 1441.3516681194305

#3353
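
Once HNSW is available, switching over should only require redefining the index. Something along these lines from the Python client (a hypothetical sketch; the final HNSW options and defaults may differ):

async def switch_to_hnsw(db):
    # Hypothetical: drop the MTREE definition and replace it with HNSW.
    await db.query(
        "REMOVE INDEX embeddingsIndex ON TABLE concept; "
        "DEFINE INDEX embeddingsIndex ON TABLE concept "
        "FIELDS embeddings HNSW DIMENSION 64;"
    )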

Both of these improvements are going to be part of the upcoming 1.5 release.