orbitdb / orbitdb

Peer-to-Peer Databases for the Decentralized Web

database record limits for acceptable lookup performance

koh-osug opened this issue · comments

I see that all the database implementations use the iterator for finding and querying records. E.g. the document-based implementation's get walks the iterator until it finds a matching key. I assume this linear search does not scale to several tens of thousands or millions of records. What are the limits for acceptable access performance? Are there any memory recommendations or a maximum number of records?
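For illustration, the linear lookup described above amounts to something like the following (a rough sketch only, not the actual Documents implementation; it assumes the iterator yields records shaped like `{ key, value }`):

```js
// Rough sketch of a linear lookup over a documents store via its iterator.
// Not the actual OrbitDB implementation; assumes each iterated record
// exposes { key, value }.
const findByKey = async (db, key) => {
  for await (const record of db.iterator()) {
    if (record.key === key) {
      return record.value // found after scanning, on average, half the log
    }
  }
  return undefined // worst case: the whole log was scanned
}
```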

I have tested this with 10,000 simple documents. It took 1 minute and 43 seconds to insert the data, which is pretty slow. Syncing the data to a different node took 57 seconds. Searching for document 5000, in the middle of the set, took 26 seconds. Any ideas how to improve this?
Since the search is linear, I would assume a search across 100,000 documents would then need 260 seconds.
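For reference, a minimal version of that kind of benchmark might look like this (a sketch only; it assumes Helia for IPFS and the @orbitdb/core 2.x API, and the database name and payloads are made up, not the reporter's actual code, which is in the linked repo):

```js
import { createHelia } from 'helia'
import { createOrbitDB } from '@orbitdb/core'

const ipfs = await createHelia()
const orbitdb = await createOrbitDB({ ipfs })
const db = await orbitdb.open('bench-docs', { type: 'documents' })

// Insert 10,000 simple documents and time it.
console.time('insert 10000')
for (let i = 0; i < 10000; i++) {
  await db.put({ _id: `doc-${i}`, content: `payload ${i}` })
}
console.timeEnd('insert 10000')

// Look up the document in the middle of the set and time it.
console.time('get doc-5000')
await db.get('doc-5000')
console.timeEnd('get doc-5000')

await db.close()
await orbitdb.stop()
await ipfs.stop()
```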

What database types are you testing against? If you are using KeyValue, what happens if you use KeyValueIndexed?
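For anyone following along, switching to the indexed key-value store might look roughly like this (a hedged sketch; it assumes the OrbitDB 2.x `KeyValueIndexed` export and the `Database` open option, so please check the current API docs, and the store names are made up):

```js
import { createOrbitDB, KeyValueIndexed } from '@orbitdb/core'

// Assumes `ipfs` is an existing Helia instance.
const orbitdb = await createOrbitDB({ ipfs })

// Plain KeyValue store.
const kv = await orbitdb.open('bench-kv', { type: 'keyvalue' })

// KeyValueIndexed keeps a persistent index so gets don't replay the log.
const kvIndexed = await orbitdb.open('bench-kv-indexed', {
  Database: KeyValueIndexed()
})

await kvIndexed.put('key-5000', 'value 5000')
console.log(await kvIndexed.get('key-5000'))
```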

Also, can you describe your benchmarking environment in more detail? For example, is this on Node.js or in the browser?

Any additional information you can provide about your benchmarking will help us set up something similar in our own benchmarks.

This is a test project: https://github.com/koh-osug/orbitdb-benchmark

I'm using Node.js v20.11.0 and the document store.

koh-osug/orbitdb-benchmark#1 (comment)

I was looking into some performance issues as well

see also: ipfs/kubo#10383

Regarding write speed:

We have noticed improved write speed when changing the entry, index and heads storage to MemoryStorage (from LevelStorage). This could be related to how LevelStorage writes to disk and will require further investigation.
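As a sketch of that configuration (assuming the OrbitDB 2.x storage options; the store name is made up, and note that MemoryStorage trades persistence for speed):

```js
import { createOrbitDB, MemoryStorage } from '@orbitdb/core'

// Assumes `ipfs` is an existing Helia instance.
const orbitdb = await createOrbitDB({ ipfs })

// Swap the default LevelStorage for in-memory storage on all three stores.
// Nothing is persisted to disk, so this trades durability for write speed.
const db = await orbitdb.open('bench-docs', {
  type: 'documents',
  entryStorage: await MemoryStorage(),
  indexStorage: await MemoryStorage(),
  headsStorage: await MemoryStorage()
})
```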

Regarding query speed:

The legacy "useRefs" has been removed. This should result in quicker lookups because it no longer needs to follow various ref paths.

We've also noticed that increasing the LRU cache size from 1000 to 10000 improves subsequent queries.
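To illustrate the pattern (a hedged sketch using the ComposedStorage/LRUStorage pieces from OrbitDB 2.x; the actual place the cache size is changed may be elsewhere in the defaults, and the path and store name are made up):

```js
import { ComposedStorage, LRUStorage, LevelStorage } from '@orbitdb/core'

// Assumes `orbitdb` is an existing OrbitDB instance (see earlier sketches).
// An LRU cache of 10,000 entries (instead of the default 1,000) in front of
// the persistent Level-backed entry storage.
const entryStorage = await ComposedStorage(
  await LRUStorage({ size: 10000 }),
  await LevelStorage({ path: './orbitdb/bench-docs/entries' })
)

const db = await orbitdb.open('bench-docs', { type: 'documents', entryStorage })
```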

Could you please install the next version of OrbitDB, which removes useRefs, and re-run your benchmarks to see whether you also see improved query times?

Since the search is linear, I would assume a search across 100,000 documents would then need 260 seconds.

Yes, you are correct. LevelDB (which is used for the index) does what it is intended to do: get an item quickly. However, it isn't so great at searching text. For that, something designed for text querying would be needed. Whether that is integrated into OrbitDB or handled by an external indexing solution is up for discussion.
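As one possible direction, an external index could be maintained alongside the document store. The following is only a rough sketch: the `content` field name is an assumption, and it assumes the documents iterator yields `{ value }` and that the store emits an 'update' event carrying the log entry.

```js
// Rough sketch of an external in-memory inverted index over a text field.
// A production setup would more likely use a dedicated search engine; this
// only illustrates the idea of indexing outside the store itself.
const buildTextIndex = async (db) => {
  const index = new Map() // word -> Set of document _id values

  const addDoc = (doc) => {
    for (const word of String(doc.content ?? '').toLowerCase().split(/\W+/)) {
      if (!word) continue
      if (!index.has(word)) index.set(word, new Set())
      index.get(word).add(doc._id)
    }
  }

  // Initial build from the current contents of the store.
  for await (const { value } of db.iterator()) {
    addDoc(value)
  }

  // Keep the index warm as new entries arrive locally or via replication.
  db.events.on('update', (entry) => {
    if (entry?.payload?.op === 'PUT') addDoc(entry.payload.value)
  })

  // Return a lookup function: word -> matching document ids.
  return (word) => [...(index.get(word.toLowerCase()) ?? [])]
}
```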

I was intending to figure out k-nearest-neighbors search myself; I will look into the changes you've referenced here.

I was working on sharding OrbitDB, and it took 8 shards 100 seconds to do 32k inserts on a Xeon E5 v4 2690. I tried to open the slave on my laptop on the same LAN using only mDNS, but the process failed with a "want for [random string] aborted" error. I will try again tomorrow with a direct dial and record how long it takes for the 32k records to be sent to the slave database, which is the source the KNN search will run against after the 32k records are ingested.
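For the KNN part, a brute-force pass over the replicated documents would look roughly like this (a sketch only; the `vector` field name is an assumption, and being O(n) per query it has the same scaling concern discussed above):

```js
// Brute-force k-nearest-neighbors over documents that carry a numeric
// `vector` field (the field name is an assumption, not part of OrbitDB).
const knn = async (db, queryVector, k = 5) => {
  const distance = (a, b) =>
    Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0))

  const scored = []
  for await (const { value } of db.iterator()) {
    if (Array.isArray(value.vector)) {
      scored.push({ _id: value._id, distance: distance(value.vector, queryVector) })
    }
  }

  // Sort ascending by distance and keep the k closest documents.
  return scored.sort((a, b) => a.distance - b.distance).slice(0, k)
}
```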