dgraph-io / badger

Fast key-value DB in Go.

Home Page: https://dgraph.io/badger

[QUESTION]: one db relying on prefix scans vs one db per "collection"?

lrondanini opened this issue · comments

commented

Question.

Apologies for writing here; I tried Discord first but got no help.

Hi everybody, I'm new to Badger and I have a quick question: how should I manage collections/tables with BadgerDB? Should I create a DB for each table, or should I use prefix scans and store everything in the same DB? Assume "tables" are user-defined and can grow indefinitely (for small data sets the answer is obvious).

Thanks a lot

BadgerDB is a key-value database, which means it doesn't have a concept of "tables" or "collections" like in a relational database. Instead, all data is stored in a flat structure, where each data item is associated with a unique key.

To simulate the idea of tables or collections, you can use prefixes on your keys. For example, if you have "customers" and "orders" data, you could prefix your keys with "customers:" and "orders:", respectively. Then, when you want to fetch all data related to "customers", you can do a prefix scan for "customers:".
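As a minimal sketch (assuming Badger v4's Go API; the `customers:` keys are illustrative), writing into and scanning such a "collection" looks like this:

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Open a single DB; all "collections" share one keyspace.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a few keys under the "customers:" prefix.
	err = db.Update(func(txn *badger.Txn) error {
		if err := txn.Set([]byte("customers:1"), []byte("Alice")); err != nil {
			return err
		}
		return txn.Set([]byte("customers:2"), []byte("Bob"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Fetch everything in the "customers" collection with a prefix scan.
	err = db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		prefix := []byte("customers:")
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			item := it.Item()
			if err := item.Value(func(v []byte) error {
				fmt.Printf("%s = %s\n", item.Key(), v)
				return nil
			}); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```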

Creating a separate BadgerDB instance for each "table" or "collection" is not recommended: it is inefficient in terms of resource usage and can unnecessarily complicate your application. You can try it, but that just means more work for you.

Even if your "tables" can grow indefinitely, it is still recommended to use a single instance of BadgerDB and manage different "tables" or "collections" using key prefixes.
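One detail worth flagging if table names are user-defined: the separator should be a byte that cannot appear in a table name, otherwise the keyspaces of tables like "orders" and "orders2" can overlap. A hypothetical key-builder along these lines (the helper names are mine, not part of Badger):

```go
// tableKey composes a prefixed key for a user-defined table. It assumes
// table names never contain the 0x00 byte, so using it as a separator
// keeps one table's key range from overlapping another's
// (e.g. "orders" vs. "orders2").
func tableKey(table string, key []byte) []byte {
	k := make([]byte, 0, len(table)+1+len(key))
	k = append(k, table...)
	k = append(k, 0x00)
	return append(k, key...)
}

// tablePrefix returns the prefix used to scan a whole table.
func tablePrefix(table string) []byte {
	return append([]byte(table), 0x00)
}
```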

commented

Thanks a lot Michel, that makes sense.

But - if I'm correct - using prefixes means that every time I have to find "orders" by range, I will need to scan all the orders (or at least until I find the ones I was searching for). In other words, I cannot rely on seek to get a direct entry point for a search.

To give you some context, I'm building a sort of Cassandra clone. I can handle the complexity of having a BadgerDB per collection, but I'm really scared it would be extremely resource-intensive. Can you give me an idea about this? I'm assuming BadgerDB stores only the keys in RAM, am I right?

You are right about that: using prefixes means that each time you need to find "orders" by range, you will be scanning within the "orders" keyspace. However, Badger is optimized for these kinds of prefix-scan operations and is quite efficient at them. From what I remember, we use prefixes in the multi-tenant logic in Dgraph, with leading zero bytes as the prefix for each tenant. We run shared instances in our Cloud with several users, and the impact is minimal.
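One note worth adding: the iterator's `Seek` accepts any key, so a range query inside a prefix can start at the first key of the range rather than at the top of the prefix. A minimal sketch, assuming hypothetical sortable keys of the form `orders:YYYY-MM-DD/<id>` so lexicographic order matches date order:

```go
import (
	"bytes"
	"fmt"

	badger "github.com/dgraph-io/badger/v4"
)

// ordersInRange scans a date range inside the "orders:" prefix.
// start is the first key of the range, end is an exclusive upper bound,
// e.g. []byte("orders:2024-01-01") and []byte("orders:2024-02-01").
func ordersInRange(db *badger.DB, start, end []byte) error {
	return db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()

		prefix := []byte("orders:")
		// Seek jumps straight to the start of the range; there is no
		// need to walk the whole "orders:" keyspace from the top.
		for it.Seek(start); it.ValidForPrefix(prefix); it.Next() {
			key := it.Item().Key()
			if bytes.Compare(key, end) >= 0 {
				break // past the (exclusive) end of the range
			}
			fmt.Printf("found %s\n", key)
		}
		return nil
	})
}
```

If I remember correctly, you can also set the `Prefix` field of `badger.IteratorOptions` so Badger can skip tables that cannot contain the prefix at all.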

Badger does not store all keys in RAM. Instead, it stores data in an LSM tree (log-structured merge-tree), a structure optimized for write-heavy workloads that minimizes the amount of data held in RAM. However, Badger does keep some metadata in memory to optimize reads: information about the structure and contents of the database that helps it quickly locate the data needed to satisfy a read request.

For instance, Badger maintains an in-memory index of each SSTable's blocks, along with the layout of the LSM tree itself. The block index lets Badger quickly locate the block that might contain a given key, avoiding unnecessary disk reads, and keeping the tree layout in memory lets it navigate straight to the data it needs.

These optimizations do consume some memory, but they can significantly improve read performance, especially for workloads that involve frequent random reads.
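If RAM usage is the main worry, these are the knobs I would look at first; a minimal sketch against Badger v4's options API (the sizes are illustrative, not recommendations):

```go
import badger "github.com/dgraph-io/badger/v4"

// openTuned opens a DB with the memory-related knobs made explicit.
func openTuned(dir string) (*badger.DB, error) {
	opts := badger.DefaultOptions(dir).
		WithBlockCacheSize(256 << 20). // cache for decompressed SSTable blocks
		WithIndexCacheSize(100 << 20). // cache for table indexes and bloom filters
		WithMemTableSize(64 << 20).    // size of each in-memory write buffer
		WithNumMemtables(5)            // how many memtables may be held in RAM
	return badger.Open(opts)
}
```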

commented

Thanks a lot Michel, really helpful