Shared memory storage of BptreeMap

Question

NewbiZ opened this issue 2 years ago · comments

Are there plans to support storing a BptreeMap in shared memory, or would it actually be safe to do so currently?

I would assume the underlying structures are full of references into the local process address space, but having these structs rely solely on object-local indices would allow storage in shared memory, opening the door to readers running as separate processes.

In my use case, I receive 24/7 streaming data and perform some stateful analytics on it, which I share with multiple reader threads via BptreeMaps. The issue I am facing is that the readers change frequently (new ones, deprecated ones, modifications, etc.), and I cannot afford to rebuild, redeploy and restart the server every time: that is too much downtime and lost data. Having a single reliable server exposing BptreeMaps in shared memory, where separate, autonomous reader processes can come and go at will, would be a perfect solution.

Firstyear · Answer 1 · Fri Oct 28 2022 08:31:27 GMT+0800 (China Standard Time)

This would be unsafe for a variety of reasons, mostly because the separate reader processes would need to agree on the layout of the keys/structs you are storing in the bptreemap. By default there is no way to achieve this, as there is no trait bound that can force repr(C) or similar between processes. You'd probably need serde to serialise/deserialise, and then you have to account for data version changes, format changes, etc.
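To make the layout point concrete, here is a minimal sketch of what "agreeing on the layout" means for a `#[repr(C)]` struct. The `Tick` type and `tick_layout` helper are hypothetical names invented for illustration; they are not part of concread.

```rust
use std::mem::{align_of, size_of, MaybeUninit};
use std::ptr::addr_of;

/// Hypothetical value type a producer might publish to readers.
/// #[repr(C)] fixes field order and padding according to the C ABI.
#[repr(C)]
pub struct Tick {
    pub flags: u8,  // offset 0, followed by 3 bytes of padding
    pub price: u32, // offset 4
    pub venue: u16, // offset 8, followed by 2 bytes of tail padding
}

/// Returns (size, alignment, field offsets) of `Tick` as compiled into this binary.
pub fn tick_layout() -> (usize, usize, [usize; 3]) {
    let slot = MaybeUninit::<Tick>::uninit();
    let base = slot.as_ptr();
    // SAFETY: addr_of! only computes field addresses; no uninitialised data is read.
    let offsets = unsafe {
        [
            addr_of!((*base).flags) as usize - base as usize,
            addr_of!((*base).price) as usize - base as usize,
            addr_of!((*base).venue) as usize - base as usize,
        ]
    };
    (size_of::<Tick>(), align_of::<Tick>(), offsets)
}
```

Without `#[repr(C)]` the compiler is free to reorder the fields, so two separately built binaries would have no guarantee of producing the same numbers.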

All of that aside, using this structure in shared memory is not possible as-is today. The primary reasons are, first, the locking of the root node during transaction creation, which would need to change to a multi-process lock, and second, that Arc in Rust is likely not designed to be backed by shared memory, meaning that the reference counts would in some cases not sync, which is a required property of the linear drop copy-on-write cell (lincowcell). However, we have a long-standing issue to change to hazard pointers, which would eliminate this reliance on Arc and be better suited to what you want to achieve here.
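As a rough illustration of why the root lock and the Arc reference counts both matter, here is a heavily simplified copy-on-write cell. `CowCell` is a hypothetical stand-in written for this thread, not concread's lincowcell, and it copies the whole map on every write, which a real tree would not do.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a copy-on-write cell; not concread's implementation.
pub struct CowCell {
    // In-process lock on the root. A shared-memory version would need a
    // lock that works across processes instead.
    root: Mutex<Arc<BTreeMap<u32, u32>>>,
}

impl CowCell {
    pub fn new() -> Self {
        CowCell { root: Mutex::new(Arc::new(BTreeMap::new())) }
    }

    /// Readers lock the root only long enough to clone the Arc,
    /// which increments the reference count of the current version.
    pub fn read(&self) -> Arc<BTreeMap<u32, u32>> {
        Arc::clone(&self.root.lock().unwrap())
    }

    /// The writer copies the current version, edits the copy, then swaps the root.
    /// The old version is freed when the last reader drops its Arc.
    pub fn insert(&self, key: u32, value: u32) {
        let mut guard = self.root.lock().unwrap();
        let mut next: BTreeMap<u32, u32> = (**guard).clone();
        next.insert(key, value);
        *guard = Arc::new(next);
    }
}
```

A reader that called `read()` before an `insert` keeps seeing its old snapshot, and that snapshot stays alive only because its count is accurate. If a second process could not see or update that count, an old version could be freed while still in use.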

As a result, this would likely have to be a pretty large rewrite to support mapping and coordination into shared memory.

It's also worth noting that at the point you are using processes rather than threads, many of the benefits of this structure go away. Within a threaded process, the benefit of a copy-on-write structure like this is that tree nodes fit a cacheline, and they can be marked Shared by your CPU so they are viable on many threads in parallel. When you split this into processes you may have the same "tree nodes" in many processes, but they are separate cache lines and thus become un-shared, so you have less cache sharing. This means duplicate data in your cache, and so less overall storage of the data in your cache.

You may be better off with something like https://crates.io/crates/lmdb-rkv-sys instead which is a multi-process database. LMDB actually inspired some of the ideas in this crate, but this crate is targeted at memory-only datastructures without persistence.

In a way you have to choose. If you want "the best in performance", then you probably need to find a way to work with a threaded process, because this is the optimal way to work with a modern CPU and its behaviour. If you want "processes" for developer convenience, then you will probably be sacrificing some performance for this and should use an embedded database.

Hope that helps :)

Aurélien Vallée · Answer 2 · Fri Oct 28 2022 09:25:10 GMT+0800 (China Standard Time)

> This would be unsafe for a variety of reasons, mostly because the separate reader processes would need to agree on the layout of the keys/structs you are storing in the bptreemap. By default there is no way to achieve this, as there is no trait bound that can force repr(C) or similar between processes. You'd probably need serde to serialise/deserialise, and then you have to account for data version changes, format changes, etc.

Well yes, though I think this is more of a theoretical problem than a practical one. The very same happens for any networking protocol as well: you need to make sure producer and consumer are well aware of the memory layout of the shared structs. In practice though, this setup is very much tailored toward a single stable producer and a multitude of autonomous readers. That is, it is expected that the shared structs pretty much never change, and any change will require a rebuild and restart of both the producer and consumers. I think overall it is an okay invariant to trust the developers to ensure by themselves. This kind of design is typically implemented with __attribute__((packed)) in C, or similarly repr(C) in Rust, I guess.

> It's also worth noting that at the point you are using processes rather than threads, many of the benefits of this structure go away. Within a threaded process, the benefit of a copy-on-write structure like this is that tree nodes fit a cacheline, and they can be marked Shared by your CPU so they are viable on many threads in parallel. When you split this into processes you may have the same "tree nodes" in many processes, but they are separate cache lines and thus become un-shared, so you have less cache sharing. This means duplicate data in your cache, and so less overall storage of the data in your cache.

I would tend to disagree with you here, but I have very little experience with Windows so take my argument with a grain of salt. First, there is very little difference between a thread and a process on Linux. Apart from some accounting structures related to process groups and parent IDs, a thread is pretty much just a new process with an identical memory mapping. Second, physical memory is guaranteed to be contiguous within a memory page, and aligned on cache line boundaries, so I don't see how mapping these pages in multiple processes would cause invalidation. Cache lines are bound to physical addresses, and TLBs are bound to pages.

> In a way you have to choose. If you want "the best in performance", then you probably need to find a way to work with a threaded process, because this is the optimal way to work with a modern CPU and its behaviour.

Again I strongly disagree here.

I understand your concerns about the codebase though, and the impracticality of making these changes, so I will stick to a single multithreaded process :)

Firstyear · Answer 3 · Fri Oct 28 2022 09:30:58 GMT+0800 (China Standard Time)

> Well yes, though I think this is more of a theoretical problem than a practical one. The very same happens for any networking protocol as well: you need to make sure producer and consumer are well aware of the memory layout of the shared structs. In practice though, this setup is very much tailored toward a single stable producer and a multitude of autonomous readers. That is, it is expected that the shared structs pretty much never change, and any change will require a rebuild and restart of both the producer and consumers. I think overall it is an okay invariant to trust the developers to ensure by themselves. This kind of design is typically implemented with __attribute__((packed)) in C, or similarly repr(C) in Rust, I guess.

While it's possible to resolve with repr(C) there is no way to enforce that at compile time via a trait, meaning there is no way to enforce that a user is correctly using the api. Thus it would all need to be marked as "unsafe" if it were to exist.
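One way to picture "it would all need to be marked unsafe" is an unsafe marker trait, where the implementer rather than the compiler vouches for the layout. This is only a sketch: `SharedLayout`, `Quote`, `publish` and `load` are hypothetical names, and a plain byte slice stands in for a real shared mapping.

```rust
use std::mem::size_of;

/// Hypothetical marker trait. Implementing it is a promise the compiler cannot
/// check: the type is #[repr(C)], holds no pointers or references, accepts any
/// bit pattern, and is defined identically in every process using the region.
pub unsafe trait SharedLayout: Copy + 'static {}

#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Quote {
    pub bid: u64,
    pub ask: u64,
}

// SAFETY: Quote is #[repr(C)] and contains only plain integers.
unsafe impl SharedLayout for Quote {}

/// Copies a value into a byte buffer standing in for a shared mapping.
pub fn publish<T: SharedLayout>(region: &mut [u8], value: T) -> bool {
    if region.len() < size_of::<T>() {
        return false;
    }
    // SAFETY: length checked above; write_unaligned tolerates any alignment.
    unsafe { std::ptr::write_unaligned(region.as_mut_ptr() as *mut T, value) };
    true
}

/// Reads a value back out of the buffer, trusting the SharedLayout promise.
pub fn load<T: SharedLayout>(region: &[u8]) -> Option<T> {
    if region.len() < size_of::<T>() {
        return None;
    }
    // SAFETY: length checked above; validity of the bytes rests on SharedLayout.
    Some(unsafe { std::ptr::read_unaligned(region.as_ptr() as *const T) })
}
```

The trait bound keeps arbitrary types out of `publish` and `load`, but nothing stops someone from writing an incorrect `unsafe impl`, which is exactly the limitation described above.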

> I would tend to disagree with you here, but I have very little experience with Windows so take my argument with a grain of salt. First, there is very little difference between a thread and a process on Linux. Apart from some accounting structures related to process groups and parent IDs, a thread is pretty much just a new process with an identical memory mapping. Second, physical memory is guaranteed to be contiguous within a memory page, and aligned on cache line boundaries, so I don't see how mapping these pages in multiple processes would cause invalidation. Cache lines are bound to physical addresses, and TLBs are bound to pages.

I would want to do some more research and testing here to be certain.

> In a way you have to choose. If you want "the best in performance", then you probably need to find a way to work with a threaded process, because this is the optimal way to work with a modern CPU and its behaviour.

> Again I strongly disagree here.

> I understand your concerns about the codebase though, and the impracticality of making these changes, so I will stick to a single multithreaded process :)

Yeah, I think if you want "multi-process" you should look at LMDB instead, it's better for what you want here. :)

A way to make this work with a "single process" would be to dlopen plugins and load/eject them, but that also needs some repr(C) and unsafe.

Aurélien Vallée · Answer 4 · Fri Oct 28 2022 09:57:50 GMT+0800 (China Standard Time)

Disclaimer: Just talking for the sake of an interesting discussion here, I understand the change is too impactful in terms of research time and architecture changes.

> While it's possible to resolve with repr(C) there is no way to enforce that at compile time via a trait, meaning there is no way to enforce that a user is correctly using the api. Thus it would all need to be marked as "unsafe" if it were to exist.

Indeed, I guess the easiest solution if you want to enforce this would be to have a macro output a hash of the memory layout of a struct (e.g. type and padding of each field), and check at runtime that the process-local struct hash matches the shared struct hash. This could be done once when the struct is mapped.
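A rough sketch of that runtime check, with the field list written out by hand where a macro would generate it. All names here (`fingerprint`, `TickV1`, `TickV2`, `check_header`) are hypothetical, and FNV-1a is simply one hash that is easy to write out so that every process computes the same value.

```rust
use std::mem::{align_of, size_of};

const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a over a byte slice, continuing from a previous hash state.
fn fnv1a(mut hash: u64, bytes: &[u8]) -> u64 {
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical fingerprint: struct size and alignment, then each field's
/// type name and size in declaration order.
pub fn fingerprint(size: usize, align: usize, fields: &[(&str, usize)]) -> u64 {
    let mut h = fnv1a(FNV_OFFSET, &(size as u64).to_le_bytes());
    h = fnv1a(h, &(align as u64).to_le_bytes());
    for (ty, field_size) in fields {
        h = fnv1a(h, ty.as_bytes());
        h = fnv1a(h, &(*field_size as u64).to_le_bytes());
    }
    h
}

#[repr(C)]
pub struct TickV1 { pub price: u32, pub qty: u32 }

#[repr(C)]
pub struct TickV2 { pub price: u64, pub qty: u32 }

pub fn tick_v1_fingerprint() -> u64 {
    fingerprint(size_of::<TickV1>(), align_of::<TickV1>(), &[("u32", 4), ("u32", 4)])
}

pub fn tick_v2_fingerprint() -> u64 {
    fingerprint(size_of::<TickV2>(), align_of::<TickV2>(), &[("u64", 8), ("u32", 4)])
}

/// Reader-side check, run once when the region is mapped: `stored` is the
/// value the producer wrote into the region header, `local` is the reader's own.
pub fn check_header(stored: u64, local: u64) -> Result<(), String> {
    if stored == local {
        Ok(())
    } else {
        Err(format!("layout mismatch: region {stored:#x}, local {local:#x}"))
    }
}
```

A reader built against `TickV2` that maps a region written by a `TickV1` producer would get an `Err` from `check_header` and could refuse to attach instead of misreading the data.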

> I would want to do some more research and testing here to be certain.

I think a good start is to compare the clone parameters of fork vs pthread_create. For what it's worth, you could very well force the same virtual memory mapping between producer and consumer processes for the shared structs, at a fixed address in both. I guess theoretically, with a custom allocator for the shared structs and the guarantee that no external references are made in these structs, you could contain all allocations to a dedicated set of pages that could be mapped at a fixed address in other processes and have it just magically work. I don't have enough knowledge of Rust to be sure that it would not dangle references all around though.
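An alternative to fixing the mapping address is to make the data position-independent by linking nodes through indices instead of pointers. The sketch below uses a plain binary search tree in a `Vec` as a stand-in for a shared region; `Node`, `Arena` and `NIL` are hypothetical names, and this is not a B+tree or anything concread does today.

```rust
/// Sentinel index meaning "no child".
pub const NIL: u32 = u32::MAX;

/// Hypothetical node that refers to its children by index into the arena
/// rather than by pointer, so it means the same thing at any base address.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Node {
    pub key: u64,
    pub value: u64,
    pub left: u32,
    pub right: u32,
}

/// The Vec stands in for a contiguous shared-memory region.
#[derive(Clone)]
pub struct Arena {
    nodes: Vec<Node>,
    root: u32,
}

impl Arena {
    pub fn new() -> Self {
        Arena { nodes: Vec::new(), root: NIL }
    }

    pub fn insert(&mut self, key: u64, value: u64) {
        let new_idx = self.nodes.len() as u32;
        let fresh = Node { key, value, left: NIL, right: NIL };
        if self.root == NIL {
            self.nodes.push(fresh);
            self.root = new_idx;
            return;
        }
        let mut cur = self.root;
        loop {
            let node = self.nodes[cur as usize]; // Node is Copy
            if key == node.key {
                self.nodes[cur as usize].value = value;
                return;
            }
            let go_left = key < node.key;
            let next = if go_left { node.left } else { node.right };
            if next == NIL {
                self.nodes.push(fresh);
                let parent = &mut self.nodes[cur as usize];
                if go_left { parent.left = new_idx } else { parent.right = new_idx }
                return;
            }
            cur = next;
        }
    }

    pub fn get(&self, key: u64) -> Option<u64> {
        let mut cur = self.root;
        while cur != NIL {
            let node = &self.nodes[cur as usize];
            if key == node.key {
                return Some(node.value);
            }
            cur = if key < node.key { node.left } else { node.right };
        }
        None
    }
}
```

Cloning the arena moves every node to a different heap address, yet lookups on the clone still work because no node stores an address. The same property is what would let a reader map the region wherever its address space allows.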

> Yeah, I think if you want "multi-process" you should look at LMDB instead, it's better for what you want here. :)

Sure, I will have a look, though it looks like a redis-like solution, and as I am looking for the lowest possible latency, I'm wary of introducing a process in between producer and consumer.

Firstyear · Answer 5 · Tue Nov 01 2022 13:32:30 GMT+0800 (China Standard Time)

> Disclaimer: Just talking for the sake of an interesting discussion here, I understand the change is too impactful in terms of research time and architecture changes.

> While it's possible to resolve with repr(C) there is no way to enforce that at compile time via a trait, meaning there is no way to enforce that a user is correctly using the api. Thus it would all need to be marked as "unsafe" if it were to exist.

> Indeed, I guess the easiest solution if you want to enforce this would be to have a macro output a hash of the memory layout of a struct (e.g. type and padding of each field), and check at runtime that the process-local struct hash matches the shared struct hash. This could be done once when the struct is mapped.

You'd need it to be a trait though, so it could be a proc-macro that adds the trait or similar I guess?

> I would want to do some more research and testing here to be certain.

> I think a good start is to compare the clone parameters of fork vs pthread_create. For what it's worth, you could very well force the same virtual memory mapping between producer and consumer processes for the shared structs, at a fixed address in both. I guess theoretically, with a custom allocator for the shared structs and the guarantee that no external references are made in these structs, you could contain all allocations to a dedicated set of pages that could be mapped at a fixed address in other processes and have it just magically work. I don't have enough knowledge of Rust to be sure that it would not dangle references all around though.

Well, that's pretty much what happens already. We have a set of pages that are touched in a transaction and we reference / track when they can be cleaned in order. So you'd just put the hazard pointer table in the same shared memory as all the keys/values. The big thing would actually also be enforcing that users are putting their keys/values in the same shared memory, else some processes couldn't see values inserted by another. So you'd need to have a way to enforce that the same allocator was used (I think the core/std library has ways to do this, so I think that's already possible).

I think if I were to undertake this, it'd be a "new" version of the btree rather than trying to adapt this one, that way all these low level allocator issues are taken care of from the start. If it happens that the "new" version works with single process by swapping the allocator back to the primary global one, then that'd be a win too. Sadly the enemy of all things is time :(

> Yeah, I think if you want "multi-process" you should look at LMDB instead, it's better for what you want here. :)

> Sure, I will have a look, though it looks like a redis-like solution, and as I am looking for the lowest possible latency, I'm wary of introducing a process in between producer and consumer.

Yeah, for lowest latency, you need threads and single process IMO.