Is there a way to avoid unnecessary Arc cloning inside a map_ref?

Question

Is there a way to avoid unnecessary Arc cloning inside a map_ref?

CyrusAlbright opened this issue 2 years ago · comments

In a program I'm working on, an Arc with many common resources is passed around. In some of the reactive tasks I'm making with the SignalExt::for_each() trait method, it is necessary to be able to access this Arc, so I zip it into another Signal with the map_ref macro. This ends up looking like this:

struct GlobalState;

let global_state = std::sync::Arc::new(GlobalState);

let some_mutable = Mutable::new(0);

let combined_signal = futures_signals::map_ref! {
	let some = mutable.signal(),
	let global_state = futures_signals::signal::always(global_state.clone()) =>
	(*some, global_state)
};
// then do things with the combined_signal

This doesn't compile, because the reference to the Arc doesn't live long enough. So to get it to compile, I have to change it to this:

let combined_signal = futures_signals::map_ref! {
	let some = mutable.signal(),
	let global_state = futures_signals::signal::always(global_state.clone()) =>
	(*some, global_state.clone())
};

This clones the Arc every time the mutable signal changes, which is possibly very expensive. Is there a better way to do this that I'm not aware of, or is this the best I can do with the library as-is?

If I'm not mistaken, it should be theoretically possible for the global_state signal which never changes to just store a single Arc and hand out references to it whenever necessary, instead of needing to clone every time. I haven't looked into the soundness super thoroughly, but I just think that this is unnecessarily painful for what is possibly a straightforward and simple change.

Pauan · Answer 1 · Thu Jun 09 2022 01:06:09 GMT+0800 (China Standard Time)

This clones the Arc every time the mutable signal changes, which is possibly very expensive.

Why would it be expensive? Cloning an Arc just increments an integer. It's about the cheapest operation you can get.

Atomic operations do have some overhead, but they're extremely fast (far far far faster than locking). And you can use Rc instead, which doesn't have any atomic operations at all.

I can't think of any situation where the cost of an atomic fetch-add would be the bottleneck in your program. Even hard-realtime code can use atomic operations, since they are so fast and predictable in performance. It looks like the performance of a fetch-add is about 20 nanoseconds (that's 0.00002 milliseconds).

Getting back to your question, instead of wrapping your global_state in always, you can just move it into the map_ref:

let global_state = global_state.clone();
let combined_signal = map_ref! {
    let some = mutable.signal() => move
    (*some, global_state.clone())
};

dominator has a clone! macro which makes this easier:

let combined_signal = clone!(global_state => map_ref! {
    let some = mutable.signal() => move
    (*some, global_state.clone())
});

Since you just have a single Signal, you no longer need map_ref:

let combined_signal = mutable.signal().map(clone!(global_state => move |some| {
    (some, global_state.clone())
}));

Also, it seems odd to me that you're returning the global_state from the Signal. That just feels like a code smell. Why are you returning the global_state like that?

Drew Christensen · Answer 2 · Thu Jun 09 2022 04:16:33 GMT+0800 (China Standard Time)

Just to be clear, cloning an Arc, especially unnecessarily, actually can be very expensive (and as with most things in performance, benchmarking it in isolation will never tell you this), because it's a guaranteed cache miss and an atomic operation (and on some platforms potentially a lock). It's definitely not the same as "just incrementing a number," as you phrased it. In languages like Swift where nearly everything is atomically reference counted, Arc increments and decrements can take up a huge proportion of running program time at around 32% in real use cases (not microbenchmarks).

Second, I was not aware that it was possible to do this with map_ref. Is it possible to better document this in the map_ref macro page? Macros, especially nested macros, are famously hard to decode from the outside.

As to why I'm doing it, it's because I run into the exact same issue if I do SignalExt::for_each() on a signal and need to access the Arc inside the for_each, namely that it needs to clone the Arc every time in order to compile. I was looking for a way to avoid this.

Pauan · Answer 3 · Thu Jun 09 2022 05:03:13 GMT+0800 (China Standard Time)

Arc increments and decrements can take up a huge proportion of running program time at around 32% in real use cases (not microbenchmarks).

That sounds completely ridiculous, I'm sure there is something more going on there than just some atomic fetch-adds. On any sort of non-trivial program, something like a fetch-add shouldn't be such a big bottleneck.

Perhaps Swift is doing something weird, or perhaps it's on a specialized embedded architecture that has really bad atomic performance. So I have no clue what Swift is doing, but Rust is not Swift.

The performance cost of Signals (and the Futures executor system) should completely dwarf something as small as a fetch-add. Even just updating a single Mutable will be 5-10 times slower than a fetch-add (and that's excluding the Signal and Futures machinery):

Mutable add             time:   [5.6362 ns 5.6523 ns 5.6724 ns]
atomic fetch add        time:   [1.5696 ns 1.5815 ns 1.5933 ns]
normal add              time:   [0.2173 ns 0.2185 ns 0.2196 ns]

If you're worrying about fetch-add, then you have much bigger problems, and futures-signals is probably not fast enough for your use case. In fact, any sort of asynchronous system will likely be too slow for you. At that point we're talking about something like a hard real-time program on embedded hardware, where every single cycle matters.

Of course if you're not using multi-threading, then you can use Rc to completely avoid the performance cost of atomics.

So as you say, I recommend running benchmarks to verify whether it's actually affecting performance or not. Rust has quite good benchmark crates.

Second, I was not aware that it was possible to do this with map_ref. Is it possible to better document this in the map_ref macro page? Macros, especially nested macros, are famously hard to decode from the outside.

Yup, better documentation is on my TODO list, I just have to find the time for it.

In the next major version of futures-signals I plan to make move the default, because people (including myself) keep having problems with it.

As to why I'm doing it, it's because I run into the exact same issue if I do SignalExt::for_each() on a signal and need to access the Arc inside the for_each, namely that it needs to clone the Arc every time in order to compile. I was looking for a way to avoid this.

That shouldn't be the case, you should be able to just move it into the closure, like usual:

let global_state = global_state.clone();

some_signal.for_each(move |value| {
    // You can use global_state without cloning
    async {}
}).await;

If you need to use it inside of the async block for some reason, then you will probably need to clone:

let global_state = global_state.clone();

some_signal.for_each(move |value| {
    let global_state = global_state.clone();

    async move {
        // Use global_state inside of here
    }
}).await;

You might be able to avoid the clone, but I haven't tested it.

When writing async code, 99.9+% of the time you must use owned values, because references violate memory safety. This is a restriction on all asynchronous code in Rust, not just futures-signals. Even things like an event listener closure must use owned values.

And since you need to use owned values, move is always needed on closures. And using Arc/Rc is often necessary as well.

Internally Mutable uses an Arc, because the data has to be shared between the Mutable and the Signals. You just can't really avoid Arc/Rc in async code.

However, the performance of async code in Rust is still miles better than any other language, because Arc is quite fast, Futures/Signals/Streams are zero-cost, and the Futures executor system is also very lightweight.

Pauan · Answer 4 · Thu Jun 09 2022 05:30:30 GMT+0800 (China Standard Time)

For fun, I tried benchmarking the full Signal + Future executor system:

Signal futures      time:   [70.119 ns 70.255 ns 70.379 ns]
Signal async-std    time:   [70.835 ns 71.213 ns 71.609 ns]
Signal smol         time:   [69.921 ns 70.059 ns 70.198 ns]

This is just updating a single Mutable, and then receiving the Signal update using a for_each, so it's pretty much the simplest thing you can do:

let m = Mutable::new(5);
let s = m.signal();

{
    let mut l = m.lock_mut();
    *l += 1;
}

s.for_each(|_value| { async {} })

And it's 45 times slower than a fetch-add. Of course 70 ns is still absurdly fast, that's only 0.00007 milliseconds. That's suitable for any situation other than the most hardcore embedded systems.

If you're curious, here is the benchmark code I used.

Pauan · Answer 5 · Thu Jun 09 2022 08:54:24 GMT+0800 (China Standard Time)

By the way, I want to make one thing very clear: even if you are using multi-threading in your program, as long as your global_state is not being used in multiple threads, then you can use Rc for the global_state.

Rust has super-fine-grained multithreading, you can have some parts of your program multi-threaded and some parts not. The compiler will prevent you from accidentally using an Rc across multiple threads, so it can only be used within the thread where it is created.

And most Future executors allow for running non-Send Futures. For example in the futures crate and the wasm-bindgen-futures crate. So there's no issue with using Rc, you only need Arc if you're actually sending that specific value across multiple threads.

And if you do use Rc, then you can't use static to create global variables, instead you have to use thread_local!.