NVIDIA / stdexec

`std::execution`, the proposed C++ framework for asynchronous and parallel programming.

Upper bound on lifetime of operation states

msimberg opened this issue · comments

TL;DR operation states currently have a lower bound on their lifetimes (they must be kept alive at least until one of set_value/set_error/set_stopped is called on an associated receiver). Should they also have an upper bound?

  • Should some adaptors release operation states as soon as they can? Typical examples of adaptors that could do this are the "two-part" adaptors like schedule_from, let_value, ensure_started, split, when_all (a rough sketch of what releasing early could look like follows this list).
  • Should all adaptors release operation states as soon as they can?
  • Is it always safe to release operation states early, or can it lead to issues elsewhere? The order of destruction changes, but I don't know whether that can cause problems somewhere.

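To make "release as soon as they can" concrete, here is a rough sketch of a two-part adaptor's operation state. This is not stdexec's actual implementation and the names are made up; it only illustrates the idea of holding the first part's operation state in std::optional and destroying it from the internal receiver's completion, once the results have been moved into the adaptor's own storage, instead of keeping it until the whole operation state is destroyed.

  #include <optional>

  // Hypothetical two-part adaptor operation state (illustration only; not
  // stdexec's actual implementation).
  template <class OpState1, class OpState2>
  struct two_part_op_state {
    // Part one: connects the predecessor sender to an internal receiver.
    std::optional<OpState1> op_state1;
    // Part two: e.g. the schedule operation in schedule_from.
    std::optional<OpState2> op_state2;

    // Called by the internal receiver once part one has completed and its
    // results have been moved into this operation state's own storage.
    // Destroying op_state1 here, instead of waiting for ~two_part_op_state(),
    // is the "early release" asked about above; whether doing so is always
    // safe is exactly the open question.
    void on_part_one_complete() {
      op_state1.reset();
      // ...start *op_state2 here...
    }
  };
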
FWIW, in HPX and pika we release some operation states early to avoid problems like the ones below. We rediscovered these problems when trying to use stdexec in place of our current implementation.

A few motivating examples below.

split

Once the r receiver of split has received a completion signal, op_state2 does not necessarily need to be kept around, because the values sent by the predecessor have already been stored in the split shared state (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2300r7.html#spec-execution.senders.adapt.split). stdexec currently keeps op_state2 around until the shared state itself is released. This can lead to some, IMO, surprising behaviour.

The following example avoids deep recursion in the "forward" direction by explicitly transferring work to new tasks. It creates a DAG that roughly looks like this:

o_1 --> o --> o --> o ... o_n
   \     \     \     \
    o     o     o     o ...

The unintuitive thing is that even though the work itself runs without deep recursion, once all the work is done there is recursion proportional to n when the operation states of o_n back to o_1 are released. On regular OS threads this isn't a huge problem, but the stack is easy to exhaust e.g. on fibers, where stacks are much smaller.

  using any_sender_of_void =
    any_sender_of<ex::set_value_t(int), ex::set_error_t(std::exception_ptr), ex::set_stopped_t()>;
  exec::static_thread_pool pool{4};
  auto sched = pool.get_scheduler();

  std::size_t const n = 10000;
  std::vector<any_sender_of_void> senders;
  senders.reserve(n);
  any_sender_of_void cur = ex::just(1);

  for (std::size_t i = 0; i < n; ++i) {
    auto split = std::move(cur) | ex::split();
    auto transfer1 = split | ex::transfer(sched)
                   | ex::then([](int x) { std::cerr << x << '\n'; return x; });
    auto transfer2 = std::move(split) | ex::transfer(sched) | ex::then([](int x) { return x + 1; });
    // Keep the branch so it can be waited on below.
    senders.emplace_back(std::move(transfer1));
    cur = any_sender_of_void(std::move(transfer2));
  }

  for (auto& s: senders) {
    // The last sync_wait ends up freeing a whole stack of n split op states,
    // even though they could be freed incrementally because the split op state
    // takes ownership of values sent from predecessors
    stdexec::sync_wait(std::move(s));
  }

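The recursion on release comes from the ownership structure: each operation state owns the operation state of its predecessor, so destroying the outermost one unwinds the whole chain with nested destructor calls. A stdexec-free illustration of the same pattern (hypothetical node type, for illustration only):

  #include <cstddef>
  #include <memory>

  // Each node owns its predecessor, just like each op state in the example
  // above owns the op state of the sender it was connected to.
  struct node {
    std::unique_ptr<node> predecessor;
  };

  std::unique_ptr<node> head;
  for (std::size_t i = 0; i < 10000; ++i) {
    auto next = std::make_unique<node>();
    next->predecessor = std::move(head);
    head = std::move(next);
  }
  // ~node() runs 10000 times, nested 10000 frames deep on the stack, just
  // like releasing the op states of o_n back to o_1 above.
  head.reset();
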
schedule_from/transfer

The same problem can be triggered with a simpler example; the previous example was distilled from a real use case, which is why it is a bit more involved.

schedule_from also does not release op_state2 as soon as it could (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2300r7.html#spec-execution.senders.adaptors.schedule_from), leading to deep recursion in the destructors of the operation states, even though deep recursion in the forward direction is avoided by scheduling new tasks.

  using any_sender_of_void =
    any_sender_of<ex::set_value_t(int), ex::set_error_t(std::exception_ptr), ex::set_stopped_t()>;
  exec::static_thread_pool pool{4};
  auto sched = pool.get_scheduler();

  std::size_t const n = 10000;
  any_sender_of_void cur = ex::just(1);

  for (std::size_t i = 0; i < n; ++i) {
    cur = any_sender_of_void(
      std::move(cur) | ex::transfer(sched) | ex::then([](int x) { return x + 1; }));
  }

  stdexec::sync_wait(std::move(cur));

let_value

This example shows (again, IMO) unintuitive lifetimes of the values kept alive by the let_value adaptor. One of the use cases of let_value is to keep something alive for the duration of an asynchronous operation returned by the callable passed to let_value. It does that, but it keeps the values alive until the operation state of let_value is released, which can be much later. The example below uses memory as the precious resource that is kept alive longer than necessary, but the same applies to anything else you might want to release as soon as possible (file handles, asynchronous locks, etc.).

One can work around this by manually releasing the values, but that requires knowing that the values are kept alive longer to start with.

  exec::static_thread_pool pool{4};
  auto sched = pool.get_scheduler();

  auto s = ex::schedule(sched) | ex::then([]() { return std::vector<int>(1000000, 42); })
         | ex::let_value([](auto& v) {
             return ex::just() | ex::then([&]() { /* do something with v */ });
             // Workaround: append another adaptor to the returned sender that
             // moves v into a local, releasing it early:
             //   ... | ex::then([&]() { auto tmp = std::move(v); });
           })
         | ex::then([]() {
             // We are not using v anymore here, but it stays alive in the
             // let_value op state
             std::this_thread::sleep_for(std::chrono::seconds(5));
           });
  // The large allocation is freed only when the sync_wait op state is freed
  stdexec::sync_wait(std::move(s));
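
As mentioned above, one can work around this manually: end the sender returned from the let_value callable with an extra then that moves the vector into a local, so the allocation is released as soon as the inner work is done rather than when the let_value op state is destroyed. A sketch of the workaround with the same setup as above (s2 is just an illustrative name):

  auto s2 = ex::schedule(sched) | ex::then([]() { return std::vector<int>(1000000, 42); })
          | ex::let_value([](auto& v) {
              return ex::just() | ex::then([&]() { /* do something with v */ })
                   // Move v into a local that is destroyed when this lambda
                   // returns, so the large allocation is released before the
                   // sleep below instead of with the let_value op state.
                   | ex::then([&]() { auto tmp = std::move(v); });
            })
          | ex::then([]() { std::this_thread::sleep_for(std::chrono::seconds(5)); });
  stdexec::sync_wait(std::move(s2));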