CesiumGS / cesium-native

Extend AsyncSystem with support for throttling groups, prioritization, and cancelation

kring opened this issue · comments

Motivation

AsyncSystem::runInWorkerThread (as the name implies) runs a task in a worker thread and returns a Future that resolves when it completes. If this work is CPU bound - as it usually is - it's not desirable to run too many such tasks simultaneously, because the overhead of task switching will become high and all of the tasks will complete slowly.

In practice, though, runInWorkerThread is dispatched via some sort of thread pool. In the case of Cesium for Unreal, these tasks are dispatched to Unreal Engine's task graph, which is similar. The thread pool limits the number of tasks that run simultaneously; any extra tasks are added to a queue. As each task completes, the next task in the queue is dispatched.

This scheme is strictly first in, first out (FIFO). Once runInWorkerThread (or similarly, thenInWorkerThread) is called, the task will run eventually (process exit notwithstanding), and multiple tasks dispatched this way will start in the order in which these methods were called. There is no possibility of canceling or reprioritizing a task that hasn't started yet.
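
Concretely, two tasks dispatched through the current API behave like this (the work lambdas are just placeholders):

AsyncSystem asyncSystem = ...;

// Both tasks are enqueued immediately and will start in FIFO order.
Future<void> first = asyncSystem.runInWorkerThread([]() {
  // ... CPU-bound work ...
});
Future<void> second = asyncSystem.runInWorkerThread([]() {
  // ... more CPU-bound work ...
});

// Nothing here lets us cancel `second` or move it ahead of `first`;
// once dispatched, it will run eventually.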

On top of this, not all tasks are CPU bound. Cesium Native also needs to do HTTP requests. Network bandwidth, like CPU time, is a limited resource; attempting to do a very large number of network requests simultaneously is inefficient. While thread pools allow AsyncSystem to use CPU time efficiently, there is no similar mechanism for HTTP requests, GPU time, or any other type of limited resource.

These are pretty big limitations when it comes to complicated asynchronous processes like loading 3D Tiles content. To load tile content, we need to:

  1. Do an HTTP GET for the tile content. But we don't want to do too many at once or performance will suffer.
  2. Parse the downloaded tile content and perform various CPU-intensive operations (image decoding, mesh decompression, creating physics meshes, generating normals, etc.) on it to prepare it for rendering. We don't want to do too many of these at once or we'll monopolize CPU cores or game engine task graph time.
  3. If the parsed content contains references to external content (such as a glTF external buffer or image), we may need to do further network requests, followed by more CPU work.
  4. On the next frame, we may learn that this tile is now more or less important than it was last frame. Or maybe this tile isn't needed at all anymore and any further work should be canceled (for now).

We have ad-hoc ways of doing some approximation of this. Currently, there is a "number of simultaneous tile loads" per tileset. A tile that is doing any kind of loading - whether network or CPU - counts against this limit. This is inefficient in terms of both network and CPU utilization, as described in #473. We also can't cancel or reprioritize tile loads once they're started, as described in #564.

Proposal

This part is a work in progress! I don't think I have all the details right yet.

AsyncSystem should make this sort of thing easy. First, we define a throttling group:

class ThrottlingGroup {
public:
  ThrottlingGroup(
      const AsyncSystem& asyncSystem,
      int32_t numberOfSimultaneousTasks);
};

We'll have a ThrottlingGroup instance for network requests, and another instance for CPU-bound background work.

We also define a TaskController class that is used to cancel and prioritize an async "task", which is essentially a chain of Future continuations:

class TaskController {
public:
  TaskController(PriorityGroup initialPriorityGroup, float initialPriorityRank);

  void cancel();
  
  PriorityGroup getPriorityGroup() const;
  void setPriorityGroup(PriorityGroup value);

  float getPriorityRank() const;
  void setPriorityRank(float value);
};
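
With such a controller in hand, per-frame reprioritization and cancelation (step 4 of the list above) might look like the following sketch. The Tile fields, updateTileTask, and PriorityGroup::High are hypothetical; only TaskController's methods come from the declaration above:

// Called during each frame's tileset traversal (hypothetical).
void updateTileTask(Tile& tile) {
  IntrusivePointer<TaskController> pController = tile.pLoadController;

  if (!tile.isNeededThisFrame()) {
    // The tile left the view; skip any work that hasn't started yet.
    pController->cancel();
  } else {
    // Still needed, but its importance may have changed since last frame.
    pController->setPriorityGroup(
        tile.isVisible() ? PriorityGroup::High : PriorityGroup::Normal);
    pController->setPriorityRank(tile.distanceFromCamera());
  }
}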

The idea is that we can then write code like this:

AsyncSystem asyncSystem = ...;
std::shared_ptr<IAssetAccessor> pAssetAccessor = ...;

IntrusivePointer<ThrottlingGroup> pNetworkRequests =
  new ThrottlingGroup(asyncSystem, 20);
IntrusivePointer<ThrottlingGroup> pCpuProcessing =
  new ThrottlingGroup(asyncSystem, 10);

IntrusivePointer<TaskController> pController =
  new TaskController(PriorityGroup::Normal, 1.0f);

AsyncSystem taskSystem = asyncSystem.withController(pController);

pAssetAccessor
    ->get(
        taskSystem,
        pNetworkRequests,
        "https://example.com/whatever.json",
        {})
    .beginThrottle(pCpuProcessing)
    .thenInWorkerThread([asyncSystem, taskSystem, pNetworkRequests, pAssetAccessor](
                            std::shared_ptr<IAssetRequest>&& pRequest) {
      if (doSomeCpuWorkOnResponse(pRequest->response()->data())) {
        return pAssetAccessor
            ->get(
                taskSystem,
                pNetworkRequests,
                "https://example.com/image.jpg",
                {})
            .thenInWorkerThread(
                [](std::shared_ptr<IAssetRequest>&& pRequest) {
                  doSomeMoreCpuWork(pRequest->response()->data());
                });
      }
      return asyncSystem.createResolvedFuture();
    })
    .endThrottle();

AsyncSystem::withController specializes the AsyncSystem for a given task. It allows the continuations created within it to be prioritized and canceled as a group.
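
Because the controller is attached to the AsyncSystem itself rather than to a single Future, every chain started through taskSystem shares it. A sketch, assuming (the proposal doesn't pin this down) that cancel() simply skips continuations that haven't started yet:

AsyncSystem taskSystem = asyncSystem.withController(pController);

// Two independent chains governed by the same controller.
taskSystem.runInWorkerThread([]() { /* one piece of the task */ });
taskSystem.runInWorkerThread([]() { /* another piece */ });

// Later, e.g. when the tile is no longer needed:
pController->cancel(); // not-yet-started continuations are skipped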

beginThrottle returns a Future that resolves when the task should start. This may not happen right away if too many other tasks are already in progress within the throttling group. When the continuation chain reaches endThrottle, the throttled portion of the task is complete, and other tasks waiting in the same throttling group may start, beginning with the one that currently has the highest priority.
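
Internally, a ThrottlingGroup could be little more than a counter plus a priority queue of waiting tasks. This is a sketch of plausible internals, not part of the proposed public API; the Waiter struct and member names are invented:

class ThrottlingGroup {
public:
  // Resolves the returned Future when a slot is free and this task is
  // the highest-priority waiter. Called by beginThrottle.
  Future<void> enter(const IntrusivePointer<TaskController>& pController);

private:
  struct Waiter {
    IntrusivePointer<TaskController> pController;
    Promise<void> promise;
  };

  // Called when a running task reaches endThrottle: pick the waiter whose
  // controller currently reports the highest priority and resolve it.
  void onTaskCompleted();

  int32_t _numberOfSimultaneousTasks;
  int32_t _numberRunning = 0;
  std::vector<Waiter> _waiters;
};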

In this example, we do a network request, then throttled processing of the response in a worker thread. Depending on the result of some function call, we may need to do another network request, followed by more CPU work.

The overload of IAssetAccessor::get that takes a ThrottlingGroup looks like this:

Future<std::shared_ptr<IAssetRequest>> get(
      const CesiumAsync::AsyncSystem& asyncSystem,
      const IntrusivePointer<ThrottlingGroup>& pThrottlingGroup,
      const std::string& url,
      const std::vector<THeader>& headers) {
  // Keep this accessor alive until the continuation runs. This assumes
  // IAssetAccessor derives from std::enable_shared_from_this; constructing
  // a shared_ptr directly from `this` would double-delete.
  std::shared_ptr<IAssetAccessor> pThis = this->shared_from_this();
  return asyncSystem
      .beginThrottle(pThrottlingGroup)
      .thenImmediately([pThis, asyncSystem, url, headers]() {
        return pThis->get(asyncSystem, url, headers);
      })
      .endThrottle();
}

So the network requests happen in one throttling group, while the CPU processing happens in another. When a continuation chain reaches a beginThrottle, the task exits the current throttling group (if any), and enters the new one. When the continuation chain reaches the endThrottle, the previous throttling group is re-entered.
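
For instance, a chain that explicitly nests the two groups from the earlier example (doCpuWork and doMoreCpuWork are hypothetical placeholders) would flow like this:

asyncSystem
    .beginThrottle(pCpuProcessing)   // enter the CPU group
    .thenInWorkerThread([]() { doCpuWork(); })
    .beginThrottle(pNetworkRequests) // leave the CPU group, enter the network group
    .thenImmediately([=]() {
      return pAssetAccessor->get(asyncSystem, "https://example.com/extra.bin", {});
    })
    .endThrottle()                   // leave the network group, re-enter the CPU group
    .thenInWorkerThread([](std::shared_ptr<IAssetRequest>&& pRequest) {
      doMoreCpuWork(pRequest->response()->data());
    })
    .endThrottle();                  // finally, leave the CPU group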

This idea seems sound, and my first reaction is "Why not?".

If your code is arranged in such a way as to take advantage of throttling groups, then go for it. But that "if" is really my only criticism. If this idea is built out, will it be useful?

From what I've learned in this PR building a staged loading pipeline, the bulk of the work was refactoring. Code needs to be structured in such a way that it can be throttled. The actual "throttling" part wasn't all that sophisticated. For example, this function throttles a pending queue of content requests to the AssetAccessor::get call. Fairly small, easy to understand.
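
For a sense of scale, a pending-queue throttle of that sort can fit in a dozen lines. This is a hypothetical sketch, not the actual function from the PR; all member names are invented:

// Hypothetical: dispatch queued requests up to a fixed limit,
// refilling as each one completes.
void dispatchPendingRequests() {
  while (this->_numberOfRequestsInFlight < this->_maximumSimultaneousRequests &&
         !this->_pendingQueue.empty()) {
    PendingRequest pending = std::move(this->_pendingQueue.front());
    this->_pendingQueue.pop_front();

    ++this->_numberOfRequestsInFlight;
    this->_pAssetAccessor->get(this->_asyncSystem, pending.url, pending.headers)
        .thenInMainThread([this, pending](std::shared_ptr<IAssetRequest>&& pRequest) {
          --this->_numberOfRequestsInFlight;
          pending.process(std::move(pRequest));
          // A slot opened up; see if anything else is waiting.
          this->dispatchPendingRequests();
        });
  }
}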

Would it be worth refactoring that to use ThrottlingGroups? I'm not sure. Would it be useful for someone else? Maybe.