CesiumGS / cesium-native


Reduce network request gaps when loading tiles

csciguy8 opened this issue · comments

When instrumenting cesium-native code, I've recently discovered some opportunities to improve network performance by reducing some apparent "gaps" when loading tiles.

Background

Below is a simplified diagram of how a tile is loaded. A worker thread fetches the data needed for the tile, then processes it into a form usable by the native runtime that needs it (e.g. Unreal).
Load Gap - Diagram 1

We do this across multiple workers, in parallel, to achieve faster load times (configured with maximumSimultaneousTileLoads).
Here is an example of what multiple workers loading tiles could look like...
Load Gap - Diagram 2

While the workers are effectively busy 100% of the time, you may notice a gap between when a worker finishes downloading a tile and when it starts downloading the next one.
Load Gap - Diagram 3

Even though parallel fetches help fill gaps in network utilization, the network can still be underutilized.

In the previous example, we configured 4 workers. You might expect that 4 network requests would always be in flight, but that's not the case. Notice the period of inactivity in the middle of the load.
Load Gap - Diagram 4
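
To make the gap concrete, here is a minimal sketch (illustrative only, not the actual cesium-native code; the type and function names are made up) of the per-worker pattern the diagrams describe. Each worker runs the fetch and the processing back-to-back for one tile, so its network "slot" sits idle while it processes:

#include <string>
#include <vector>

// Hypothetical stand-ins for the real fetch and processing steps.
struct RawTileData {};
struct ParsedTile {};
RawTileData networkFetch(const std::string& /*url*/) { return {}; } // I/O-bound: mostly waiting
ParsedTile processTile(const RawTileData& /*data*/) { return {}; }  // CPU-bound: decode, parse

// Current shape (simplified): each worker owns a tile end-to-end. While
// processTile() runs, this worker has no request in flight; that is the gap
// between the "Network Fetch" blocks in the diagrams.
void tileWorker(const std::vector<std::string>& assignedUrls) {
  for (const std::string& url : assignedUrls) {
    RawTileData raw = networkFetch(url); // network busy for this worker
    ParsedTile tile = processTile(raw);  // network idle for this worker
    (void)tile;                          // hand the result to the runtime (e.g. Unreal)
  }
}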

Ideally, we would batch the network requests as tightly as possible, to maximize network throughput.

Here is an alternate scheme where network requests are batched together as tightly as possible, with the processing work queued to different threads.

image

Notice the period of network inactivity is gone and all workers are fetching for longer, more contiguous blocks of time. Also, processing work is more densely packed among the tile workers, which may open up more chances for memory cache hits or batching optimizations.
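
Here is a rough sketch of what that scheme could look like with cesium-native's existing async primitives (an assumption about shape only, not code from any branch; loadTilesBatched, processTileContent, and ParsedTile are hypothetical names, and it assumes IAssetAccessor::get, Future::thenInWorkerThread, and AsyncSystem::all behave as they do today). Every fetch is started up front so requests overlap, and the CPU-side work is chained onto the worker pool as each response arrives:

#include <CesiumAsync/AsyncSystem.h>
#include <CesiumAsync/Future.h>
#include <CesiumAsync/IAssetAccessor.h>
#include <CesiumAsync/IAssetRequest.h>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ParsedTile {};

// Placeholder for the real content parsing / post-processing step.
ParsedTile processTileContent(const std::shared_ptr<CesiumAsync::IAssetRequest>&) {
  return {};
}

CesiumAsync::Future<std::vector<ParsedTile>> loadTilesBatched(
    CesiumAsync::AsyncSystem asyncSystem, // cheap to copy; shared state inside
    const std::shared_ptr<CesiumAsync::IAssetAccessor>& pAssetAccessor,
    const std::vector<std::string>& urls) {
  std::vector<CesiumAsync::Future<ParsedTile>> inFlight;
  inFlight.reserve(urls.size());

  for (const std::string& url : urls) {
    // Issue the request immediately; nothing below blocks the next iteration.
    inFlight.emplace_back(
        pAssetAccessor->get(asyncSystem, url)
            .thenInWorkerThread(
                [](std::shared_ptr<CesiumAsync::IAssetRequest>&& pCompleted) {
                  // Runs on a worker thread as soon as this response arrives.
                  return processTileContent(pCompleted);
                }));
  }

  // Resolves once every tile in the batch has been fetched and processed.
  return asyncSystem.all(std::move(inFlight));
}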

Proposed work

  • Start the investigation at Tileset::_processWorkerThreadLoadQueue. This is where all potential tile work is known and where parallel work is throttled with maximumSimultaneousTileLoads.
  • Refactor TilesetContentManager::loadTileContent to separate the network fetch (CachingAssetAccessor::get) from the data post-processing work.
  • Queue network fetch work together, potentially reusing maximumSimultaneousTileLoads to configure our maximum parallel network fetches (see the sketch after this list).
  • Data post-processing work should execute as network fetch work completes. The best way to achieve this can be decided later, although the previous diagram hints at a separate pool of tile processing workers.
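
As a starting point, here is a hedged sketch of what the fetch throttle might look like (FetchThrottle and processTileContent are made-up names, and a real implementation would live closer to TilesetContentManager): keep up to maximumSimultaneousTileLoads requests in flight, and the moment one completes, start the next fetch while its post-processing is handed off to the worker pool.

#include <CesiumAsync/AsyncSystem.h>
#include <CesiumAsync/Future.h>
#include <CesiumAsync/IAssetAccessor.h>
#include <CesiumAsync/IAssetRequest.h>
#include <cstdint>
#include <deque>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ParsedTile {};
// Placeholder for the post-processing step (same idea as the earlier sketch).
ParsedTile processTileContent(const std::shared_ptr<CesiumAsync::IAssetRequest>&) {
  return {};
}

// Keeps up to `maximumSimultaneousTileLoads` fetches in flight. All queue
// bookkeeping happens in main-thread continuations, so no locking is needed as
// long as dispatchMainThreadTasks() is pumped from a single thread.
class FetchThrottle : public std::enable_shared_from_this<FetchThrottle> {
public:
  FetchThrottle(
      CesiumAsync::AsyncSystem asyncSystem,
      std::shared_ptr<CesiumAsync::IAssetAccessor> pAccessor,
      std::deque<std::string> urls)
      : _asyncSystem(std::move(asyncSystem)),
        _pAccessor(std::move(pAccessor)),
        _pending(std::move(urls)) {}

  // Prime the throttle; roughly what Tileset::_processWorkerThreadLoadQueue
  // could do with the tile work it already knows about.
  void start(int32_t maximumSimultaneousTileLoads) {
    for (int32_t i = 0; i < maximumSimultaneousTileLoads; ++i)
      this->startNext();
  }

private:
  void startNext() {
    if (this->_pending.empty())
      return;
    std::string url = std::move(this->_pending.front());
    this->_pending.pop_front();

    std::shared_ptr<FetchThrottle> self = this->shared_from_this();
    this->_results.emplace_back(
        this->_pAccessor->get(this->_asyncSystem, url)
            .thenInMainThread(
                [self](std::shared_ptr<CesiumAsync::IAssetRequest> pDone) {
                  // This network "slot" is free again: start the next fetch
                  // now, instead of waiting for the processing to finish.
                  self->startNext();
                  // Post-processing runs on the worker pool, off the fetch path.
                  return self->_asyncSystem.runInWorkerThread(
                      [pDone]() { return processTileContent(pDone); });
                }));
  }

  CesiumAsync::AsyncSystem _asyncSystem;
  std::shared_ptr<CesiumAsync::IAssetAccessor> _pAccessor;
  std::deque<std::string> _pending;
  std::vector<CesiumAsync::Future<ParsedTile>> _results; // awaitable via AsyncSystem::all
};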

Benefits

  • Reduced total loading time
  • More consistent peak network usage during a loading event
  • More predictable scaling of maximumSimultaneousTileLoads. This now corresponds directly to parallel network requests

Reference

This work hints at moving parts of our code towards a more "data parallel" perspective, where tile loading continues to be broken down into small parallelizable tasks, with an emphasis on batching and throughput.

This ticket is very similar, with more ideas related to short- vs. long-running tasks: #473

Here is data from the original investigation showing potential gaps in a Google 3D Tiles test (Chrysler Building, 828 tiles).
The highlighted row shows a tile that took 228 ms to complete, with a 26 ms gap where it was not fetching data from the network (gapUsecs).
Load Gap Analysis - Chrysler Release

Preliminary exploration is encouraging...

I have a branch that isolates the "content fetch" part of tile loading work and dispatches all of it together, as tightly as possible.

Testing with the Google Tiles test "LocaleChrysler" yields about a 15% reduction in total load time. Not all the coding is finished, so there might be more gains when it's all done.

https://github.com/CesiumGS/cesium-native/tree/network-work-refactor

From very quickly skimming over the PR, it looks like it might be related to some discussion in an older issue. Maybe that impression is wrong, but the linked comment specifically referred to the loadTileImage function, and I've seen a new function called getLoadTileImageWork in the changes. The linked comment also raised some questions about maximumSimultaneousTileLoads, which is also mentioned here, so... there seems to be some connection, at least.

Applied to my understanding of what is addressed in the PR, it looks like the "promise chain" of

loadTileImage(...) {
  mapRasterTilesToGeometryTile(...) {
    getQuadtreeTile(...) {
      loadQuadtreeTileImage(...) {
        loadTileImageFromUrl(...) {
          this->getAssetAccessor()->requestAsset(...)
        }.then...
      }.then...
    }.then...
  }.then...
}

that was quoted in the linked comment is broken into something like

// One queue + worker pool for network fetches, another for CPU-side processing.
BlockingQueue<Runnable> networkTasks = new LinkedBlockingQueue<>();
BlockingQueue<Runnable> processTasks = new LinkedBlockingQueue<>();
ExecutorService networkWorkers =
    new ThreadPoolExecutor(8, 8, 0L, TimeUnit.MILLISECONDS, networkTasks);
ExecutorService processWorkers =
    new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS, processTasks);

void request(String uri) {
  // Submitting through the executor (rather than queue.add) ensures the
  // worker threads actually get started.
  networkWorkers.submit(() -> {
    Data data = networkFetch(uri);          // mostly waiting on I/O
    processWorkers.submit(() -> {
      ProcessedData processedData = process(data);
      sendToRenderer(processedData);
    });
  });
}

Or in a less pseudocode-y way: There is one queue for "network tasks" and one for "processing tasks". Each of them is worked off by a worker thread pool. Whenever a "network task" is done, the resulting data is thrown into the list of "processing tasks". Both worker pools are busy when there is something to do, and idle when not.

If this is roughly correct, then I'll sneak a 👍 in here.

Some of the diagrams above might be a bit misleading: These orange 'Network Fetch' blocks suggest that the workers are 'busy', but they actually are not. Most of the time, they are doing nothing except for waiting for a network response. So ... it shouldn't really matter whether the number of "networkWorkers" is 2*numCpus or 10*numCpus, because they are not really doing any work, but are only intended for hiding latencies. (In contrast to that, the "processWorkers" are for parallel execution, and these will keep the CPUs busy...)
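 
That distinction can be stated as a trivial sketch (made-up numbers, not anything cesium-native does today): the processing pool is what should track the core count, while the number of concurrent network requests is just a cap on in-flight I/O and can be tuned independently of it.

#include <algorithm>
#include <thread>

// CPU-bound pool: roughly one worker per core keeps the CPUs busy.
unsigned int processingWorkerCount() {
  return std::max(1u, std::thread::hardware_concurrency());
}

// I/O-bound side: just a cap on simultaneous requests, tuned for bandwidth and
// latency rather than core count (24 is an arbitrary example value).
unsigned int maxInFlightRequests() {
  return 24;
}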

There is one queue for "network tasks" and one for "processing tasks". Each of them is worked off by a worker thread pool. Whenever a "network task" is done, the resulting data is thrown into the list of "processing tasks". Both worker pools are busy when there is something to do, and idle when not. If this is roughly correct, then I'll sneak a 👍 in here.

Correct! and please do.

You're right, that diagram can be misleading. Really, the intent was to show the missed opportunity where a network request could be in flight, but was not. As far as actual CPU efficiency goes, yes, a network request shouldn't do much work at all, much less create its own thread. In Unreal Engine, all requests get queued up and are polled by one thread anyway; the callers are just waiting for completion events.

I'm likely thinking about the problem in a similar way to your linked discussion...

It does not free us from the burden of thinking about the granularity of tasks. And it's probably not a good thing to "bake" a certain granularity into the architecture

Most of the work in this PR is separating CachingAssetAccessor::get from the rest of the chained logic.

Basically, the code today says:
"we have one work item, let's do A, then B, then C..."
and I'm trying to move it to:
"how many A work items do we have? Let's do them all together first" (roughly as sketched below).
It doesn't do this fully, but enough to "reduce request gaps" and see a ~18% reduction in load time in my Unreal tests.
(27% reduction if I bump maximumSimultaneousTileLoads to 24)
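
Here is that restructuring as a bare-bones sketch (doA/doB/doC are placeholders for the fetch and the downstream processing steps, not real functions in the PR):

#include <vector>

struct WorkItem {};
void doA(WorkItem&) {} // network fetch (placeholder)
void doB(WorkItem&) {} // parse/decode (placeholder)
void doC(WorkItem&) {} // prepare for the runtime (placeholder)

// Before: each item runs its whole A -> B -> C chain before the next starts.
void perItemChains(std::vector<WorkItem>& items) {
  for (WorkItem& item : items) {
    doA(item);
    doB(item);
    doC(item);
  }
}

// After: issue all the A work (fetches) together first, as tightly as
// possible, then let B and C follow as results come in.
void batchedPhases(std::vector<WorkItem>& items) {
  for (WorkItem& item : items)
    doA(item);
  for (WorkItem& item : items) {
    doB(item);
    doC(item);
  }
}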

Suffice it to say, even if this PR seems like a great idea and gets merged, there's still more work we can do here.