tensorflow / minigo

An open-source implementation of the AlphaGoZero algorithm

Parallelism and concurrency overview

marcinbogdanski opened this issue

Hi

I'm reading further through the source code and wanted to check that my understanding of the concurrency/parallelism implementation is correct.

Inference options:

  • FakeDualNet - returns fixed policy/value
  • LiteDualNet - TFLite integer inference; why? Faster CPU inference? Playing against Minigo on mobile?
  • RandomDualNet - returns random policy/value
  • TFDualNet - standard TensorFlow CPU/GPU inference
  • TPUDualNet - TPU inference
  • WaitingModel - for testing
  • ModelBatcher - requests asynchronous mini-batch inference from a BufferedModel
  • BufferedModel - runs in its own thread and performs inference, possibly combining mini-batches from multiple ModelBatchers

Concurrency/parallelism sources:

  • distributed execution - Kubernetes orchestrates multi-machine cluster execution
  • parallel self-play on a single machine - Selfplayer manages multiple threads
  • within a single thread:
    • concurrent execution of multiple games: select-leaves (across games), inference (batched), update-trees (across games)
    • virtual losses within a single tree, as managed by SelfplayGame
    • the number of game states in a single mini-batch is virtual_losses * concurrent_games_per_thread
    • mini-batches can be evaluated synchronously or asynchronously (via ModelBatcher/BufferedModel); see the sketch after this list
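
To make sure I'm reading the per-thread loop right, here is a minimal sketch of how I picture it; all names (Game, Leaf, Output, RunBatchedInference) are mine, not Minigo's, and the loop body is simplified:

```cpp
#include <cstddef>
#include <vector>

struct Leaf {};    // features of one position to evaluate
struct Output {};  // policy + value for one position

struct Game {
  // Select up to `virtual_losses` leaves from this game's tree (CPU work).
  std::vector<Leaf> SelectLeaves(int virtual_losses) {
    return std::vector<Leaf>(virtual_losses);
  }
  // Back the inference results up into the tree (CPU work).
  void UpdateTree(const std::vector<Output>& /*outputs*/) {}
};

// Stand-in for a synchronous call into TFDualNet / TPUDualNet / etc.
std::vector<Output> RunBatchedInference(const std::vector<Leaf>& batch) {
  return std::vector<Output>(batch.size());
}

// One iteration of a selfplay thread: the inference batch holds roughly
// concurrent_games_per_thread * virtual_losses positions.
void SelfplayStep(std::vector<Game>& games, int virtual_losses) {
  std::vector<Leaf> batch;
  std::vector<std::size_t> counts;
  // 1) Select leaves across all concurrent games.
  for (Game& g : games) {
    std::vector<Leaf> leaves = g.SelectLeaves(virtual_losses);
    counts.push_back(leaves.size());
    batch.insert(batch.end(), leaves.begin(), leaves.end());
  }
  // 2) One batched inference call (GPU/TPU work).
  std::vector<Output> outputs = RunBatchedInference(batch);
  // 3) Update each game's tree with its slice of the outputs.
  std::size_t offset = 0;
  for (std::size_t i = 0; i < games.size(); ++i) {
    std::vector<Output> slice(outputs.begin() + offset,
                              outputs.begin() + offset + counts[i]);
    games[i].UpdateTree(slice);
    offset += counts[i];
  }
}
```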

My guesses at the reasons for the above architecture are as follows:

  • distributed - obviously for scalability
  • parallel self-play:
    • in principle, allow parallel execution on CPU/GPU?
    • while some CPU threads are blocked waiting for the GPU, others can generate the next mini-batch?
  • concurrent games in a single thread:
    • so the number of threads doesn't get out of hand when generating large mini-batches for the GPU?
  • virtual-losses
    • required for ranked tournament games (only one game is played, so the other forms of parallelism don't apply)?

Is the above correct? In particular, are there any other reasons for this setup that I'm missing?

Also, I don't see any way to execute a single tree search in parallel for a tournament game (e.g. against a human champion)?

Also, I will possibly have more questions; what would be the preferred communication channel for things that aren't really issues? Should I keep creating GitHub issues?

Thanks again for your time

re: the second part, happy to have these documented on github and available for posterity :) Feel free to keep opening issues!

Your understanding is spot on.

ModelBatcher and BufferedModel are deprecated and will be deleted once I find the time to rewrite cc/eval.cc.

With regards to threading, inference is always performed synchronously from the point of view of the selfplay thread. Because we run Minigo selfplay at scale by playing multiple games concurrently, we don't need asynchronous inference to achieve good GPU utilization. Even on very small models, the engine runs at >95% utilization on a v100, and close to 100% for a full sized model.

It has never been the goal of the project to write the fastest tournament engine (I doubt we'll ever enter one), so as you guessed this means the engine is somewhat slower when playing a single game than something like Leela.

What is the ratio of the number of concurrent_selfplay executables to the number of GPUs in the system? Is it 1:1 during normal execution?

Just to confirm my understanding (sorry for the silly questions): assuming the ratio is 1:1, and considering the synchronous CPU/GPU execution, after GPU compute finishes there is a small gap in GPU utilisation while the CPU does its thing to advance the games and prepare the next batch. But because the neural network in Go is fairly large (GPU compute takes a long time) and the game logic is comparatively quick, the GPU utilisation gap is small and thus a non-issue?

Thanks!

Gentlemen, could you confirm my further analysis of concurrent_selfplay is correct?

  • Selfplayer: always a single instance, which:
    • in its constructor creates a ShardedExecutor with a thread pool, which will be used later
    • in Selfplayer::Run() (run on the executable's main thread):
      • creates the inference cache
      • creates the abort file watcher
      • creates a bunch of Models in InitializeModels, but does not spin up new threads (even for BatchingModel/ModelBatcher/BufferedModel)
      • creates multiple SelfplayThreads
      • creates a single OutputThread

Then, in each SelfplayThread::Run:

  • executes multiple tree searches in parallel via the ShardedExecutor created earlier
  • executes GPU inference synchronously via one of the model types (TF, TPU, etc.)

As for parameters in ml_perf/flags/19/selfplay.flags:

  • selfplay_threads=3 - number of game threads; a thread blocks while its GPU inference runs
  • parallel_search=4 - number of threads in the ShardedExecutor used to execute tree searches in parallel
  • parallel_inference=2 - number of model instances, i.e. the maximum number of GPU inferences that can run simultaneously; too high a value would cause OOM on the GPU?
  • concurrent_games_per_thread=32 - number of separate games per thread
  • virtual_losses=4 - number of tree leaves to evaluate per iteration
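
With these values, each inference batch produced by one selfplay thread should then hold concurrent_games_per_thread * virtual_losses = 32 * 4 = 128 positions, if my earlier reading is right.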

Ahh, I think I see it now. As long as selfplay_threads > parallel_inference (and the CPU is beefy enough) there should be no "GPU gaps", because as soon as one inference finishes, another selfplay thread can "hop in" immediately, as long as it has a batch ready. In the case above there is one extra selfplay thread (3 > 2) which will keep running on the CPU even when both GPU model instances are occupied.

Presumably parallel_search=4 is driven by the number of CPU cores on the system, where 4 per GPU seems about usual.

Does the above seem right?

Your analysis of the threading parameters is correct.

Their values were all chosen to get >95% utilization on the VM that I'm using to test the MLperf benchmark: it has 48 physical cores (96 hyperthreads) running at 2GHz and 8 v100 GPUs. Optimal values will be different depending on the relative performance of your CPUs and GPUs.

parallel_inference=2 is used to double-buffer the inference requests so that one thread can prepare the GPU commands while the other is actually executing them. On some setups, parallel_inference=3 is also a viable choice because inference happens in three stages: transfer the feature tensor to GPU, evaluate the model, transfer the output tensor back to CPU.
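
A rough sketch of that idea, purely illustrative and not Minigo's actual code: keep parallel_inference independent model instances in a small pool, so that while one instance is executing on the GPU, another request can already be staged by a different selfplay thread.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct ModelInstance {
  void RunInference() { /* stage features, Session::Run, read back outputs */ }
};

// A tiny pool of model instances; Acquire() blocks while all instances are busy.
class ModelPool {
 public:
  explicit ModelPool(int n) : models_(n) {
    for (ModelInstance& m : models_) free_.push_back(&m);
  }

  ModelInstance* Acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !free_.empty(); });
    ModelInstance* m = free_.back();
    free_.pop_back();
    return m;
  }

  void Release(ModelInstance* m) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      free_.push_back(m);
    }
    cv_.notify_one();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<ModelInstance> models_;
  std::vector<ModelInstance*> free_;
};

int main() {
  ModelPool pool(2);  // parallel_inference=2
  std::vector<std::thread> selfplay_threads;
  for (int i = 0; i < 3; ++i) {  // selfplay_threads=3
    selfplay_threads.emplace_back([&pool] {
      ModelInstance* m = pool.Acquire();  // waits if both instances are busy
      m->RunInference();                  // overlaps with the other instance
      pool.Release(m);
    });
  }
  for (std::thread& t : selfplay_threads) t.join();
}
```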

parallel_search=4 is set so that the tree search doesn't take too long relative to inference: the MLperf model is about 50x smaller than the full size Minigo model. In a full run, we don't need to be so careful tuning these parameters.

If you're interested in how these parameters affect performance, I recommend reading the profiling section of the MLperf docs, which describes how to get CPU traces of the selfplay code. Enable tracing as described in the doc, run selfplay for about 30 seconds, kill the process, and you should have a WTF trace that you can view.

Hi

By "double-buffer the inference", do you mean simply running multiple independent models on same GPU (with obvious memory penalty)? Or is there something more going on? Assuming it's just independent models, is there any explicit mechanism to make sure models execute non-overlapping stages of GPU pipeline (transfer, evaluate, transfer back)? I kind of expect simply running 2-3 models per GPU would sort itself out on it's own in this use case, but just want to confirm.

Also, sorry for repeating myself, but could you explicitly confirm that it's 1x concurrent_selfplay per GPU? While it seems pretty obvious by now (especially since I found a hard-coded gpu:0 somewhere), I'm still new to the code base and I don't want this important detail to be lost in translation.

I will definitely look into profiling after I manage to set up a sacrificial CUDA dev box for compilation purposes.

Thanks again for all the help, this was super useful! I think this wraps up my questions for now!

Yes, it's 1x concurrent_selfplay per GPU.

As for the double-buffering, well now we're getting to the interesting part. We currently create a new tensorflow::Session for every instance of the model. It's not sufficient to simply run multiple threads, each performing inference on its own instance: what tends to happen is that the TensorFlow framework ends up executing the Session::Run calls in lock-step, so if you're running N threads, every Session::Run call starts at the same time and they all take Nx longer to complete than a single call.

This is where the parallel_search flag comes into play. The fact that all threads share a global thread pool forces their calls to Session::Run to be staggered, which causes the TensorFlow framework to pipeline their execution correctly.
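
Here's a toy illustration of the staggering effect (not Minigo code; the shared executor is reduced to a plain mutex that serializes each thread's CPU phase): because the threads take turns doing their tree search, they reach their overlappable inference phase at offset times.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::mutex shared_executor;  // crude stand-in for the shared ShardedExecutor

void SelfplayThread(int id) {
  const auto t0 = std::chrono::steady_clock::now();
  for (int step = 0; step < 3; ++step) {
    {
      // Tree search: runs on the shared executor, so the threads take turns.
      std::lock_guard<std::mutex> lock(shared_executor);
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    // Inference (Session::Run): can overlap across threads.
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::steady_clock::now() - t0).count();
    std::printf("thread %d starts inference at t=%lld ms\n", id,
                static_cast<long long>(ms));
    std::this_thread::sleep_for(std::chrono::milliseconds(30));
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int id = 0; id < 3; ++id) threads.emplace_back(SelfplayThread, id);
  for (std::thread& t : threads) t.join();
}
```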

Here's an old trace I found that illustrates this (it's from an experiment running on TPU with slightly different flags --selfplay_threads=4 --parallel_search=4 --parallel_inference=3, but you should get the idea):
[selfplay_trace image]

Note that the SelectLeaf calls are scheduled more efficiently in the current master branch, so if you generate a trace yourself it will look a bit different.

Now, it's possible that it would be more efficient to have all model instances loaded from the same file share the same TensorFlow session but that's just one more entry on my list of things I haven't had time to try out :)

Aha!

I had a sneaking suspicion that sharing the ShardedExecutor between selfplay threads was not an accident; now we know why!

Off topic: what do you think about a slightly different approach: having multiple game threads push eval requests to an async queue, and then having neural-net thread(s) pick them up to form batches and execute them on the GPU? Basically implementing a multiple-producer/multiple-consumer pattern (see the sketch below). It seems to me that Minigo's approach of having multiple concurrent games per thread has the strong benefit of keeping the total number of threads low. What's your opinion on the other possible pros/cons of the two approaches?
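
Something like this is what I have in mind; a purely hypothetical sketch, with all names mine:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct EvalRequest {
  // Features for one position, plus some way to hand the result back to the
  // game thread (e.g. a std::promise); omitted here.
};

// Multiple game threads Push() requests; inference thread(s) PopBatch() them.
class EvalQueue {
 public:
  void Push(EvalRequest req) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(req));
    }
    cv_.notify_one();
  }

  // Pops between 1 and max_batch requests, blocking until at least one exists.
  std::vector<EvalRequest> PopBatch(std::size_t max_batch) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    std::vector<EvalRequest> batch;
    while (!queue_.empty() && batch.size() < max_batch) {
      batch.push_back(std::move(queue_.front()));
      queue_.pop_front();
    }
    return batch;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<EvalRequest> queue_;
};

// Game threads:      queue.Push(request);
// Inference threads: std::vector<EvalRequest> batch = queue.PopBatch(128);
//                    // ...run the batch on the GPU, then hand results back.
```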

The threading model you describe is actually how Minigo selfplay used to be set up: we ran one selfplay thread for each game, and their inference requests were batched up and executed on separate inference threads. This was absolutely fine for the full sized Minigo run; we'd have maybe 8 games playing in parallel on a VM with 48 physical cores.

However, the model used for MLPerf is much smaller and we had to run significantly more selfplay threads than there were CPU cores to generate enough work for the GPU. This resulted in a large context switching overhead and reduced maximum GPU utilization.

The large number of threads and the context switching overhead also made profiling the CPU code difficult. Once we switched to the current threading model, the simpler CPU traces showed that there were some surprising hotspots in the code (e.g. calling argmax to select which node to visit during search). Here are some functions, found directly as a result of the simpler architecture, that were optimized for up to 5x performance improvements:

```cpp
int ArgMaxSse(absl::Span<const float> span);
void MctsNode::CalculateChildActionScoreSse(PaddedSpan<float> result) const;
MG_ALWAYS_INLINE static void SetNchw(const ModelInput& input, uint8_t* dst);
```
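
For a flavor of what this kind of SSE optimization looks like, here's a rough, illustrative sketch of an SSE-accelerated argmax over a float array; it is not the actual Minigo implementation:

```cpp
#include <immintrin.h>

#include <algorithm>
#include <cstddef>

// Illustrative SSE argmax sketch. Returns -1 for an empty input.
int ArgMaxSseSketch(const float* data, std::size_t n) {
  if (n == 0) return -1;
  std::size_t best = 0;
  float best_val = data[0];
  std::size_t i = 0;
  if (n >= 4) {
    // Keep a running max across 4 lanes at a time.
    __m128 vmax = _mm_loadu_ps(data);
    for (i = 4; i + 4 <= n; i += 4) {
      vmax = _mm_max_ps(vmax, _mm_loadu_ps(data + i));
    }
    // Horizontal max of the 4 lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, vmax);
    best_val = *std::max_element(lanes, lanes + 4);
    // Second pass to recover the index of the first occurrence of the max.
    for (std::size_t j = 0; j < i; ++j) {
      if (data[j] == best_val) {
        best = j;
        break;
      }
    }
  }
  // Scalar tail for the remaining (n % 4) elements.
  for (; i < n; ++i) {
    if (data[i] > best_val) {
      best_val = data[i];
      best = i;
    }
  }
  return static_cast<int>(best);
}
```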

We also found that it was measurably faster to have the tree search thread call Session::Run directly, rather than have inference and tree search run on separate threads. This was most likely because the cache of the CPU core running the inference thread wouldn't have the tree search data in it. I only ever profiled this on TPU so I don't know if this finding also applies to GPU TensorFlow.

We also found that it was measurably faster to have the tree search thread call Session::Run directly, rather than have inference and tree search run on separate threads.

But if I understand correctly, in your current model the tree search thread offloads leaf selection to the thread pool in the ShardedExecutor anyway, so in a sense they do run on separate threads. Yeah, it's interesting what is actually going on and how it would work on GPUs.

Yep, optimizing a multithreaded system is hard :)

The original implementation of the ShardedExecutor was careful to always schedule the same games to the same thread for exactly this reason. However, that led to an imbalance of work across the threads, because there's a large variation in the number of nodes tree search visits during a game. You can see this in the trace above, where different SelectLeaf blocks take different amounts of time. Also note that the SelectLeaf block that runs on the selfplay thread is normally the fastest, because it gets better cache utilization.

It turned out to be a net win for the SelectLeaf threads to share an atomic counter into the game array and pop the next available game to run tree search on. Since tree search for a game runs on an arbitrary thread each time, the individual SelectLeaf calls are slower, but the work is better distributed across the threads and so ends up taking less time.
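
In sketch form, the scheme is roughly this (illustrative code, not the actual implementation; `next` is reset to zero before each round of SelectLeaf calls):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct Game {
  void SelectLeaves() { /* run tree search for this game */ }
};

// Each SelectLeaf worker atomically claims the next unprocessed game, so the
// work is balanced across threads even though per-game search times vary.
void SelectLeafWorker(std::vector<Game>& games,
                      std::atomic<std::size_t>& next) {
  for (;;) {
    std::size_t i = next.fetch_add(1);
    if (i >= games.size()) break;
    games[i].SelectLeaves();
  }
}
```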

Quick note: if you found the hard-coded gpu:0 where I think you did, it's because we isolate the selfplay jobs on GPUs using the CUDA_VISIBLE_DEVICES environment variable.

@tommadams I think what you say makes sense, but I need to think more about the implications.

@amj Yeah, that's exactly what I thought. The gpu:0 was a major clue :)

I'd caution against reading too much into what I wrote, these are specific optimizations I made for our architecture and hardware setup. Bottlenecks will vary based on model size, CPU & GPU compute speed, board size, code architecture, etc.

The most important take away should be: make sure it's easy to profile your code :)