LTLA / umappp

UMAP C++ implementation

Experiment with improving the parallelization of the layout optimizer

LTLA opened this issue

It may be possible to improve the parallelizability of optimize_layout without changing the results from the single-threaded case. The idea is to precompute a "graph" of dependencies between observations and spin off threads whenever we encounter a run of observations that don't depend on each other.

  1. Iterate across observations within each epoch as usual. At each observation, find its neighbors and also hit the RNG to get the negative samples. The combined set of neighbors and negative samples defines the "dependencies".
  2. Check whether the dependencies overlap with observations already queued in other threads.
    • If they overlap with exactly one thread's queue, add the observation to that thread's task queue.
    • If they overlap with multiple threads' queues, close the task queues for all threads and put this observation on hold.
    • If there is no overlap, add the observation to the task queue of the thread with the shortest task queue.
  3. If all task queues are closed, actually execute the compute by spinning up one thread per non-empty task queue.
  4. Restart this process with the on-hold observation, or finish if no observation was held.
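To make the bookkeeping in steps 2-3 concrete, here is a minimal sketch of the queue-building logic. All of the names (TaskQueues, add, and so on) are hypothetical; this is not umappp code, just an illustration of the proposed scheme.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch only: one task queue per thread, plus a record of which
// thread currently "owns" each observation's coordinates.
struct TaskQueues {
    std::vector<std::vector<std::size_t>> queues; // one queue per thread
    std::vector<int> owner;                       // owner[o] = thread that will touch observation o, or -1
    bool closed = false;

    TaskQueues(std::size_t nthreads, std::size_t nobs) : queues(nthreads), owner(nobs, -1) {}

    // Try to queue an observation given its dependencies (its neighbors and negative
    // samples; the observation itself should be included in this set). Returns false
    // if the dependencies span multiple threads, in which case all queues are closed
    // and the observation is put on hold, as in steps 2-4 above.
    bool add(std::size_t obs, const std::vector<std::size_t>& dependencies) {
        int target = -1;
        for (auto d : dependencies) {
            int t = owner[d];
            if (t < 0) {
                continue;
            } else if (target < 0) {
                target = t;
            } else if (t != target) {
                closed = true; // dependencies span multiple threads, trigger execution
                return false;
            }
        }

        if (target < 0) { // no conflicts: pick the thread with the shortest queue
            target = 0;
            for (std::size_t t = 1; t < queues.size(); ++t) {
                if (queues[t].size() < queues[target].size()) {
                    target = static_cast<int>(t);
                }
            }
        }

        queues[target].push_back(obs);
        for (auto d : dependencies) {
            owner[d] = target;
        }
        return true;
    }
};
```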

Hopefully, the task queues build up to a healthy size and we don't have to spin up too many threads. At around 10 us per thread and 500 epochs, we'd be aiming for ~100 start/stops to keep the thread overhead under 10 seconds.

Note that this also necessitates holding onto the identities of the negative samples. This shouldn't take a lot of memory, but it may be wise to set a buffer size (e.g., 10 million negative samples) that closes all task queues and triggers execution.

It is done: e98d911 (#14). Code is not for the faint-hearted.

tl;dr it probably wasn't worth it.

The final implementation differs quite a bit from the above description. In the merged version, we iterate across observations as usual. For each observation, we check if it, its neighbors, or its negative samples were accessed in previous observations. If not, we send the task of updating that observation's embedding (and the associated updates to its neighbors) to a waiting thread. This is repeated for the next observations until all threads are occupied or we hit an observation that needs to access coordinates that are currently being updated by an active thread. We then wait for all active threads to finish, reset the access logs, and repeat this process until we run out of observations.
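In pseudocode-ish C++, one epoch looks roughly like the sketch below. This is illustrative only: dispatch() and wait_all() stand in for the real worker machinery, and none of these names exist in umappp.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of one epoch under the merged strategy.
template<class Dispatch, class WaitAll>
void run_epoch(const std::vector<std::vector<std::size_t>>& dependencies, // neighbors + negative samples per observation
               std::size_t nthreads,
               Dispatch dispatch,  // hands one observation's update to a free worker
               WaitAll wait_all) { // blocks until all active workers are done
    const std::size_t nobs = dependencies.size();
    std::vector<char> touched(nobs, 0); // access log for the current wave of workers
    std::size_t active = 0;

    for (std::size_t o = 0; o < nobs; ++o) {
        bool conflict = touched[o] != 0;
        for (auto d : dependencies[o]) {
            if (touched[d]) {
                conflict = true;
                break;
            }
        }

        // A conflict (or running out of idle workers) forces us to wait for the
        // current wave to finish and to reset the access log before continuing.
        if (conflict || active == nthreads) {
            wait_all();
            std::fill(touched.begin(), touched.end(), 0);
            active = 0;
        }

        touched[o] = 1;
        for (auto d : dependencies[o]) {
            touched[d] = 1;
        }
        dispatch(o);
        ++active;
    }

    wait_all(); // finish the last wave
}
```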

I distribute update tasks for individual observations rather than building a queue of tasks because the chance of conflicting accesses is too high with multiple observations. It's basically the birthday problem - if you take even a small group of observations, you'll probably run into at least one conflict, which requires us to stop all new update tasks until the existing threads finish their jobs. By distributing individual observations, we reduce the frequency of conflicts between threads. This frequency is further reduced with larger datasets because the graph becomes increasingly sparse, so it's unlikely that multiple threads will be working on the same things at the same time.
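To put rough numbers on that (a back-of-envelope estimate, not a measurement): with the MNIST-scale dataset used below, each update touches the observation itself plus ~15 neighbors plus ~5 negative samples, i.e. about 21 of the 70,000 coordinate sets. Treating those as uniform draws, a batch of k queued observations touches m = 21k coordinates, and the usual birthday approximation gives a conflict probability of roughly 1 - exp(-m^2 / (2 * 70000)). That is already ~27% for a batch of 10 and ~94% for a batch of 30, so queues of any useful length would almost always be cut short by a conflict.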

Having said that, the need to distribute tasks for individual observations means that we need to make inter-thread communication as fast as possible. Obviously this requires persistent threads, as I'm not going to spin up a new thread per observation. Unfortunately, mutexes don't cut it for inter-thread synchronization; there's too much stalling when we're distributing millions of jobs. Instead, we use a busy-wait approach where each thread polls an atomic bool to check whether it can start working (or if it's finished). This is very fast and allows threads to pick up jobs as soon as they are distributed. The downsides are that (i) we get CPU usage of 100% for each thread, which is pretty wasteful w.r.t. energy consumption if they're not actually doing anything, and (ii) the user had better have enough available CPUs, as any competition between threads will destroy performance once the busy-waits start interfering with actual compute.
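For illustration only (a toy handshake, not the code in e98d911), the busy-wait pattern looks something like this:

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Toy sketch: the worker spins on atomic flags instead of blocking on a
// mutex/condition variable, so it picks up a posted job almost immediately.
struct Worker {
    std::atomic<bool> ready{false};    // set by the main thread when a job has been posted
    std::atomic<bool> finished{false}; // set to shut the worker down
    std::function<void()> job;
    std::thread runner;

    Worker() : runner([this]() {
        while (!finished.load(std::memory_order_acquire)) {
            if (ready.load(std::memory_order_acquire)) { // busy-wait: 100% CPU even when idle
                job();
                ready.store(false, std::memory_order_release);
            }
        }
    }) {}

    // Called from the main thread; assumes the previous job has completed (ready == false).
    void run(std::function<void()> j) {
        job = std::move(j);
        ready.store(true, std::memory_order_release);
    }

    void wait() { // spin until the posted job has completed
        while (ready.load(std::memory_order_acquire)) {}
    }

    ~Worker() {
        wait(); // make sure the last job is done before shutting down
        finished.store(true, std::memory_order_release);
        runner.join();
    }
};
```

A condition variable would let idle workers sleep, but waking them goes through the kernel, and that latency is what makes the mutex-based version stall when millions of tiny jobs are being handed out.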

So, with that design (and after resolving the many race conditions that keep on popping up), we get the following results. I'll use the yaumap package for testing on real data, specifically the MNIST digits dataset from snedata. To keep things simple, I already computed the nearest neighbors so that we can see the impact of parallelization on the layout optimization specifically, instead of getting swamped by the runtime of NN detection.

```r
res <- readRDS("nn.rds")
library(yaumap)
system.time(out <- umap_from_neighbors(res$index, res$distance, num_threads=4, num_epochs=200))
##    user  system elapsed 
##  27.076   0.019  24.238 

system.time(out2 <- umap_from_neighbors(res$index, res$distance, num_threads=4, parallel_optimization=TRUE, num_epochs=200))
##    user  system elapsed 
##  69.423   0.021  17.416 

identical(out, out2)
## [1] TRUE
```

Same results, but ~30% faster, hooray. Keep in mind that this is with 400% CPU usage due to the busy-waits, so you can see it's not very cost-effective. But it is at least faster, so if cost is not a concern, and you have the cores...

The biggest bottleneck in the parallel version is the pseudo-random number generation. This step is done only in the main thread; the negative samples are then stored for distribution to each worker thread. I reckon it's responsible for about 10 seconds of the runtime, which is reasonable given that we're generating around a billion random numbers (200 epochs, 70,000 observations, 15 neighbors, about 5 negative samples on average = 1.05 billion numbers). Performance could definitely be improved by switching to a better PRNG and/or discrete sampling function, e.g., replacing MT with PCG32 and replacing my aarand::discrete_uniform with GCC's no-modulo method. If you're particularly wild, you could even create one RNG stream per observation and then you could sample within the worker thread; this would require PRNGs with a small state (i.e., not MT), and you would need to accept different results from single-threaded runs.
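For concreteness, something along these lines is the kind of replacement I have in mind. The PCG32 step below is the minimal generator from the PCG site; the bounded draw uses the multiply-shift trick (shown without the rejection step that removes its small bias), which is in the same spirit as, but not an exact copy of, the no-modulo method mentioned above.

```cpp
#include <cstdint>

// Minimal PCG32 generator (from pcg-random.org); inc must be odd, and different
// inc values give independent streams.
struct pcg32 {
    std::uint64_t state, inc;
    std::uint32_t operator()() {
        std::uint64_t old = state;
        state = old * 6364136223846793005ULL + inc;
        std::uint32_t xorshifted = static_cast<std::uint32_t>(((old >> 18u) ^ old) >> 27u);
        std::uint32_t rot = static_cast<std::uint32_t>(old >> 59u);
        return (xorshifted >> rot) | (xorshifted << ((-rot) & 31u));
    }
};

// Draw from [0, bound) without a modulo: take the high 32 bits of a 64-bit product.
// e.g. a negative sample index would be bounded(rng, num_obs).
inline std::uint32_t bounded(pcg32& rng, std::uint32_t bound) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(rng()) * bound) >> 32);
}
```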

An interesting tidbit is that the RNG is not a major contributor to runtime in the serial version. Replacing the RNG with pre-specified numbers only exhibits modest decreases in runtime (<0.5 seconds) in the single-threaded code. I speculate that this is because the compiler reorders some operations in the update iterations across observations; specifically, I'm guessing that it kicks off the instructions to do the random number sampling for the next iteration while the updates are being performed for the current iteration, running them in a kind-of-parallel way and eliminating the cost of the RNG.

Tagging @jlmelville, who might be interested in this story given our adventures in jlmelville/uwot#83.

I am slightly in awe of the heroics that have gone into making this scheme work. I knew when I read the daunting initial description that this wasn't something I would have conceived of or been able to implement in a way that was competitive with the standard asynchronous approach, let alone faster.

@LTLA how many of the decisions are fundamental to the inter-thread communication versus compiler- or architecture-specific performance around e.g. mutexes vs busy-waiting? From your write-up you seem ambivalent about the outcome: is this a dead end (but presumably an intellectually satisfying accomplishment) or something that could be made more CPU-efficient with e.g. more exotic threading constructs?

how many of the decisions are fundamental to the inter-thread communication versus compiler- or architecture-specific performance around e.g. mutexes vs busy-waiting?

Hm. I did most of my development on an old Intel Mac with clang during the day, and kept going on my slightly-less-old Ubuntu machine running GCC in the evening. While the absolute timings obviously differ across machines, the relative differences between the various strategies within the same machine are pretty consistent (IIRC). So it may indeed be something more fundamental, though perhaps things might be different on a non-Intel architecture.

is this a dead end (but presumably an intellectually satisfying accomplishment) or something that could be made more CPU-efficient with e.g. more exotic threading constructs?

Actually, I don't believe the threading is the issue. If I disable the threads and don't distribute any jobs to any workers at all, the umap_from_neighbors() call still takes 15-16 seconds! At this point, the only thing it's doing during the iterations is sampling random numbers, plus some checks for conflicting accesses. So, even the best threading approach would just save us 1-2 seconds; hence my focus on the PRNG performance in my above essay.

With a sufficiently fast PRNG, it should be possible to distribute jobs to workers quicker, and then we'd probably get a speed-up that matches up better with CPU usage. The PCG site suggests a few PRNGs that are more than twice as fast as the 64-bit MT I'm using, so that could be a good place to start. I guess I could allow people to template in an alternative PRNG themselves; I'm a bit reluctant to stray outside of the standard library for default uses of umappp.
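If I did allow that, the change might look something like the sketch below; the signature is purely illustrative and not the current umappp interface.

```cpp
#include <cstddef>
#include <random>

// Illustrative only: make the engine a template parameter with a standard-library
// default, so callers can substitute a faster generator (e.g. a PCG variant).
template<class Engine = std::mt19937_64>
std::size_t sample_negative(Engine& rng, std::size_t num_obs) {
    // Candidate index for one negative sample, drawn from [0, num_obs).
    return static_cast<std::size_t>(rng() % num_obs); // modulo kept for brevity; see the no-modulo trick above
}
```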

If we are willing to change the algorithm a little, it is possible to parallelize the random number generation across observations by using PRNGs that support multiple streams. Each observation would get its own RNG instance on a different stream; each worker would then be asked to generate its negative samples; control would return to the serial section to check for conflicting accesses; and then the update tasks would be distributed for observations without any conflicts. This adds an extra round of inter-thread communication to shift random number generation into each worker, but it should greatly mitigate the associated bottleneck in the serial section that is present in the current implementation.
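The stream setup could look something like the sketch below, reusing the pcg32 struct from the earlier snippet; the names and seeding scheme are hypothetical, just one obvious way to do it.

```cpp
#include <cstdint>
#include <vector>

// Assumes the pcg32 struct defined in the earlier sketch.
std::vector<pcg32> make_streams(std::size_t nobs, std::uint64_t seed) {
    std::vector<pcg32> streams(nobs);
    for (std::size_t o = 0; o < nobs; ++o) {
        streams[o].state = seed;                                // same starting state everywhere...
        streams[o].inc = 2 * static_cast<std::uint64_t>(o) + 1; // ...but a distinct odd increment per observation
    }
    return streams;
}
```

Each worker could then draw its own negative samples from streams[o] without touching a shared generator, at the cost of giving up identical results to the single-threaded runs.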

Of course, that's even more complicated than the current version. And the complexity of the code is already a major problem - these damn race conditions are extremely painful to resolve. It's for this reason that I wouldn't recommend implementing this in uwot - the benefits are too modest to justify the loss of hair.