radical-cybertools / radical.entk

The RADICAL Ensemble Toolkit

Home Page: https://radical-cybertools.github.io/entk/index.html


Is it possible to add support for cpu/gpu binding (especially for Polaris that uses PBS + mpiexec)

GKNB opened this issue · comments

commented

Though it won't affect the correctness of the code, different binding options can significantly change its performance, especially on clusters with a NUMA architecture. Is it possible to add options that allow users to specify the CPU/GPU binding in order to optimize performance? Below are some examples of why this might be important.

1). CPU binding, MPI + OpenMP

Let's use Polaris as an example. Polaris has 4 NUMA domains, and each domain includes 8 cores. We want to submit a job with 4 MPI processes, where each process uses 5 cores (OpenMP). There are two ways to submit the job:

1a). with --cpu-bind list:0,1,2,3,4:8,9,10,11,12:16,17,18,19,20:24,25,26,27,28
With this binding, we make sure that each process uses cores that all belong to the same NUMA domain.

1b). with --cpu-bind list:0,1,2,3,4:5,6,7,8,9:10,11,12,13,14:15,16,17,18,19
With this binding, some processes use cores from different NUMA domains.

I find that the running time in 1b) can be 60% longer than in 1a) for a simple inner-product test, suggesting that a CPU binding option can be important for MPI + OpenMP programs.
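
As a quick illustration of the difference (a throwaway Python sketch, using the 4 x 8-core NUMA layout described above), one can check which NUMA domain each rank's cores fall into for the two bindings:

    # Sketch: map each rank's core list to the NUMA domains it touches,
    # assuming 4 domains of 8 cores each as on Polaris.
    def domains_per_rank(cpu_bind_list, cores_per_numa=8):
        ranks = [list(map(int, r.split(','))) for r in cpu_bind_list.split(':')]
        return [sorted({c // cores_per_numa for c in cores}) for cores in ranks]

    print(domains_per_rank('0,1,2,3,4:8,9,10,11,12:16,17,18,19,20:24,25,26,27,28'))
    # [[0], [1], [2], [3]]       -> case 1a: every rank stays in one NUMA domain
    print(domains_per_rank('0,1,2,3,4:5,6,7,8,9:10,11,12,13,14:15,16,17,18,19'))
    # [[0], [0, 1], [1], [1, 2]] -> case 1b: ranks 1 and 3 straddle two domains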

2). GPU binding

Let's use Polaris as an example again. On Polaris, the CPU/GPU connections are as follows:

NUMA0 ---- cpu0-7 ---- GPU3
NUMA1 ---- cpu8-15 ---- GPU2
NUMA2 ---- cpu16-23 ---- GPU1
NUMA3 ---- cpu24-31 ---- GPU0

Currently, if I understand correctly, if we assign one GPU per process, RCT will automatically assign GPU0 to local_rank_0, GPU1 to local_rank_1, and so on. This causes a performance decrease unless we manually set the CPU binding so that local_rank_0 is bound to NUMA3, local_rank_1 to NUMA2, etc., since otherwise the connection between GPU and CPU is not optimized. While this does not introduce a significant performance issue for my current application, for another real ML application it increases the running time by about 20%.

This issue is even more serious on Crusher: if I do not set the GPU binding correctly there, I see a running time increase of more than 100%.
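
For illustration, the reversed mapping could be enforced with a small per-rank wrapper along the following lines (a Python sketch; PMI_LOCAL_RANK is an assumption about what the launcher exports for the node-local rank and may differ per system):

    # Sketch of a per-rank GPU selection that reverses the default mapping so
    # that each rank uses the GPU attached to its NUMA domain (GPU3 for local
    # rank 0, GPU2 for rank 1, ...). PMI_LOCAL_RANK is an assumption; this must
    # run before any CUDA context is created.
    import os

    num_gpus   = 4
    local_rank = int(os.environ.get('PMI_LOCAL_RANK', 0))
    os.environ['CUDA_VISIBLE_DEVICES'] = str(num_gpus - 1 - local_rank % num_gpus)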

3). Why we want manual control over binding

The above experiments show that binding can be important. Things become even more complicated when we want to run jobs on the CPUs and GPUs of a node asynchronously. Consider the following example:

We want to run the following two tasks in parallel to make full use of both the CPUs and GPUs on Polaris:
Task1: 4 processes, each process uses 1 GPU and 2 CPU cores
Task2: 24 processes, each process uses 1 CPU core
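
With the current EnTK API, the two task descriptions would look roughly like this (a sketch, with placeholder executables), and neither cpu_reqs nor gpu_reqs has a field to express any binding:

    from radical.entk import Task

    # Task 1: 4 MPI ranks, each with 2 cores and 1 GPU ('./gpu_app' is a placeholder)
    t1 = Task()
    t1.executable = './gpu_app'
    t1.cpu_reqs = {'cpu_processes'    : 4,
                   'cpu_process_type' : 'MPI',
                   'cpu_threads'      : 2,
                   'cpu_thread_type'  : 'OpenMP'}
    t1.gpu_reqs = {'gpu_processes'    : 1,
                   'gpu_process_type' : 'CUDA',
                   'gpu_threads'      : 1,
                   'gpu_thread_type'  : None}

    # Task 2: 24 single-core MPI ranks ('./cpu_app' is a placeholder)
    t2 = Task()
    t2.executable = './cpu_app'
    t2.cpu_reqs = {'cpu_processes'    : 24,
                   'cpu_process_type' : 'MPI',
                   'cpu_threads'      : 1,
                   'cpu_thread_type'  : None}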

To optimize performance, we want to make sure that task 1 sets its GPU and CPU binding correctly, for example as follows:
rank0 ---- cpu0,1 ---- GPU3
rank1 ---- cpu8,9 ---- GPU2
rank2 ---- cpu16,17 ---- GPU1
rank3 ---- cpu24,25 ---- GPU0
and then we want to use the remaining CPU cores for task 2. I think RCT might not currently support such a complicated binding setup. Thanks!

I am afraid that our task scheduler is not clever enough to understand NUMA boundaries and will thus indeed ignore them, potentially resulting in sub-optimal bindings. At the same time, it does not support user-supplied placement or binding hints.

We do however allow configuring a NOOP scheduler which does no placement and binding at all. Instead, placement and binding are left to the subsequent launcher, such as srun or mpirun. I do not know how those behave on Crusher and Polaris - do you consider that a viable approach for your use case?

commented

Thanks for your response! I think leaving the binding to srun and mpirun is what I am most familiar with, as this is something I currently do manually. For complicated binding setups, I usually use "--cpu-bind verbose" and manually assign the binding in the mpirun / srun command. So this seems helpful for my case, but I need to understand how to use it on Polaris, since Polaris uses PBS for scheduling.

Another way of tackling this issue, from my point of view, would be to add a new, optional feature along the following lines: two new attributes in task.cpu_reqs and task.gpu_reqs, called cpu_list and gpu_list. For an MPI + OpenMP + GPU task with k cpu_processes, n cpu_threads and m gpu_processes, cpu_list would be a k-by-n array where cpu_list[i][j] is the j-th core index for the i-th process, and gpu_list would be a k-by-m array where gpu_list[i][j] is the j-th GPU index for the i-th process.

Users who do not care about this second-order optimization can simply omit these attributes and let RCT decide how to assign cores and GPUs; users who do care would write the two arrays themselves. RCT could then pass that information to the launch command (srun / mpirun) via options such as "--cpu-bind" and "--gpu-bind". Scheduling could then be done in a similar fashion as we schedule nodes, except that we would be scheduling the smaller units of cores/GPUs instead of whole nodes.

Of course, this would make the code harder to port, since different clusters have different NUMA topologies, so users would need different cpu_list and gpu_list values on different clusters. But since this is a second-order optimization and it improves performance a lot, I think it would be better to support it than to rely on the default binding.

For example, with this feature we could solve problem 3 above. I could write something like:

t1.cpu_reqs = {
                'cpu_processes'    : 4,
                'cpu_process_type' : 'MPI',
                'cpu_threads'      : 2,
                'cpu_thread_type'  : 'OpenMP',
                'cpu_list'         : [[0,1],[8,9],[16,17],[24,25]]
              }

t1.gpu_reqs = {
                'gpu_processes'    : 1,
                'gpu_process_type' : 'CUDA',
                'gpu_threads'      : 1,
                'gpu_thread_type'  : None,
                'gpu_list'         : [[3],[2],[1],[0]]   # reverse order to match CPU and GPU within the same NUMA domain
              }

t2.cpu_reqs = {
                'cpu_processes'    : 24,
                'cpu_process_type' : 'MPI',
                'cpu_threads'      : 1,
                'cpu_thread_type'  : None,
                'cpu_list'         : [[2],[3],[4],[5],[6],[7],[10],[11],[12],[13],[14],[15],[18],[19],[20],[21],[22],[23],[26],[27],[28],[29],[30],[31]]
              }

Since t1 and t2 use different resources, RCT should be able to put them on the same node and run them asynchronously.

If tasks run on multiple nodes, these two arrays should be understood as the allocation pattern on every node.

Does this sound possible? Thanks!

This might actually be possible - but it would also interfere with the agent scheduler, which might have a very different opinion about task placement, leading to conflicts with other tasks placed by it. Thanks for the suggestion though; we will try to come up with a way to implement it. What is the priority of this feature?

commented

Hi Andre, thanks for the discussion. From my point of view, the priority of this feature is relatively high: the feature (or some other solution) is fundamental for running multiple jobs asynchronously on the same node. This is based on the following two observations on Polaris:
1). If I submit two jobs with mpirun onto the same node, each with only one process, they will use the SAME CORE unless we manually tell them not to with the cpu-bind flag. Mikhail also noticed that, in general, this can be solved with the rankfile feature of mpirun; however, it seems there is no rankfile support on Polaris.
2). It is usually important to bind each process to a CPU and GPU in the same NUMA domain. On Polaris I see a performance difference of at least 20-50%, depending on whether the binding is optimal. Our current research focuses on running tasks asynchronously on CPU and GPU, which has an upper limit on the performance improvement of a factor of 2, so if we cannot solve the binding issue, the improvement from asynchronous execution could be wiped out by the performance loss from sub-optimal binding.
You mentioned earlier that the NOOP scheduler allows us to manually set the binding, but I think this has not been implemented yet. Also, with this scheduler, where should I tell RP or EnTK how to bind the task (for example, currently there is no attribute in the task description that allows the user to set the binding)? Do you have any suggestions on that? Thanks!

You mentioned earlier that the NOOP scheduler allows us to manually set the binding

Sorry, I was not clear then. It is rather that, if we implement user-specified binding, then we would have to use the NOOP scheduler.

If I submit two jobs with mpirun onto the same node, each with only one process, they will use the SAME CORE unless we manually tell it not to by using cpu-bind flag.
It is usually important to bind the CPU with GPU on the same NUMA unit for a single process.

Both requirements sound very reasonable to me. I am still hesitant to go down the road of user-specified binding - it would open up a fair number of coordination and orchestration problems in our code. We will have to perform some analysis of what implementation work we would have to cover before going down that road.

An alternative approach to resolving the mentioned issues is to (a) make our scheduler NUMA-aware, and (b) support core pinning in the respective launch methods. The latter is not too difficult (once we understand the tools provided by the Polaris system to enforce pinning). The scheduler changes would be significant - but they would be isolated (they would not impact other parts of our code) and portable (they would work on other NUMA machines as well). The time frame for implementing the scheduler changes might actually be shorter than that for user-defined bindings.
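
To sketch what (a) would mean in practice - this is only an illustration of the placement constraint, not actual RADICAL-Pilot scheduler code:

    # Illustration only: allocate each rank's cores from a single NUMA domain
    # instead of from a flat core list. The layout below is the Polaris one
    # from this thread (4 domains x 8 cores each).
    NUMA_DOMAINS = [list(range(d * 8, (d + 1) * 8)) for d in range(4)]

    def allocate_rank(cores_needed, free_cores):
        # return cores for one rank, all taken from the same NUMA domain
        for domain in NUMA_DOMAINS:
            candidates = [c for c in domain if c in free_cores]
            if len(candidates) >= cores_needed:
                chosen = candidates[:cores_needed]
                free_cores.difference_update(chosen)
                return chosen
        raise RuntimeError('no NUMA domain has enough free cores')

    free = set(range(32))
    print([allocate_rank(5, free) for _ in range(4)])
    # [[0, 1, 2, 3, 4], [8, 9, 10, 11, 12], [16, 17, 18, 19, 20], [24, 25, 26, 27, 28]]
    # i.e., the layout from case 1a) above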

We'll take that topic back to our devel call. If you want, feel free to join our Wednesday call, which is open to users, to pick up on that topic in person.

Best, Andre.

PS.: @mturilli @mtitov: ping to add the topic to our call agenda.

We agreed that it will be useful to work on the scheduler. That will take some time, as it is a fairly time-consuming activity. Meanwhile, the use case can progress, assuming that at some point in the future we will get a relevant performance improvement. That improvement will not require any change in the application code.

We have started to work on this and opened a dedicated ticket on RP, as this problem will be solved at that level.

@GKNB :

We are still working on this. One problem we see right now is the lack of rankfile support in Polaris' mpiexec. Without that support we cannot really enforce any layout determined by the scheduler, nor can we enact any specific layout provided by the end user. We are iterating with Polaris support on how to address this issue.