ucx-py perf test
TimZaman opened this issue
I have been using `ucx_perftest` to successfully confirm UCX performs close to line rate. I then took ucx-py (`import ucp`) and used the send/recv example here: https://ucx-py.readthedocs.io/en/latest/quickstart.html#send-recv-numpy-arrays
I mutated this script into a few variants, and I am getting odd performance characteristics:
- If I have a server+client pair like the example above and put it in a loop, even re-using the connection, I get around <10% of bandwidth.
- If I have a server+client pair where the client sends an array to the server, the server sends an array back, and so on, ping-ponging, I get <10% of bandwidth.
- If I have a server+client pair where the client repeatedly sends data in a loop re-using the same connection and the server just receives, I get close to 90% of line rate (yay), as long as the messages are large.
- Any tips on the above?
- Any plans for a `ucx_perftest` equivalent for ucx-py?
(PS, I was one of the first engineers in the AI-Infra org at NVIDIA, hiiiii! 👋 )
Hi @TimZaman, welcome back, it's great to have you interacting with us!
We do have something similar to `ucx_perftest`, although not identical: it can be invoked via `python -m ucp.benchmarks.send_recv` if you're using the latest UCX-Py 0.28 or the current branch-0.29. Could you try checking performance with that tool? Please make sure you test both `--backend ucp-core` and `--backend ucp-async`; the first uses "regular" synchronous Python, whereas the second uses Python async, and you should see a substantial difference in performance between the two for smaller sizes (Python async overhead can be considerably large compared to the actual transfer time).
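As a rough, UCX-independent illustration of that async overhead (this is plain asyncio, not part of the UCX-Py benchmark, and the 12 GB/s link is just a hypothetical figure for scale), timing a no-op trip through the event loop shows a per-message floor that can rival the wire time of a small transfer:

```python
import asyncio
import time

async def measure_await_overhead(n: int = 10_000) -> float:
    """Average seconds per no-op trip through the event loop."""
    t0 = time.perf_counter()
    for _ in range(n):
        await asyncio.sleep(0)  # yield to the event loop once per "message"
    return (time.perf_counter() - t0) / n

overhead_s = asyncio.run(measure_await_overhead())
# Wire time of a 64 KiB message on a hypothetical 12 GB/s link, for scale:
wire_s = 65536 / 12e9
print(f"event loop: ~{overhead_s * 1e6:.2f} us/await; "
      f"64 KiB wire time: ~{wire_s * 1e6:.2f} us")
```

Since the 64 KiB wire time is only ~5.5 us, even a few microseconds of event-loop overhead per message is a large relative cost at small sizes, while for multi-megabyte messages it becomes negligible.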
Could you also clarify what you mean by "< 10% of bandwidth"? Do you mean less than 10% of theoretical expected bandwidth? Does the same apply to "90% line-rate" (i.e., 90% of theoretical bandwidth)?
Awesome! A benchmark suite like this is exactly what I needed! I'm testing with 64KB message sizes, and with `ucp-core` I can get around 20% of theoretical bandwidth, which is decent; I don't expect a single Python thread to be able to handle anything much faster.
Actually, even with default settings, ucx-py does not seem to get close to theoretical bandwidth (approx. 12 GB/s here). Below I show two outputs: the vanilla `ucp.benchmarks.send_recv` gets around 5.7 GiB/s, while `ucx_perftest` gets 12 GB/s as expected.
$ python -m ucp.benchmarks.send_recv
Server Running
Client connecting to server
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 9.54 MiB
Object type | numpy
Reuse allocation | False
Transfer API | TAG
UCX_TLS | all
UCX_NET_DEVICES | all
================================================================================
Device(s) | CPU-only
Server CPU | affinity not set
Client CPU | affinity not set
================================================================================
Bandwidth (average) | 5.69 GiB/s
Bandwidth (median) | 5.72 GiB/s
Latency (average) | 1637275 ns
Latency (median) | 1629442 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 5.88 GiB/s, 1584484ns
1 | 5.60 GiB/s, 1664311ns
2 | 5.55 GiB/s, 1678366ns
3 | 5.50 GiB/s, 1692920ns
4 | 5.75 GiB/s, 1618935ns
5 | 5.75 GiB/s, 1618440ns
6 | 5.81 GiB/s, 1604069ns
7 | 5.64 GiB/s, 1652339ns
8 | 5.75 GiB/s, 1620438ns
9 | 5.68 GiB/s, 1638446ns
While `ucx_perftest -t tag_bw` gets:
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0] 145 0.317 7484.007 7484.007 12742.83 12742.83 134 134
[thread 0] 273 0.309 8486.757 7954.161 11237.21 11989.63 118 126
[thread 0] 401 0.303 8491.218 8125.591 11231.30 11736.68 118 123
[thread 0] 529 0.300 8478.742 8211.041 11247.83 11614.54 118 122
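Incidentally, the `send_recv` report above is self-consistent. Assuming the default message size is 10,000,000 bytes (which is what the displayed 9.54 MiB corresponds to), the reported bandwidth is just bytes divided by latency:

```python
# Cross-check the send_recv report: bandwidth = message bytes / latency.
msg_bytes = 10_000_000          # assumption: 9.54 MiB shown above = 1e7 bytes
avg_latency_s = 1_637_275e-9    # "Latency (average)" from the report

bw_gib_s = msg_bytes / avg_latency_s / 2**30
print(f"{bw_gib_s:.2f} GiB/s")  # → 5.69 GiB/s, matching the reported average
```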
What kind of system do you have there? Given the rate you're achieving, I'm assuming you're running with InfiniBand. If so, I think you're hitting one specific regression/corner case we see in https://raw.githack.com/pentschev/ucx-py-ci/test-results/assets/ucx-py-bandwidth.html : if you go to NumPy async/RC (top row, third column) and select 1.12.0, there was a dip we never had the chance to investigate properly. Depending on the hardware you have available, we may be able to determine whether that is exactly what you're hitting. In general I would expect UCX-Py to perform very close to UCX (70-80%+) for large enough message sizes (4 MB or 8 MB).
Besides that, UCX-Py overrides a handful of UCX configurations; undoing some of them may help for CPU cases. For example, here I have:
Async UCX-Py default
$ python -m ucp.benchmarks.send_recv
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 9.54 MiB
Object type | numpy
Reuse allocation | False
Transfer API | TAG
UCX_TLS | all
UCX_NET_DEVICES | all
================================================================================
Device(s) | CPU-only
Server CPU | affinity not set
Client CPU | affinity not set
================================================================================
Bandwidth (average) | 5.98 GiB/s
Bandwidth (median) | 6.08 GiB/s
Latency (average) | 1558108 ns
Latency (median) | 1532689 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 6.20 GiB/s, 1502382ns
1 | 6.25 GiB/s, 1490367ns
2 | 6.06 GiB/s, 1537924ns
3 | 5.87 GiB/s, 1587366ns
4 | 6.27 GiB/s, 1486337ns
5 | 5.36 GiB/s, 1738170ns
6 | 5.94 GiB/s, 1567602ns
7 | 6.10 GiB/s, 1527196ns
8 | 5.76 GiB/s, 1616280ns
9 | 6.10 GiB/s, 1527453ns
Reverted Async UCX-Py defaults
$ UCX_MAX_RNDV_RAILS=2 UCX_RNDV_THRESH=auto UCX_RNDV_SCHEME=auto python -m ucp.benchmarks.send_recv
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 9.54 MiB
Object type | numpy
Reuse allocation | False
Transfer API | TAG
UCX_TLS | all
UCX_NET_DEVICES | all
================================================================================
Device(s) | CPU-only
Server CPU | affinity not set
Client CPU | affinity not set
================================================================================
Bandwidth (average) | 8.69 GiB/s
Bandwidth (median) | 8.72 GiB/s
Latency (average) | 1072017 ns
Latency (median) | 1067675 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 9.07 GiB/s, 1027366ns
1 | 8.82 GiB/s, 1055854ns
2 | 8.88 GiB/s, 1048929ns
3 | 8.38 GiB/s, 1111714ns
4 | 8.57 GiB/s, 1086856ns
5 | 8.07 GiB/s, 1153701ns
6 | 8.63 GiB/s, 1079497ns
7 | 8.95 GiB/s, 1040482ns
8 | 9.30 GiB/s, 1001290ns
9 | 8.36 GiB/s, 1114481ns
But I'm actually surprised how much worse CPU transfers perform with `--backend ucp-core`; it seems like I always overlooked this situation without realizing it:
Reverted Sync UCX-Py defaults
$ UCX_MAX_RNDV_RAILS=2 UCX_RNDV_THRESH=auto UCX_RNDV_SCHEME=auto python -m ucp.benchmarks.send_recv --backend ucp-core -n 10MiB
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 10.00 MiB
Object type | numpy
Reuse allocation | False
Transfer API | TAG
Delay progress | False
UCX_TLS | all
UCX_NET_DEVICES | all
================================================================================
Device(s) | CPU-only
Server CPU | affinity not set
Client CPU | affinity not set
================================================================================
Bandwidth (average) | 3.57 GiB/s
Bandwidth (median) | 3.58 GiB/s
Latency (average) | 2733794 ns
Latency (median) | 2729097 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 3.58 GiB/s, 2730071ns
1 | 3.57 GiB/s, 2734053ns
2 | 3.58 GiB/s, 2724421ns
3 | 3.58 GiB/s, 2731537ns
4 | 3.53 GiB/s, 2768109ns
5 | 3.61 GiB/s, 2703605ns
6 | 3.58 GiB/s, 2728122ns
7 | 3.60 GiB/s, 2709333ns
8 | 3.59 GiB/s, 2720375ns
9 | 3.50 GiB/s, 2788309ns
Most of these overridden defaults either work around UCX bugs/limitations or target better performance for GPU workflows. We focus more on GPU (perhaps too much) and maybe neglect CPU (which needs to be improved), but in the GPU case we are much closer to UCX performance:
UCX CUDA
$ CUDA_VISIBLE_DEVICES=0,1 ucx_perftest -t tag_bw -m cuda -s 10000000 -n 10000 localhost
[1667292877.267151] [dgx13:4259 :0] perftest.c:900 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0] 2451 413.773 408.954 408.954 23319.83 23319.83 2445 2445
[thread 0] 4871 414.057 414.354 411.637 23015.94 23167.86 2413 2429
[thread 0] 7290 414.063 414.358 412.540 23015.69 23117.14 2413 2424
[thread 0] 9709 414.054 414.358 412.993 23015.68 23091.78 2413 2421
Final: 10000 413.961 460.416 414.373 20713.33 23014.87 2172 2413
Sync UCX-Py CUDA
$ python -m ucp.benchmarks.send_recv -d 0 -e 1 -o rmm -l ucp-core -n 10MiB
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 10.00 MiB
Object type | rmm
Reuse allocation | False
Transfer API | TAG
Delay progress | False
UCX_TLS | all
UCX_NET_DEVICES | all
================================================================================
Device(s) | 0, 1
================================================================================
Bandwidth (average) | 19.98 GiB/s
Bandwidth (median) | 20.04 GiB/s
Latency (average) | 488713 ns
Latency (median) | 487421 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 19.68 GiB/s, 496252ns
1 | 19.97 GiB/s, 488969ns
2 | 20.06 GiB/s, 486839ns
3 | 20.07 GiB/s, 486643ns
4 | 20.10 GiB/s, 485858ns
5 | 19.85 GiB/s, 491996ns
6 | 20.03 GiB/s, 487621ns
7 | 20.04 GiB/s, 487222ns
8 | 20.06 GiB/s, 486862ns
9 | 19.98 GiB/s, 488870ns
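One unit caveat when eyeballing these: `ucx_perftest` prints MB/s in decimal megabytes, while `ucp.benchmarks.send_recv` prints GiB/s in binary gibibytes, so the two CUDA runs above are even closer than the raw numbers suggest. A quick conversion of the figures quoted above:

```python
# ucx_perftest's overall 23014.87 MB/s (decimal) expressed in GiB/s (binary),
# compared with the 19.98 GiB/s average reported by the UCX-Py sync run.
perftest_gib_s = 23014.87 * 1e6 / 2**30
ratio = 19.98 / perftest_gib_s
print(f"ucx_perftest: {perftest_gib_s:.2f} GiB/s; "
      f"UCX-Py sync is {ratio:.0%} of that")
```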
Async UCX-Py CUDA (now we see the async bottleneck)
$ python -m ucp.benchmarks.send_recv -d 0 -e 1 -o rmm -l ucp-async -n 10MiB
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 10.00 MiB
Object type | rmm
Reuse allocation | False
Transfer API | TAG
UCX_TLS | all
UCX_NET_DEVICES | all
================================================================================
Device(s) | 0, 1
================================================================================
Bandwidth (average) | 10.56 GiB/s
Bandwidth (median) | 10.59 GiB/s
Latency (average) | 925143 ns
Latency (median) | 921956 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 10.86 GiB/s, 899194ns
1 | 10.87 GiB/s, 898291ns
2 | 10.73 GiB/s, 909780ns
3 | 10.48 GiB/s, 932122ns
4 | 10.82 GiB/s, 902704ns
5 | 10.27 GiB/s, 951227ns
6 | 9.96 GiB/s, 980224ns
7 | 10.51 GiB/s, 928893ns
8 | 10.67 GiB/s, 915019ns
9 | 10.46 GiB/s, 933981ns
We have a complete rewrite of UCX-Py in C++ coming up, which will also allow multi-threading. Would you mind telling us more about your expected use case (types of compute and interconnect devices, message sizes you expect to perform well, whether you're using Python sync or async interfaces, etc.)? Anything you can tell us is useful.
I think this might be because we set some defaults differently from UCX, with the aim of getting slightly better performance for GPU-to-GPU messages (see details here). Can you try with `UCX_MAX_RNDV_RAILS=2 python -m ucp.benchmarks.send_recv`?
In addition, by default the ucx-py send-recv benchmark also measures the memory allocation cost as part of the message ping-pong time (I believe `ucx_perftest` does not do so, instead pre-allocating the buffers and reusing them). You can get reuse behaviour by passing `--reuse-alloc`. On my system:
$ python -m ucp.benchmarks.send_recv --no-detailed-report -o numpy
...
Bandwidth (average) | 6.12 GiB/s
...
$ python -m ucp.benchmarks.send_recv --no-detailed-report -o numpy --reuse-alloc
...
Bandwidth (average) | 7.50 GiB/s
...
$ python -m ucp.benchmarks.send_recv --no-detailed-report -o numpy --reuse-alloc -b 0 -c 2 # process pinning
...
Bandwidth (average) | 7.88 GiB/s
...
$ UCX_MAX_RNDV_RAILS=2 python -m ucp.benchmarks.send_recv --no-detailed-report -o numpy --reuse-alloc -b 0 -c 2
...
Bandwidth (average) | 13.48 GiB/s
...
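To see why counting allocation matters at these sizes, here is a small UCX-free timing sketch (pure Python, with `bytearray` standing in for NumPy) comparing a fresh 10 MiB allocation per iteration against reusing one buffer:

```python
import time

N = 10 * 1024 * 1024  # 10 MiB, roughly the benchmark's message size
ITERS = 30

# Fresh allocation every iteration (what the benchmark does without --reuse-alloc).
t0 = time.perf_counter()
for _ in range(ITERS):
    buf = bytearray(N)            # allocate and zero-fill 10 MiB each time
fresh_s = (time.perf_counter() - t0) / ITERS

# Allocate once, reuse afterwards (the --reuse-alloc behaviour).
buf = bytearray(N)
t0 = time.perf_counter()
for _ in range(ITERS):
    buf[0] = 0                    # touch the existing buffer, no allocation
reuse_s = (time.perf_counter() - t0) / ITERS

print(f"fresh: {fresh_s * 1e6:.0f} us/iter, reuse: {reuse_s * 1e6:.0f} us/iter")
```

With the measured latencies above sitting around 1-3 ms per transfer, a per-iteration allocation cost of even a few hundred microseconds is a visible fraction of what the benchmark reports.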
> But I'm actually surprised how much worse CPU transfers perform with `--backend ucp-core`, it seems like I always overlooked this situation without realizing it:

I think this is a consequence of not using `--reuse-alloc` in the benchmark.

> I think this is a consequence of not using `--reuse-alloc` in the benchmark.
Very good catch, this is less of a problem when using RMM because of the pool.
Updated numbers (now equal to UCX for Python sync, and ~70% of that with Python async):
Reverted Async UCX-Py defaults
$ UCX_MAX_RNDV_RAILS=2 UCX_RNDV_THRESH=auto UCX_RNDV_SCHEME=auto python -m ucp.benchmarks.send_recv --backend ucp-async -n 10MiB --reuse-alloc
Server Running at 10.33.225.163:38451
Client connecting to server at 10.33.225.163:38451
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 10.00 MiB
Object type | numpy
Reuse allocation | True
Transfer API | TAG
UCX_TLS | all
UCX_NET_DEVICES | all
================================================================================
Device(s) | CPU-only
Server CPU | affinity not set
Client CPU | affinity not set
================================================================================
Bandwidth (average) | 12.70 GiB/s
Bandwidth (median) | 12.55 GiB/s
Latency (average) | 768644 ns
Latency (median) | 778201 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 14.05 GiB/s, 695077ns
1 | 12.28 GiB/s, 795126ns
2 | 12.67 GiB/s, 771001ns
3 | 12.81 GiB/s, 762223ns
4 | 12.30 GiB/s, 794172ns
5 | 12.43 GiB/s, 785401ns
6 | 13.14 GiB/s, 742951ns
7 | 11.68 GiB/s, 835865ns
8 | 13.76 GiB/s, 709704ns
9 | 12.28 GiB/s, 794922ns
Reverted Sync UCX-Py defaults
$ UCX_MAX_RNDV_RAILS=2 UCX_RNDV_THRESH=auto UCX_RNDV_SCHEME=auto python -m ucp.benchmarks.send_recv --backend ucp-core -n 10MiB --reuse-alloc
Server Running at 10.33.225.163:35783
Client connecting to server at 10.33.225.163:35783
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 10.00 MiB
Object type | numpy
Reuse allocation | True
Transfer API | TAG
Delay progress | False
UCX_TLS | all
UCX_NET_DEVICES | all
================================================================================
Device(s) | CPU-only
Server CPU | affinity not set
Client CPU | affinity not set
================================================================================
Bandwidth (average) | 18.91 GiB/s
Bandwidth (median) | 19.03 GiB/s
Latency (average) | 516376 ns
Latency (median) | 513188 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 19.02 GiB/s, 513378ns
1 | 19.09 GiB/s, 511684ns
2 | 18.65 GiB/s, 523634ns
3 | 19.19 GiB/s, 508946ns
4 | 19.17 GiB/s, 509530ns
5 | 19.08 GiB/s, 511795ns
6 | 18.09 GiB/s, 539785ns
7 | 18.95 GiB/s, 515232ns
8 | 18.90 GiB/s, 516781ns
9 | 19.04 GiB/s, 512999ns
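For the record, the "~70%" figure follows directly from the two averages reported above (18.91 GiB/s sync vs. 12.70 GiB/s async):

```python
# Async-to-sync bandwidth ratio from the two reports above.
ratio = 12.70 / 18.91
print(f"{ratio:.0%}")  # → 67%
```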
Also, this discussion makes me realize that the performance drop we see in https://raw.githack.com/pentschev/ucx-py-ci/test-results/assets/ucx-py-bandwidth.html during 1.12.0 is actually because we introduced the `UCX_MAX_RNDV_RAILS=1` override in UCX-Py. That is clearly a poor choice for CPU-only, whereas for GPUs it is a better option, since it prevents part of the message going through a suboptimal path (e.g., when NVLink is available but UCX would still send the remainder of the message via InfiniBand or even TCP).