rapidsai / ucx-py

Python bindings for UCX

Home Page: https://ucx-py.readthedocs.io/en/latest/


ucx-py perf test

TimZaman opened this issue

I have been using ucx_perftest and successfully confirmed that UCX performs close to line rate.

I then took ucx-py (import ucp), and used the send/receive example here: https://ucx-py.readthedocs.io/en/latest/quickstart.html#send-recv-numpy-arrays

I made a few variants of this script, and I am seeing odd performance characteristics:

  1. If I take a server+client pair like the example above and put it in a loop, even reusing the connection, I get less than 10% of bandwidth.
  2. If I have a server+client pair where the client sends an array to the server, the server sends an array back, and so on (ping-ponging), I also get less than 10% of bandwidth.
  3. If I have a server+client pair where the client repeatedly sends data in a loop, reusing the same connection, and the server just receives, I get close to 90% of line rate (yay) as long as the messages are large (see the sketch after this list).
  • Any tips on the above?
  • Any plans for a ucx_perftest equivalent for ucx-py?
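
For reference, here's a simplified sketch of what variant 3 looks like on my end, based on the quickstart example above (host, port, sizes and iteration counts are placeholders, not my actual script):

# Sketch of variant 3: the client sends repeatedly over one endpoint,
# the server only receives. Run asyncio.run(run_server()) in one process
# and asyncio.run(run_client("<server-host>")) in another.
import asyncio
import numpy as np
import ucp

PORT = 13337
N_ITER = 100
N_BYTES = 128 * 1024**2  # large messages


async def run_server():
    async def handler(ep):
        buf = np.empty(N_BYTES, dtype="u1")
        for _ in range(N_ITER):
            await ep.recv(buf)      # receive only, never reply
        await ep.close()
        listener.close()

    listener = ucp.create_listener(handler, PORT)
    while not listener.closed():
        await asyncio.sleep(0.1)


async def run_client(server_host):
    ep = await ucp.create_endpoint(server_host, PORT)
    msg = np.zeros(N_BYTES, dtype="u1")
    for _ in range(N_ITER):
        await ep.send(msg)          # reuse the same endpoint and buffer
    await ep.close()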

(PS: I was one of the first engineers in the AI-Infra org at NVIDIA, hiiiii! 👋)

Hi @TimZaman, it's good to have you interacting with us again!

We do have something similar to ucx_perftest, although not identical: it can be run via python -m ucp.benchmarks.send_recv if you're using the latest UCX-Py 0.28 or the current branch-0.29. Could you try checking performance with that tool? Please make sure you test both --backend ucp-core and --backend ucp-async: the first uses "regular" synchronous Python, whereas the second uses Python async, and you should see a substantial difference between them for smaller sizes (Python async overhead can be considerable compared to the actual transfer time).

Could you also clarify what you mean by "< 10% of bandwidth"? Do you mean less than 10% of theoretical expected bandwidth? Does the same apply to "90% line-rate" (i.e., 90% of theoretical bandwidth)?

Awesome! I expected a benchmark suite like this existed, and it's exactly what I needed!

I'm testing with 64 KB message sizes, and with ucp-core I get around 20% of theoretical bandwidth, which is decent; I don't expect a single Python thread to be able to handle much more than that.

Actually, even with default settings, ucx-py does not seem to get close to theoretical bandwidth (approx. 12 GB/s here).

Below are two outputs: the vanilla ucp.benchmarks.send_recv gets around 5.7 GiB/s, while ucx_perftest gets 12 GB/s as expected.

python -m ucp.benchmarks.send_recv
Server Running
Client connecting to server
Roundtrip benchmark
================================================================================
Iterations                | 10
Bytes                     | 9.54 MiB
Object type               | numpy
Reuse allocation          | False
Transfer API              | TAG
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 5.69 GiB/s
Bandwidth (median)        | 5.72 GiB/s
Latency (average)         | 1637275 ns
Latency (median)          | 1629442 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 5.88 GiB/s, 1584484ns
1                         | 5.60 GiB/s, 1664311ns
2                         | 5.55 GiB/s, 1678366ns
3                         | 5.50 GiB/s, 1692920ns
4                         | 5.75 GiB/s, 1618935ns
5                         | 5.75 GiB/s, 1618440ns
6                         | 5.81 GiB/s, 1604069ns
7                         | 5.64 GiB/s, 1652339ns
8                         | 5.75 GiB/s, 1620438ns
9                         | 5.68 GiB/s, 1638446ns

While ucx_perftest -t tag_bw gets:

+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]               145      0.317  7484.007  7484.007    12742.83   12742.83         134         134
[thread 0]               273      0.309  8486.757  7954.161    11237.21   11989.63         118         126
[thread 0]               401      0.303  8491.218  8125.591    11231.30   11736.68         118         123
[thread 0]               529      0.300  8478.742  8211.041    11247.83   11614.54         118         122

What kind of system do you have there? Given the rate you're achieving, I'm assuming you're running over InfiniBand. If that is indeed the case, I think you're hitting one specific regression/corner case visible at https://raw.githack.com/pentschev/ucx-py-ci/test-results/assets/ucx-py-bandwidth.html (go to NumPy async/RC, top row, third column, and select 1.12.0); there is a dip there we never actually had the chance to investigate properly. Depending on the hardware you have available, we might be able to determine whether that is exactly what you're hitting. In general I would expect UCX-Py to perform very close to UCX (70-80%+) for large enough message sizes (4 MB or 8 MB).

Besides that, UCX-Py overrides a handful of UCX configuration defaults; undoing some of them may help for CPU cases. For example, here I have:

Async UCX-Py default
$ python -m ucp.benchmarks.send_recv
Roundtrip benchmark
================================================================================
Iterations                | 10
Bytes                     | 9.54 MiB
Object type               | numpy
Reuse allocation          | False
Transfer API              | TAG
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 5.98 GiB/s
Bandwidth (median)        | 6.08 GiB/s
Latency (average)         | 1558108 ns
Latency (median)          | 1532689 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 6.20 GiB/s, 1502382ns
1                         | 6.25 GiB/s, 1490367ns
2                         | 6.06 GiB/s, 1537924ns
3                         | 5.87 GiB/s, 1587366ns
4                         | 6.27 GiB/s, 1486337ns
5                         | 5.36 GiB/s, 1738170ns
6                         | 5.94 GiB/s, 1567602ns
7                         | 6.10 GiB/s, 1527196ns
8                         | 5.76 GiB/s, 1616280ns
9                         | 6.10 GiB/s, 1527453ns
Reverted Async UCX-Py defaults
$ UCX_MAX_RNDV_RAILS=2 UCX_RNDV_THRESH=auto UCX_RNDV_SCHEME=auto python -m ucp.benchmarks.send_recv
Roundtrip benchmark
================================================================================
Iterations                | 10
Bytes                     | 9.54 MiB
Object type               | numpy
Reuse allocation          | False
Transfer API              | TAG
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 8.69 GiB/s
Bandwidth (median)        | 8.72 GiB/s
Latency (average)         | 1072017 ns
Latency (median)          | 1067675 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 9.07 GiB/s, 1027366ns
1                         | 8.82 GiB/s, 1055854ns
2                         | 8.88 GiB/s, 1048929ns
3                         | 8.38 GiB/s, 1111714ns
4                         | 8.57 GiB/s, 1086856ns
5                         | 8.07 GiB/s, 1153701ns
6                         | 8.63 GiB/s, 1079497ns
7                         | 8.95 GiB/s, 1040482ns
8                         | 9.30 GiB/s, 1001290ns
9                         | 8.36 GiB/s, 1114481ns

But I'm actually surprised by how much worse CPU transfers perform with --backend ucp-core; it seems I had always overlooked this situation without realizing it:

Reverted Sync UCX-Py defaults
$ UCX_MAX_RNDV_RAILS=2 UCX_RNDV_THRESH=auto UCX_RNDV_SCHEME=auto python -m ucp.benchmarks.send_recv --backend ucp-core -n 10MiB
Roundtrip benchmark
================================================================================
Iterations                | 10
Bytes                     | 10.00 MiB
Object type               | numpy
Reuse allocation          | False
Transfer API              | TAG
Delay progress            | False
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 3.57 GiB/s
Bandwidth (median)        | 3.58 GiB/s
Latency (average)         | 2733794 ns
Latency (median)          | 2729097 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 3.58 GiB/s, 2730071ns
1                         | 3.57 GiB/s, 2734053ns
2                         | 3.58 GiB/s, 2724421ns
3                         | 3.58 GiB/s, 2731537ns
4                         | 3.53 GiB/s, 2768109ns
5                         | 3.61 GiB/s, 2703605ns
6                         | 3.58 GiB/s, 2728122ns
7                         | 3.60 GiB/s, 2709333ns
8                         | 3.59 GiB/s, 2720375ns
9                         | 3.50 GiB/s, 2788309ns

Most of these overridden defaults are there either to work around UCX bugs/limitations or to target better performance for GPU workflows. We focus more on GPU (perhaps too much) and somewhat neglect CPU (which needs to improve), but in the GPU case we are much closer to UCX performance:

UCX CUDA
$ CUDA_VISIBLE_DEVICES=0,1 ucx_perftest -t tag_bw -m cuda -s 10000000 -n 10000 localhost
[1667292877.267151] [dgx13:4259 :0]        perftest.c:900  UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]              2451    413.773   408.954   408.954    23319.83   23319.83        2445        2445
[thread 0]              4871    414.057   414.354   411.637    23015.94   23167.86        2413        2429
[thread 0]              7290    414.063   414.358   412.540    23015.69   23117.14        2413        2424
[thread 0]              9709    414.054   414.358   412.993    23015.68   23091.78        2413        2421
Final:                 10000    413.961   460.416   414.373    20713.33   23014.87        2172        2413
Sync UCX-Py CUDA
$ python -m ucp.benchmarks.send_recv -d 0 -e 1 -o rmm -l ucp-core -n 10MiB
Roundtrip benchmark
================================================================================
Iterations                | 10
Bytes                     | 10.00 MiB
Object type               | rmm
Reuse allocation          | False
Transfer API              | TAG
Delay progress            | False
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | 0, 1
================================================================================
Bandwidth (average)       | 19.98 GiB/s
Bandwidth (median)        | 20.04 GiB/s
Latency (average)         | 488713 ns
Latency (median)          | 487421 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 19.68 GiB/s, 496252ns
1                         | 19.97 GiB/s, 488969ns
2                         | 20.06 GiB/s, 486839ns
3                         | 20.07 GiB/s, 486643ns
4                         | 20.10 GiB/s, 485858ns
5                         | 19.85 GiB/s, 491996ns
6                         | 20.03 GiB/s, 487621ns
7                         | 20.04 GiB/s, 487222ns
8                         | 20.06 GiB/s, 486862ns
9                         | 19.98 GiB/s, 488870ns
Async UCX-Py CUDA (now we see the async bottleneck)
$ python -m ucp.benchmarks.send_recv -d 0 -e 1 -o rmm -l ucp-async -n 10MiB
Roundtrip benchmark
================================================================================
Iterations                | 10
Bytes                     | 10.00 MiB
Object type               | rmm
Reuse allocation          | False
Transfer API              | TAG
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | 0, 1
================================================================================
Bandwidth (average)       | 10.56 GiB/s
Bandwidth (median)        | 10.59 GiB/s
Latency (average)         | 925143 ns
Latency (median)          | 921956 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 10.86 GiB/s, 899194ns
1                         | 10.87 GiB/s, 898291ns
2                         | 10.73 GiB/s, 909780ns
3                         | 10.48 GiB/s, 932122ns
4                         | 10.82 GiB/s, 902704ns
5                         | 10.27 GiB/s, 951227ns
6                         | 9.96 GiB/s, 980224ns
7                         | 10.51 GiB/s, 928893ns
8                         | 10.67 GiB/s, 915019ns
9                         | 10.46 GiB/s, 933981ns

We have a complete rewrite of UCX-Py in C++ coming up, which will also allow multi-threading. Would you mind telling us more about your expected use case (types of compute and interconnect devices, message sizes you expect to perform well, whether you're using Python sync or async interfaces, etc.)? Anything you can tell us is useful.

I think this might be because we set some defaults differently from UCX, with the aim of getting slightly better performance for GPU-to-GPU messages; see details here. Can you try with:

UCX_MAX_RNDV_RAILS=2 python -m ucp.benchmarks.send_recv

?
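
If it helps to confirm what UCX-Py actually ends up applying, something along these lines should print the effective values (just a sketch, assuming ucp.get_config() behaves as in current releases and returns keys without the UCX_ prefix):

import ucp

ucp.init()  # applies UCX-Py's overrides plus any UCX_* environment variables
config = ucp.get_config()
for key in ("MAX_RNDV_RAILS", "RNDV_THRESH", "RNDV_SCHEME"):
    print(key, "=", config.get(key))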

In addition, by default the ucx-py send-recv benchmark also measures the memory allocation cost as part of the message ping-pong time (I think that ucx_perftest does not do so, instead pre-allocating the buffers and reusing them).
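
Roughly, the difference is the following (an illustrative sketch, not the benchmark's actual code):

import numpy as np

n_iter, n_bytes = 10, 10 * 1024**2

# Default: a fresh buffer is allocated inside the timed loop, so allocation
# (and first-touch page faults) count towards the reported latency.
for _ in range(n_iter):
    msg = np.zeros(n_bytes, dtype="u1")
    # ... timed send/recv of msg goes here ...

# With --reuse-alloc: allocate once up front and reuse the same buffer.
msg = np.zeros(n_bytes, dtype="u1")
for _ in range(n_iter):
    # ... timed send/recv of msg goes here ...
    pass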

You can get the reuse behaviour by passing --reuse-alloc. On my system:

$ python -m ucp.benchmarks.send_recv --no-detailed-report -o numpy
...
Bandwidth (average)       | 6.12 GiB/s
...
$ python -m ucp.benchmarks.send_recv --no-detailed-report -o numpy --reuse-alloc
...
Bandwidth (average)       | 7.50 GiB/s
...
$ python -m ucp.benchmarks.send_recv --no-detailed-report -o numpy --reuse-alloc -b 0 -c 2 # process pinning
...
Bandwidth (average)       | 7.88 GiB/s
...
$ UCX_MAX_RNDV_RAILS=2 python -m ucp.benchmarks.send_recv --no-detailed-report -o numpy --reuse-alloc -b 0 -c 2
...
Bandwidth (average)       | 13.48 GiB/s
...

But I'm actually surprised by how much worse CPU transfers perform with --backend ucp-core; it seems I had always overlooked this situation without realizing it:

I think this is a consequence of not using --reuse-alloc in the benchmark.

I think this is a consequence of not using --reuse-alloc in the benchmark.

Very good catch, this is less of a problem when using RMM because of the pool.
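
For context, RMM can serve allocations from a pre-reserved memory pool, so repeated per-iteration allocations become cheap sub-allocations rather than full cudaMalloc calls. A rough sketch (pool size arbitrary):

import rmm

# Reserve a device-memory pool up front; later allocations are carved out of it.
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)  # 1 GiB pool
buf = rmm.DeviceBuffer(size=10 * 1024**2)  # served from the pool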

Updated numbers below (now equal to UCX for Python sync, and ~70% of UCX performance with Python async):

Reverted Async UCX-Py defaults
$ UCX_MAX_RNDV_RAILS=2 UCX_RNDV_THRESH=auto UCX_RNDV_SCHEME=auto python -m ucp.benchmarks.send_recv --backend ucp-async -n 10MiB --reuse-alloc
Server Running at 10.33.225.163:38451
Client connecting to server at 10.33.225.163:38451
Roundtrip benchmark
================================================================================
Iterations                | 10
Bytes                     | 10.00 MiB
Object type               | numpy
Reuse allocation          | True
Transfer API              | TAG
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 12.70 GiB/s
Bandwidth (median)        | 12.55 GiB/s
Latency (average)         | 768644 ns
Latency (median)          | 778201 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 14.05 GiB/s, 695077ns
1                         | 12.28 GiB/s, 795126ns
2                         | 12.67 GiB/s, 771001ns
3                         | 12.81 GiB/s, 762223ns
4                         | 12.30 GiB/s, 794172ns
5                         | 12.43 GiB/s, 785401ns
6                         | 13.14 GiB/s, 742951ns
7                         | 11.68 GiB/s, 835865ns
8                         | 13.76 GiB/s, 709704ns
9                         | 12.28 GiB/s, 794922ns
Reverted Sync UCX-Py defaults
$ UCX_MAX_RNDV_RAILS=2 UCX_RNDV_THRESH=auto UCX_RNDV_SCHEME=auto python -m ucp.benchmarks.send_recv --backend ucp-core -n 10MiB --reuse-alloc
Server Running at 10.33.225.163:35783
Client connecting to server at 10.33.225.163:35783
Roundtrip benchmark
================================================================================
Iterations                | 10
Bytes                     | 10.00 MiB
Object type               | numpy
Reuse allocation          | True
Transfer API              | TAG
Delay progress            | False
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 18.91 GiB/s
Bandwidth (median)        | 19.03 GiB/s
Latency (average)         | 516376 ns
Latency (median)          | 513188 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 19.02 GiB/s, 513378ns
1                         | 19.09 GiB/s, 511684ns
2                         | 18.65 GiB/s, 523634ns
3                         | 19.19 GiB/s, 508946ns
4                         | 19.17 GiB/s, 509530ns
5                         | 19.08 GiB/s, 511795ns
6                         | 18.09 GiB/s, 539785ns
7                         | 18.95 GiB/s, 515232ns
8                         | 18.90 GiB/s, 516781ns
9                         | 19.04 GiB/s, 512999ns

Also, this discussion makes me realize that the performance drop we see in https://raw.githack.com/pentschev/ucx-py-ci/test-results/assets/ucx-py-bandwidth.html for 1.12.0 is actually because we introduced the UCX_MAX_RNDV_RAILS=1 override in UCX-Py. That is clearly a poor choice for CPU-only transfers, whereas for GPUs it is the better option, as it prevents part of the message from going through a suboptimal path (e.g., when NVLink is available, UCX would otherwise still send the remainder of the message via InfiniBand or even TCP).