Clarification on the HSA_FORCE_FINE_GRAIN_PCIE requirement
tmh97 opened this issue
Hey folks, Tom here from Cornelis Networks.
We've begun using the rccl-tests suite to test the functionality of our libfabric provider opx + aws-rccl-plugin.
We successfully ran these tests with no issues:
- all_gather_perf
- all_reduce_perf
- broadcast_perf
- reduce_perf
- reduce_scatter_perf
These tests all fail with out-of-memory errors:
- alltoall_perf
- alltoallv_perf
- gather_perf
- scatter_perf
- sendrecv_perf
When I set HSA_FORCE_FINE_GRAIN_PCIE=1, all of the failing tests magically pass.
The docs say "The HSA_FORCE_FINE_GRAIN_PCIE environment variable will need to be set to 1 in order to run the unit tests which use fine-grained memory type"; however, I am running all of the tests with the "Default: Coarse" memory type.
I am hoping for some clarification on why this variable seems to improve behavior. Maybe some of the tests use the "fine-grained memory type" by default? Any input would be greatly appreciated. Thanks in advance for any help!
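For readers who want to reproduce the workaround described above, here is a minimal Python sketch of building an environment with HSA_FORCE_FINE_GRAIN_PCIE=1 for a test launch. The helper names (build_test_env, build_command) and the ./build directory are hypothetical and not part of rccl-tests; the real launch command (for example an MPI launcher and its arguments) depends on your setup.

```python
import os


def build_test_env(base_env=None, force_fine_grain=True):
    """Return a copy of the environment with HSA_FORCE_FINE_GRAIN_PCIE set or removed.

    base_env defaults to the current process environment. The input mapping
    is never modified; a new dict is returned.
    """
    env = dict(os.environ if base_env is None else base_env)
    if force_fine_grain:
        env["HSA_FORCE_FINE_GRAIN_PCIE"] = "1"
    else:
        env.pop("HSA_FORCE_FINE_GRAIN_PCIE", None)
    return env


def build_command(test_name, build_dir="./build"):
    """Return the argv list for one test binary.

    build_dir is an assumed location for the compiled tests; adjust it,
    and prepend your launcher, to match your installation.
    """
    return [os.path.join(build_dir, test_name)]
```

The result could then be passed to the standard library, for example `subprocess.run(build_command("alltoall_perf"), env=build_test_env(), check=True)`, so the variable applies only to the child process and not to the rest of the shell session.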
What AMD GPU and ROCm version are you using? This flag should not be required for ROCm versions >= 5.7.
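The version threshold mentioned in this reply can be expressed as a small check. This is only a sketch: the function names are hypothetical, the version string is assumed to be supplied by the caller in dotted form such as "5.3.0", and the idea that older versions may need the flag is inferred from this thread rather than taken from ROCm documentation.

```python
def parse_version(text):
    """Parse a dotted version such as '5.3.0' into a tuple of ints, e.g. (5, 3, 0).

    Anything after a '-' (for example a build suffix) is ignored.
    """
    core = text.strip().split("-")[0]
    return tuple(int(part) for part in core.split("."))


def may_need_fine_grain_flag(rocm_version, threshold=(5, 7)):
    """Return True if the ROCm version is older than the threshold (major, minor).

    The default threshold of 5.7 comes from the reply in this thread;
    treat it as guidance, not a documented guarantee.
    """
    return parse_version(rocm_version)[:2] < threshold
```

With this check, "5.3.0" falls below the threshold while "5.7.0" and later do not.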
Ah, the ROCm version was my issue! I was using ROCm 5.3.0.
Thanks so much for the timely response!