Test HIP failure common.cu:1129 'hipErrorInvalidDevice' while running test
sheetalarkadam opened this issue
While testing my rccl installation with rccl-tests, I get: Test HIP failure common.cu:1129 'hipErrorInvalidDevice'
My configuration:
Device: [AMD] Starship/Matisse Reserved SPP
Distribution: "Ubuntu 20.04.4 LTS"
Repro steps:
export HCC_AMDGPU_TARGET=gfx90a
cd home
git clone https://github.com/ROCmSoftwarePlatform/rccl-tests.git
cd rccl-tests && make
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
Any idea on what could be wrong with my environment? I was able to run the tests last week.
I really appreciate any help!
Adding --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --cap-add=SYS_PTRACE to the docker run command solved the issue.
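
For anyone hitting the same error inside a container, here is a rough sketch of how to check whether the GPU device nodes are visible before running the tests. The helper name check_gpu_devices is made up for illustration, and IMAGE in the comments is a placeholder for whatever ROCm image you use; only the docker flags come from this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of rccl-tests or ROCm): report which of the
# given device nodes exist. Returns 0 if every path exists, 1 otherwise.
check_gpu_devices() {
    missing=0
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            echo "found:   $dev"
        else
            echo "missing: $dev"
            missing=1
        fi
    done
    return "$missing"
}

# Inside the container, check the nodes that the docker flags expose:
#   check_gpu_devices /dev/kfd /dev/dri
#
# If either is missing, restart the container with the flags from this thread
# (IMAGE is a placeholder, not a specific recommendation):
#   docker run -it --device=/dev/kfd --device=/dev/dri \
#       --security-opt seccomp=unconfined --group-add video \
#       --cap-add=SYS_PTRACE IMAGE
```

If both nodes are reported as found and the error persists, the cause is probably something other than the container's device access, for example requesting more GPUs with -g than the machine exposes.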