Re-enable w7900 CI jobs when the runner is stable again
ScottTodd opened this issue
To improve stability, we can try
- updating GPU drivers
- adding more runners
- running a generic sanity check (e.g. rocm-smi) before any test actions
- dumping logs (dmesg) if errors are detected (assuming nothing sensitive is in the logs)
test_amd_w7900 is still disabled (lines 432 to 470 in 2587078) due to https://github.com/iree-org/iree/actions/runs/9178357378/job/25238436482#step:7:100
6/266 Test #54: iree/hal/drivers/hip/dynamic_symbols_test ...........................................................***Failed 2.12 sec
[==========] Running 3 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from DynamicSymbolsTest
[ RUN ] DynamicSymbolsTest.CreateFromSystemLoader
[ OK ] DynamicSymbolsTest.CreateFromSystemLoader (2096 ms)
[ RUN ] DynamicSymbolsTest.SearchPathsFail
[ OK ] DynamicSymbolsTest.SearchPathsFail (0 ms)
[----------] 2 tests from DynamicSymbolsTest (2096 ms total)
[----------] 1 test from NCCLDynamicSymbolsTest
[ RUN ] NCCLDynamicSymbolsTest.CreateFromSystemLoader
iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols_test.cc:92: Failure
Expected equality of these values:
21803
nccl_version
Which is: 21806
[ FAILED ] NCCLDynamicSymbolsTest.CreateFromSystemLoader (2 ms)
[----------] 1 test from NCCLDynamicSymbolsTest (2 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 2 test suites ran. (2099 ms total)
[ PASSED ] 2 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NCCLDynamicSymbolsTest.CreateFromSystemLoader
1 FAILED TEST
@erman-gurses were you going to help re-enable this with fixes for iree/hal/drivers/hip/dynamic_symbols_test?
@ScottTodd Will try to take a look at it this week.
cc @antiagainst @nithinsubbiah, codeowners for /runtime/src/iree/hal/drivers/hip/ (line 83 in 58feff3).
iree/hal/drivers/hip/dynamic_symbols_test is failing on CI, so the entire job we added to test hip on w7900s has been disabled for 3 weeks. @erman-gurses may have time to debug since he helped set up the test runner, but these components need a maintainer.
You can put me down as the owner; however, I can only start looking at this next week.
This still needs attention. Just got another report of a similar test failure on Discord here. Logs: https://github.com/iree-org/iree/actions/runs/9511978721/job/26219408116?pr=17659#step:7:162
This version check is testing for equality against this version (line 20 in 3428231) when it should be checking for a minimum version instead.
cc @sogartar
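To illustrate the difference, here is a minimal sketch of an equality check versus a minimum-version check, using the two values from the failing log above (expected 21803, loaded 21806). The names kMinimumNcclVersion and IsNcclVersionSupported are hypothetical and are not taken from the IREE sources; the actual fix may look different.

```cpp
#include <cassert>

// Hypothetical constant: the version the test was written against.
// 21803 is the "expected" value printed in the failing log.
constexpr int kMinimumNcclVersion = 21803;

// Hypothetical helper: accepts any loaded library whose version code is at
// least the minimum, instead of requiring an exact match.
constexpr bool IsNcclVersionSupported(
    int loaded_version, int minimum_version = kMinimumNcclVersion) {
  return loaded_version >= minimum_version;
}

// The runner reported 21806. An equality check rejects it...
static_assert(21806 != kMinimumNcclVersion,
              "equality check fails for a newer library");
// ...while a minimum-version check accepts it.
static_assert(IsNcclVersionSupported(21806),
              "minimum-version check passes for a newer library");
```

In a GoogleTest-based test, the equivalent change would be swapping an equality assertion for a greater-or-equal one (for example EXPECT_GE), assuming newer library versions stay compatible with the symbols the driver loads.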
@ScottTodd, there is a fix for it in PR #17433, but that PR has been blocked by Docker image problems for a while now. I will open a PR with only this fix.
Already done: #17674
Please keep PRs focused on a single issue so combined PRs don't end up sitting for long periods.