iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

Re-enable w7900 CI jobs when the runner is stable again

ScottTodd opened this issue

To improve stability, we can try:

  • updating GPU drivers
  • adding more runners
  • running a generic sanity check (e.g. rocm-smi) before any test actions
  • dumping logs (dmesg) if errors are detected (assuming nothing sensitive is in the logs)
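The last two suggestions could be combined into a single preflight step. The following is only a minimal sketch, not an existing IREE script: the helper names `run_check` and `preflight` are hypothetical, and it assumes `rocm-smi` and `dmesg` are on the runner's PATH and that the CI user is allowed to read the kernel log.

```python
import subprocess
import sys


def run_check(cmd, timeout_s=60):
    """Run cmd and return (ok, combined_output) without raising on failure."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, permission problem, or a hung tool.
        return False, f"{type(exc).__name__}: {exc}"
    return proc.returncode == 0, proc.stdout


def preflight(check_cmd=("rocm-smi",), log_cmd=("dmesg",)):
    """Return True if the GPU check passes; otherwise print diagnostics."""
    ok, output = run_check(list(check_cmd))
    if ok:
        return True
    print("GPU sanity check failed:", output, sep="\n", file=sys.stderr)
    # Only dump the kernel log on failure, to limit what ends up in CI logs.
    _, log_output = run_check(list(log_cmd))
    print("Kernel log:", log_output, sep="\n", file=sys.stderr)
    return False
```

A workflow step could call `sys.exit(0 if preflight() else 1)` before building or running tests, so that an unhealthy runner fails fast with diagnostics instead of producing confusing test failures.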

test_amd_w7900 is still disabled:

# TODO(saienduri): re-enable when iree/hal/drivers/hip/dynamic_symbols_test is fixed
# test_amd_w7900:
#   needs: [setup, build_all]
#   if: contains(fromJson(needs.setup.outputs.enabled-jobs), 'test_amd_w7900')
#   env:
#     BUILD_DIR: build-tests
#     INSTALL_DIR: ${{ needs.build_all.outputs.install-dir }}
#     INSTALL_DIR_ARCHIVE: ${{ needs.build_all.outputs.install-dir-archive }}
#     INSTALL_DIR_GCS_URL: ${{ needs.build_all.outputs.install-dir-gcs-url }}
#     IREE_CPU_DISABLE: 1
#     IREE_VULKAN_DISABLE: 0
#     IREE_CUDA_DISABLE: 1
#     IREE_HIP_DISABLE: 0
#     IREE_HIP_TEST_TARGET_CHIP: "gfx1100"
#   runs-on: nodai-amdgpu-w7900-x86-64
#   steps:
#     - name: "Checking out repository"
#       uses: actions/checkout@ac593985615ec2ede58e132d2e21d2b1cbd6127c # v3.3.0
#     - name: "Checking out runtime submodules"
#       run: ./build_tools/scripts/git/update_runtime_submodules.sh
#     - name: "Downloading install dir archive"
#       run: wget "${INSTALL_DIR_GCS_URL}" -O "${INSTALL_DIR_ARCHIVE}"
#     - name: "Extracting install directory"
#       run: tar -xf "${INSTALL_DIR_ARCHIVE}"
#     - name: "Building tests"
#       run: |
#         ./build_tools/pkgci/build_tests_using_package.sh ${INSTALL_DIR}
#     - name: "Running GPU tests"
#       env:
#         IREE_CTEST_LABEL_REGEX: ^requires-gpu|^driver=vulkan$|^driver=hip$
#         IREE_AMD_RDNA3_TESTS_DISABLE: 0
#         IREE_NVIDIA_GPU_TESTS_DISABLE: 0
#         IREE_NVIDIA_SM80_TESTS_DISABLE: 1
#         IREE_MULTI_DEVICE_TESTS_DISABLE: 0
#         IREE_CUDA_DISABLE: 1
#         IREE_CPU_DISABLE: 1
#         IREE_HIP_DISABLE: 0
#       run: |
#         ./build_tools/cmake/ctest_all.sh ${BUILD_DIR}

due to https://github.com/iree-org/iree/actions/runs/9178357378/job/25238436482#step:7:100

  6/266 Test   #54: iree/hal/drivers/hip/dynamic_symbols_test ...........................................................***Failed    2.12 sec
[==========] Running 3 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from DynamicSymbolsTest
[ RUN      ] DynamicSymbolsTest.CreateFromSystemLoader
[       OK ] DynamicSymbolsTest.CreateFromSystemLoader (2096 ms)
[ RUN      ] DynamicSymbolsTest.SearchPathsFail
[       OK ] DynamicSymbolsTest.SearchPathsFail (0 ms)
[----------] 2 tests from DynamicSymbolsTest (2096 ms total)

[----------] 1 test from NCCLDynamicSymbolsTest
[ RUN      ] NCCLDynamicSymbolsTest.CreateFromSystemLoader
iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols_test.cc:92: Failure
Expected equality of these values:
  21803
  nccl_version
    Which is: 21806

[  FAILED  ] NCCLDynamicSymbolsTest.CreateFromSystemLoader (2 ms)
[----------] 1 test from NCCLDynamicSymbolsTest (2 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 2 test suites ran. (2099 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NCCLDynamicSymbolsTest.CreateFromSystemLoader

 1 FAILED TEST

@erman-gurses were you going to help re-enable this with fixes for iree/hal/drivers/hip/dynamic_symbols_test?

@ScottTodd Will try to take a look at it this week.

cc @antiagainst @nithinsubbiah, codeowners for /runtime/src/iree/hal/drivers/hip/:

/runtime/src/iree/hal/drivers/hip/ @antiagainst @nithinsubbiah

iree/hal/drivers/hip/dynamic_symbols_test is failing on CI, so the entire job we added to test HIP on w7900s has been disabled for 3 weeks. @erman-gurses may have time to debug since he helped set up the test runner, but these components need a maintainer.

You can put me down as the owner; however, I can only start looking at this next week.

This still needs attention. Just got another report of a similar test failure on Discord. Logs: https://github.com/iree-org/iree/actions/runs/9511978721/job/26219408116?pr=17659#step:7:162

This version check

ASSERT_EQ(NCCL_VERSION_CODE, nccl_version);

is checking equality against this version

#define NCCL_VERSION_CODE 21803

when it should be checking a minimum version instead.
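To make the mismatch concrete, here is a small Python illustration of the comparison logic (not the C++ test itself). It assumes the usual NCCL encoding for versions 2.9 and later, major*10000 + minor*100 + patch; the helper names `decode_nccl_version` and `meets_minimum` are hypothetical, and the two constants are the values quoted in the log and header above.

```python
def decode_nccl_version(code):
    """Split an NCCL version code into (major, minor, patch).

    Assumes the encoding major*10000 + minor*100 + patch.
    """
    major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def meets_minimum(runtime_code, minimum_code):
    """Return True if the runtime library is at least the required version."""
    return runtime_code >= minimum_code


HEADER_VERSION_CODE = 21803   # NCCL_VERSION_CODE the test was compiled against
RUNTIME_VERSION_CODE = 21806  # nccl_version reported on the CI runner

print(decode_nccl_version(HEADER_VERSION_CODE))   # (2, 18, 3)
print(decode_nccl_version(RUNTIME_VERSION_CODE))  # (2, 18, 6)
print(RUNTIME_VERSION_CODE == HEADER_VERSION_CODE)               # False
print(meets_minimum(RUNTIME_VERSION_CODE, HEADER_VERSION_CODE))  # True
```

Under that assumption the runner simply has a newer patch release than the header, so an equality check fails while a minimum-version check passes. In the gtest source, the equivalent change would presumably be a comparison such as `ASSERT_LE(NCCL_VERSION_CODE, nccl_version)`; the actual fix is in the PR referenced later in the thread.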

cc @sogartar

@ScottTodd, there is a fix for it in PR #17433, but it has been blocked by Docker image problems for a while now. I will open a PR with only this fix.

Already done: #17674

Please keep PRs focused on a single issue so combined PRs don't end up sitting for long periods.