pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)

Home Page: https://pytorch.org/xla

[CI] Add 'keep-going' label

alanwaketan opened this issue

Upstream has this cool 'keep-going' label that allows CI to continue running after test failures, so that we can catch all failures at once.

Let's add that to our repo too.

@wonjoolee95 has volunteered to work on this as part of his functionalization poc. Let me know if you need help!

Putting down some initial investigations.

Per circleci/circleci-docs#3505, it seems that GitHub labels (such as keep-going) cannot be accessed from CircleCI jobs, only from GitHub Actions. PyTorch migrated to GitHub Actions (https://github.com/pytorch/pytorch/tree/master/.github) a while back, but we are still using CircleCI jobs (https://github.com/pytorch/xla/tree/master/.circleci). So implementing this feature in our CI would require us to migrate to GitHub Actions first.

However, as a quick temporary solution, we can update our run_tests.sh script so that it does not error out on failures, like so:

xla/test/run_tests.sh

Lines 8 to 12 in 425da77

CONTINUE_ON_ERROR=true
# Disable exit-on-error only when the flag is explicitly set to true.
if [[ "$CONTINUE_ON_ERROR" == "true" ]]; then
  set +e
fi
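One thing to watch with the `set +e` approach: a shell script exits with the status of its last command, so earlier failures may not turn the job red unless the script tracks them in some other way. Below is a minimal sketch of one way to keep going while still failing at the end. The helpers `run_suite` and `report_failures` are hypothetical names for illustration and are not part of run_tests.sh.

```shell
#!/usr/bin/env bash
# Sketch only: run_suite and report_failures are hypothetical helpers,
# not functions that exist in xla/test/run_tests.sh.

CONTINUE_ON_ERROR="${CONTINUE_ON_ERROR:-false}"
FAILED_COUNT=0
FAILED_LIST=""

# Run one command. On failure, either stop at once (default) or
# record the failure and let the caller move on to the next suite.
run_suite() {
  if "$@"; then
    return 0
  fi
  if [ "$CONTINUE_ON_ERROR" != "true" ]; then
    echo "FAILED (stopping): $*" >&2
    exit 1
  fi
  echo "FAILED (continuing): $*" >&2
  FAILED_COUNT=$((FAILED_COUNT + 1))
  FAILED_LIST="${FAILED_LIST}${FAILED_LIST:+, }$*"
}

# Print every recorded failure and return non-zero if there was at least
# one, so the CI job is still marked as failed.
report_failures() {
  if [ "$FAILED_COUNT" -gt 0 ]; then
    echo "$FAILED_COUNT suite(s) failed: $FAILED_LIST" >&2
    return 1
  fi
  echo "All suites passed."
}
```

Each test invocation would be wrapped as `run_suite <command> <args...>`, with `report_failures` called as the last line of the script so that its return status becomes the script's exit status.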
Now, when a developer wants to test their PR against all test suites without erroring out on the first failure, they can manually set CONTINUE_ON_ERROR to true and submit the PR. The CircleCI tests will then keep going after failures. The developer should set it back to false before merging the PR. On that last point, maybe we can introduce a lint job on master to make sure this flag is set to false.
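A rough sketch of what such a lint check could look like. It assumes the flag is assigned on its own line as `CONTINUE_ON_ERROR=true` or `CONTINUE_ON_ERROR=false` (optionally quoted). The function name `check_keep_going_flag` is hypothetical, and how it would be wired into a CI job is left open.

```shell
#!/usr/bin/env bash
# Sketch only: check_keep_going_flag is a hypothetical helper, and the
# assignment format it looks for is an assumption about run_tests.sh.

# Succeeds only if the file assigns CONTINUE_ON_ERROR=false on its own line
# and never assigns it to true.
check_keep_going_flag() {
  file="$1"
  if grep -Eq '^CONTINUE_ON_ERROR="?true"?[[:space:]]*$' "$file"; then
    echo "error: CONTINUE_ON_ERROR is true in $file; set it back to false before merging" >&2
    return 1
  fi
  if ! grep -Eq '^CONTINUE_ON_ERROR="?false"?[[:space:]]*$' "$file"; then
    echo "error: no CONTINUE_ON_ERROR=false assignment found in $file" >&2
    return 1
  fi
  return 0
}
```

From the repo root this would be called as `check_keep_going_flag test/run_tests.sh` (path assumed from the snippet above); a non-zero return would fail the lint job.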