iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

CI runner for test_nvidia_a100 is running out of disk space

ScottTodd opened this issue

Following up on #17661 (comment)

The test_nvidia_a100 CI job has been failing to download the Docker image with this error:

docker: failed to register layer: write /var/cuda-repo-ubuntu2004-12-2-local/nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb: no space left on device.

Sample logs: https://github.com/iree-org/iree/actions/runs/9519620228/job/26243514700#step:8:60

Debugging shows that the postsubmit runner (iree-persistent-a100-2) has a 100GB disk: https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718633#step:4:11
The presubmit runner, by comparison, has a 1TB disk: https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718166#step:4:11

Can we recreate the runner with a larger disk? 100GB will be tight to fit a 12GB docker image, build artifacts, and other files.
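In case it helps while the runner is still on the small disk, here is a minimal sketch of a preflight check that a job could run before pulling the image. It is not part of the IREE workflows. The helper names (`has_room`, `describe`), the checked path, and the 20 GiB safety margin are all assumptions made for illustration; only the roughly 12GB image size comes from this thread.

```python
import shutil

GIB = 1024 ** 3


def has_room(path: str, required_gib: float, margin_gib: float = 0.0) -> bool:
    """Return True if the filesystem holding `path` has at least
    required_gib + margin_gib GiB free. (Hypothetical helper.)"""
    free_gib = shutil.disk_usage(path).free / GIB
    return free_gib >= required_gib + margin_gib


def describe(path: str) -> str:
    """Summarize free and total space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return (
        f"{path}: {usage.free / GIB:.1f} GiB free "
        f"of {usage.total / GIB:.1f} GiB"
    )
```

A job step could print `describe("/")` and fail early when `has_room("/", required_gib=12, margin_gib=20)` is false, so the log shows a clear message instead of a failure partway through registering a layer. On a runner where Docker stores its data on a separate filesystem, the path would need to point there instead of `/`.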

I can try, but it might take time. For some reason these are created with a single disk that is also the boot disk, so I have to shut down the instance to change the disk, and it may take a while to get another A100.

Postsubmit tests started passing again as of 1ea21d1 (2/2 postsubmit runs have passed so far).

Was the runner updated? Even if not, things might be fine for now?

Seems to be stable now.

No, I haven't had a chance to redo the runner yet. I was hoping to do it this afternoon. Should I just leave it alone?