CI runner for test_nvidia_a100 is running out of disk space

Question

ScottTodd opened this issue 5 days ago

The test_nvidia_a100 CI job has been failing to download the docker image with this error:

docker: failed to register layer: write /var/cuda-repo-ubuntu2004-12-2-local/nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb: no space left on device.

Sample logs: https://github.com/iree-org/iree/actions/runs/9519620228/job/26243514700#step:8:60

Debugging shows that the postsubmit runner (iree-persistent-a100-2) has a 100GB disk: https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718633#step:4:11
Compared to a 1TB disk for the presubmit runner: https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718166#step:4:11

Can we recreate the runner with a larger disk? 100GB will be tight to fit a 12GB docker image, build artifacts, and other files.
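For reference, a check like the one below could run before the image pull so the job fails early with a clear message. This is only a rough sketch, not something that exists in the IREE workflows: the function names are made up for illustration, the 12 GiB figure comes from the image size mentioned above, and the 20 GiB headroom for build artifacts is an assumed value that would need tuning.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def is_enough(free_bytes: int, image_gib: float, headroom_gib: float) -> bool:
    """Return True if free space covers the image plus extra headroom."""
    return free_bytes >= (image_gib + headroom_gib) * GIB


def check_path(path: str, image_gib: float = 12, headroom_gib: float = 20) -> bool:
    """Report free space on the filesystem holding `path` and check it.

    The defaults are assumptions: 12 GiB for the docker image and a
    hypothetical 20 GiB of headroom for build artifacts and other files.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    print(f"{path}: {usage.free / GIB:.1f} GiB free of {usage.total / GIB:.1f} GiB")
    return is_enough(usage.free, image_gib, headroom_gib)
```

Calling `check_path("/")` on the runner and exiting with an error when it returns False would surface the problem before docker starts writing layers. The path to check is an assumption too: since these runners have a single disk, `/` should cover it, but a runner with a separate docker data disk would need that mount point instead.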

Nancy Yuen · Answer 1 · Sat Jun 15 2024 03:48:49 GMT+0800 (China Standard Time)

I can try, but it might take time. For some reason these are created with a single disk that is also the boot disk, so I have to shut down the instance to change the disk, and it may take a while to get another A100.

Scott Todd · Answer 2 · Tue Jun 18 2024 00:09:30 GMT+0800 (China Standard Time)

Postsubmit tests started passing again as of 1ea21d1 (2/2 postsubmit runs have passed so far).

Was the runner updated? Even if not, things might be fine for now?

Scott Todd · Answer 3 · Wed Jun 19 2024 02:07:12 GMT+0800 (China Standard Time)

Seems to be stable now.

Nancy Yuen · Answer 4 · Wed Jun 19 2024 03:42:39 GMT+0800 (China Standard Time)

No, I haven't had a chance to redo the runner yet. I was hoping to do it this afternoon. Should I just leave it alone?