CI runner for test_nvidia_a100 is running out of disk space
ScottTodd opened this issue · comments
Following up on #17661 (comment)
The test_nvidia_a100 CI job has been failing to download the docker image with:
docker: failed to register layer: write /var/cuda-repo-ubuntu2004-12-2-local/nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb: no space left on device.
Sample logs: https://github.com/iree-org/iree/actions/runs/9519620228/job/26243514700#step:8:60
Debugging shows that the postsubmit runner (iree-persistent-a100-2) has a 100GB disk: https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718633#step:4:11
Compared to a 1TB disk for the presubmit runner: https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718166#step:4:11
Can we recreate the runner with a larger disk? 100GB will be tight to fit a 12GB docker image, build artifacts, and other files.
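For anyone who wants to surface this earlier in a job, here is a minimal sketch of a free-space check that could run before the image pull. It is not part of the existing workflow. The function names and any threshold passed in are hypothetical; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def has_enough_space(path: str, required_gib: float) -> bool:
    """Return True if the filesystem containing `path` has at least `required_gib` GiB free.

    `required_gib` is an illustrative threshold chosen by the caller,
    not a value taken from the runner configuration.
    """
    return free_gib(path) >= required_gib


def describe_disk(path: str) -> str:
    """Format total/used/free space for `path` as a one-line summary for CI logs."""
    usage = shutil.disk_usage(path)
    return (
        f"{path}: total={usage.total / GIB:.1f} GiB, "
        f"used={usage.used / GIB:.1f} GiB, "
        f"free={usage.free / GIB:.1f} GiB"
    )
```

A job step could print `describe_disk("/")` and fail fast when `has_enough_space("/", required_gib)` is false, with `required_gib` sized to cover the 12GB image plus build artifacts. The exact margin is a judgment call.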
I can try, but it might take time. For some reason these are created with a single disk that is also the boot disk, so I have to shut down the instance to change the disk, and it may take time to get another A100.
Postsubmit tests started passing again as of 1ea21d1 (2/2 postsubmit runs passed so far)
Was the runner updated? Even if not, things might be fine for now?
Seems to be stable now.
No, I haven't had a chance to redo the runner yet. I was hoping to do it this afternoon. Should I just leave it alone?