actions / runner-images

GitHub Actions runner images

CRITICAL BUG // DefaultLimitNOFILESoft=1024

ladar opened this issue · comments

Description

My GitHub actions started failing on 7/22/22... after lots of troubleshooting, I realized a new virtual environment was pushed between my successful run on the 21st and the failures starting on the 22nd.

I can't check what the limits were on the previous Ubuntu image, but for the current one, the build steps are launching with systemd limits of:

DefaultLimitNOFILE=65536
DefaultLimitNOFILESoft=1024

Note the soft limit of 1024. After lots of troubleshooting, I believe it's that limit which is causing my jobs to fail. I'm testing Packer templates, and that tool uses Go. Every step launches a new process. Because my templates have gotten rather large, this results in several hundred processes, all of which use unix sockets to communicate with each other. The result is thousands of open file handles. I believe the soft limit is what is causing the templates to fail, since 1024 is far below what is needed.
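
To see which limits a build step is actually running with, the shell can report them directly. This is a minimal sketch, assuming bash on a Linux runner; `show_nofile_limits` is an illustrative helper name, not part of any tool mentioned in this issue.

```shell
#!/usr/bin/env bash
# Print the soft and hard open-file limits of the current shell.
# show_nofile_limits is a hypothetical helper name used for illustration.
show_nofile_limits() {
  printf 'soft=%s hard=%s\n' "$(ulimit -Sn)" "$(ulimit -Hn)"
}

show_nofile_limits

# The kernel's record of the same limits for this shell (Linux only).
if [ -r "/proc/$$/limits" ]; then
  grep 'Max open files' "/proc/$$/limits"
fi
```

Running this as the first command of a failing step would show whether the step really starts with the 1024 soft limit.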

Platforms affected

  • Azure DevOps
  • GitHub Actions

Virtual environments affected

  • Ubuntu 18.04
  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 10.15
  • macOS 11
  • macOS 12
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Environment: ubuntu-20.04
Version: 20220717.1
Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20220717.1/images/linux/Ubuntu2004-Readme.md
Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20220717.1

Is it regression?

Yes. 20220717.1

Expected behavior

My GitHub actions should pass!

Actual behavior

My GitHub actions now fail!

Repro steps

This example needs to check out the lavabit/robox repository.

name: Robox Validate

jobs:
  Build:
    runs-on: ubuntu-20.04
    env:
        LANG: en_US.UTF-8
        LANGUAGE: en_US:en
        LC_ALL: en_US.UTF-8
    steps:
    - uses: actions/checkout@master
    - name: Install Dependencies
      env: 
        DEBIAN_FRONTEND: noninteractive
        DEBCONF_NONINTERACTIVE_SEEN: true
      run: |
        curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
        sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
        sudo apt-get update
        sudo apt-get --assume-yes install packer
    - name: Validate Generic Box Configurations
      env:
        GOGC: 50
        PACKER_LOG: 1
        GOMAXPROCS: 1
        VERSION: 1.0.0
      run: |
        date +"%nStarting generic box validation at %r on %x%n"
        export PACKER_LOG_PATH=generic-docker.txt ; packer validate generic-docker.json &>> packer-validate.txt && printf "File  + generic-docker.json\n" || { printf "File  - generic-docker.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-hyperv.txt ; packer validate generic-hyperv.json &>> packer-validate.txt && printf "File  + generic-hyperv.json\n" || { printf "File  - generic-hyperv.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-libvirt.txt ; packer validate generic-libvirt.json &>> packer-validate.txt && printf "File  + generic-libvirt.json\n" || { printf "File  - generic-libvirt.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-parallels.txt ; packer validate generic-parallels.json &>> packer-validate.txt && printf "File  + generic-parallels.json\n" || { printf "File  - generic-parallels.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-virtualbox.txt ; packer validate generic-virtualbox.json &>> packer-validate.txt && printf "File  + generic-virtualbox.json\n" || { printf "File  - generic-virtualbox.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-vmware.txt ; packer validate generic-vmware.json &>> packer-validate.txt && printf "File  + generic-vmware.json\n" || { printf "File  - generic-vmware.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-libvirt-x32.txt ; packer validate generic-libvirt-x32.json &>> packer-validate.txt && printf "File  + generic-libvirt-x32.json\n" || { printf "File  - generic-libvirt-x32.json\n" ; exit 1 ; }
        export PACKER_LOG_PATH=generic-virtualbox-x32.txt ; packer validate generic-virtualbox-x32.json &>> packer-validate.txt && printf "File  + generic-virtualbox-x32.json\n" || { printf "File  - generic-virtualbox-x32.json\n" ; exit 1 ; }
        date +"%nFinished generic box validation at %r on %x%n"

Hey @ladar.
You can change this limit:
#4683 (comment)
#3738 (comment)

Default settings: [screenshot]

- name: ulimit
  run: |
      sudo prlimit --pid $$ --nofile=500000:500000
      ulimit -n

So I don't know definitively what caused the breakage between the 20220717 and 20220722 images. Nothing in the repo commit log looked like a culprit, so the issue could be at a lower level.

I don't know which change fixed the problem, but I wasn't able to get the workflow to run reliably until I increased the swap size from 4096 to 8192. Some of my other changes may also have helped, as they reduced the memory footprint. In particular, net.unix.max_dgram_qlen combines with net.core.wmem_default and net.core.wmem_max to limit the backlog for unix sockets. Critically, though, the Linux kernel requires the entire memory block (qlen*wmem) to be contiguous for unix sockets. If the host running the image faces memory contention, it might not be able to find a contiguous block, which can cause the symptoms I saw.
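
The qlen*wmem product mentioned above can be computed from the live sysctl values. This sketch only reproduces that arithmetic as described in this comment and makes no claim about how the kernel allocates the memory. `backlog_bytes` is a hypothetical helper, and the sysctl keys may not be readable in some containers, hence the fallback.

```shell
#!/usr/bin/env bash
# backlog_bytes QLEN WMEM: multiply the datagram queue length by the
# per-socket write buffer size, in bytes.
backlog_bytes() {
  echo $(( $1 * $2 ))
}

# Read the current values; leave them empty if sysctl is unavailable.
qlen="$(sysctl -n net.unix.max_dgram_qlen 2>/dev/null)"
wmem="$(sysctl -n net.core.wmem_default 2>/dev/null)"

if [ -n "$qlen" ] && [ -n "$wmem" ]; then
  echo "max_dgram_qlen=$qlen wmem_default=$wmem product=$(backlog_bytes "$qlen" "$wmem") bytes"
else
  echo "sysctl values not readable here"
fi
```

Calling `backlog_bytes 64 212992`, for instance, multiplies a queue length of 64 by a 212992-byte buffer.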

The full set of fixes:

- name: Increase Limits
  run: |
    sudo sysctl -q vm.overcommit_ratio=100
    sudo sysctl -q net.unix.max_dgram_qlen=64
    sudo prlimit --pid $$ --nproc=65536:65536
    sudo prlimit --pid $$ --nofile=500000:500000
    printf "DefaultLimitNPROC=65536:65536\n" | sudo tee -a /etc/systemd/user.conf > /dev/null || exit 1
    printf "DefaultLimitNPROC=65536:65536\n" | sudo tee -a /etc/systemd/system.conf > /dev/null || exit 1
    printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/user.conf > /dev/null || exit 1
    printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/system.conf > /dev/null || exit 1
    sudo systemctl daemon-reload
    systemctl --user daemon-reload
- name: Increase Swap
  run: |
    sudo dd if=/dev/zero of=/swap bs=1M count=4096 status=none
    sudo chmod 600 /swap
    sudo mkswap /swap
    sudo swapon /swap
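
To confirm that the extra swap is active after the step above, the total can be read back from /proc/meminfo. This is a small sketch assuming a Linux runner; `swap_total_mib` is an illustrative name, and the optional file argument exists only so the parser can be tried against a saved copy of meminfo.

```shell
#!/usr/bin/env bash
# swap_total_mib [FILE]: print total swap in MiB, parsed from a meminfo-style
# file (defaults to /proc/meminfo, which reports SwapTotal in kB).
swap_total_mib() {
  awk '/^SwapTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
  echo "swap total: $(swap_total_mib) MiB"
fi
```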

I had to rerun the prlimit increase commands in each build step, since prlimit --pid $$ only changes the limits of the shell running that step:

- name: Demonstrate Limit Increases for a Build Step
  env:
    GOGC: 50
    GOMAXPROCS: 1
  run: |
    date +"%nStarting a sample/example build step at %r on %x%n"
    sudo prlimit --pid $$ --nproc=65536:65536
    sudo prlimit --pid $$ --nofile=500000:500000

@al-cheb I saw you closed the bug. Were you able to figure out what changed between the 20220717 and 20220721 images?

I may be wrong, but I came across this issue while trying to remember how I helped a co-worker correct a problem with the open files soft limit previously. What I recall was that while the kernel was configured to support a sufficiently large ulimit maximum, the GitHub Runner Action was spawned from a process in some ancestor unit file that was either explicitly setting or implicitly accepting a reduced soft limit. The way these things inherit, you can always opt into a smaller limit, but without elevated privileges you cannot raise it above the hard limit you inherit at process creation time.
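
The inheritance behaviour is easy to observe in a throwaway subshell. The sketch below assumes bash on Linux and a hard limit of at least 256; `child_soft_limit` is an illustrative name. It lowers the soft limit in a subshell and then asks a freshly started child what it inherited.

```shell
#!/usr/bin/env bash
# child_soft_limit N: lower the soft open-file limit to N inside a subshell,
# then ask a newly spawned child process which soft limit it inherited.
child_soft_limit() {
  (
    ulimit -Sn "$1" || exit 1
    bash -c 'ulimit -Sn'
  )
}

echo "parent soft limit:        $(ulimit -Sn)"
echo "child after lowering:     $(child_soft_limit 256)"
echo "parent is unaffected:     $(ulimit -Sn)"
```

The child reports the lowered value while the parent keeps its own, which is why a reduced limit set in an ancestor process follows every descendant unless one of them changes it.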

In order for our own, sufficiently large ulimit -n values to be effective, we had to modify this ancestor unit file to not restrict itself and its children quite as much.

I don't have that running example environment or access to a similar one just now, so while I am sure we fixed this by raising the soft ulimit for open files in some ancestor systemd unit file, I cannot seem to recall which one we had to target for GitHub Actions to inherit the correct limit, all the way back to PID 1.

Anyhow, I see this thread is relatively recent so thought I'd share what I had retained in case it is of some help/value.

This can be a hard issue to track down, because it runs contrary to how *nix systems historically handled the open file limit (a hard maximum in the kernel, plus the ulimit interface). It's also an issue that gets buried inside the morass that is systemd, which is still an opaque MacGuffin to the average system administrator.

The quickest fix is:

sudo sed -i '/DefaultLimitNOFILE=/d' /etc/systemd/user.conf
printf "DefaultLimitNOFILE=65536\n" | sudo tee -a /etc/systemd/user.conf > /dev/null

sudo sed -i '/DefaultLimitNOFILE=/d' /etc/systemd/system.conf
printf "DefaultLimitNOFILE=65536:524288\n" | sudo tee -a /etc/systemd/system.conf > /dev/null

This will increase the default limits for everything and save you lots of grief. Of course, this assumes you're not on a resource-constrained system shared by many users. If that is the case, you'll need a more granular approach.

To see where the limits are, run this:

echo 'user'; systemctl --user show | grep NOFILE; echo 'system'; sudo systemctl show | grep NOFILE

And don't forget that the kernel maximum and the ulimit values still apply, so update those as well if necessary.
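
As a rough check of the kernel side, the requested limit can be compared against fs.nr_open (the per-process ceiling for open files) and fs.file-max (the system-wide ceiling). This sketch assumes Linux with /proc mounted; `fits_under` is a hypothetical helper, and 524288 is simply the hard limit from the system.conf line above.

```shell
#!/usr/bin/env bash
# fits_under WANTED CEILING: succeed if WANTED does not exceed CEILING.
fits_under() {
  [ "$1" -le "$2" ]
}

wanted=524288   # hard limit used in the system.conf example above

# Compare against each kernel ceiling that is readable on this host.
for key in nr_open file-max; do
  path="/proc/sys/fs/$key"
  [ -r "$path" ] || continue
  ceiling="$(cat "$path")"
  if fits_under "$wanted" "$ceiling"; then
    echo "fs.$key=$ceiling: OK for $wanted"
  else
    echo "fs.$key=$ceiling: too low for $wanted"
  fi
done
```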