CRITICAL BUG // DefaultLimitNOFILESoft=1024
ladar opened this issue
Description
My GitHub actions started failing on 7/22/22... after lots of troubleshooting, I realized a new virtual environment was pushed between my successful run on the 21st and the failures starting on the 22nd.
I can't check what the limits were on the previous Ubuntu image, but for the current one, the build steps are launching with systemd limits of:
DefaultLimitNOFILE=65536
DefaultLimitNOFILESoft=1024
Note the soft limit of 1024. After lots of troubleshooting, I believe it's that limit which is causing my jobs to fail. I'm testing Packer templates, and that tool is written in Go. Every step launches a new process, and because my templates have gotten rather large, this results in several hundred processes, all of which use unix sockets to communicate with each other. The result is thousands of open file handles. I believe the soft limit is what is causing the templates to fail, since 1024 is far below what is needed.
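As a quick check, the limits a step actually inherits can be inspected from inside the job itself (a sketch; on the affected image the soft value reported is 1024):

```shell
# Print the soft and hard open-file limits the current step inherits.
ulimit -Sn                              # soft limit (1024 on the affected image)
ulimit -Hn                              # hard limit (65536 on the affected image)
grep 'Max open files' /proc/self/limits # both values, straight from the kernel
```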
Platforms affected
- Azure DevOps
- GitHub Actions
Virtual environments affected
- Ubuntu 18.04
- Ubuntu 20.04
- Ubuntu 22.04
- macOS 10.15
- macOS 11
- macOS 12
- Windows Server 2019
- Windows Server 2022
Image version and build link
Environment: ubuntu-20.04
Version: 20220717.1
Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20220717.1/images/linux/Ubuntu2004-Readme.md
Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20220717.1
Is it regression?
Yes. 20220717.1
Expected behavior
My GitHub actions should pass!
Actual behavior
My GitHub actions now fail!
Repro steps
This example would need to check out the lavabit/robox repository.
name: Robox Validate
jobs:
  Build:
    runs-on: ubuntu-20.04
    env:
      LANG: en_US.UTF-8
      LANGUAGE: en_US:en
      LC_ALL: en_US.UTF-8
    steps:
      - uses: actions/checkout@master
      - name: Install Dependencies
        env:
          DEBIAN_FRONTEND: noninteractive
          DEBCONF_NONINTERACTIVE_SEEN: true
        run: |
          curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
          sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
          sudo apt-get update
          sudo apt-get --assume-yes install packer
      - name: Validate Generic Box Configurations
        env:
          GOGC: 50
          PACKER_LOG: 1
          GOMAXPROCS: 1
          VERSION: 1.0.0
        run: |
          date +"%nStarting generic box validation at %r on %x%n"
          export PACKER_LOG_PATH=generic-docker.txt ; packer validate generic-docker.json &>> packer-validate.txt && printf "File + generic-docker.json\n" || { printf "File - generic-docker.json\n" ; exit 1 ; }
          export PACKER_LOG_PATH=generic-hyperv.txt ; packer validate generic-hyperv.json &>> packer-validate.txt && printf "File + generic-hyperv.json\n" || { printf "File - generic-hyperv.json\n" ; exit 1 ; }
          export PACKER_LOG_PATH=generic-libvirt.txt ; packer validate generic-libvirt.json &>> packer-validate.txt && printf "File + generic-libvirt.json\n" || { printf "File - generic-libvirt.json\n" ; exit 1 ; }
          export PACKER_LOG_PATH=generic-parallels.txt ; packer validate generic-parallels.json &>> packer-validate.txt && printf "File + generic-parallels.json\n" || { printf "File - generic-parallels.json\n" ; exit 1 ; }
          export PACKER_LOG_PATH=generic-virtualbox.txt ; packer validate generic-virtualbox.json &>> packer-validate.txt && printf "File + generic-virtualbox.json\n" || { printf "File - generic-virtualbox.json\n" ; exit 1 ; }
          export PACKER_LOG_PATH=generic-vmware.txt ; packer validate generic-vmware.json &>> packer-validate.txt && printf "File + generic-vmware.json\n" || { printf "File - generic-vmware.json\n" ; exit 1 ; }
          export PACKER_LOG_PATH=generic-libvirt-x32.txt ; packer validate generic-libvirt-x32.json &>> packer-validate.txt && printf "File + generic-libvirt-x32.json\n" || { printf "File - generic-libvirt-x32.json\n" ; exit 1 ; }
          export PACKER_LOG_PATH=generic-virtualbox-x32.txt ; packer validate generic-virtualbox-x32.json &>> packer-validate.txt && printf "File + generic-virtualbox-x32.json\n" || { printf "File - generic-virtualbox-x32.json\n" ; exit 1 ; }
          date +"%nFinished generic box validation at %r on %x%n"
Hey @ladar.
You can change this limit:
#4683 (comment)
#3738 (comment)
- name: ulimit
  run: |
    sudo prlimit --pid $$ --nofile=500000:500000
    ulimit -n
So I don't know definitively what caused the breakage between the 20220717 and 20220722 images. Nothing in the repo commit log looked like a culprit, so the issue could be at a lower level.
I don't know which change fixed the problem, but I wasn't able to get the workflow to run reliably until I increased the swap size from 4096 to 8192. Some of my other changes may have also helped, as they reduced the memory footprint. In particular, net.unix.max_dgram_qlen combines with net.core.wmem_default and net.core.wmem_max to limit the backlog for unix sockets. Critically though, the Linux kernel requires the entire memory block (qlen * wmem) to be contiguous for unix sockets. If the host running the image faces memory contention, it might not be able to find a contiguous block, which can cause the symptoms I saw.
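To put rough numbers on that, here is the qlen * wmem arithmetic with illustrative values (qlen=512 and wmem_default=212992 are assumptions for the sake of the example, not measured from the image; actual defaults vary by kernel):

```shell
# Worst-case per-socket datagram backlog, using illustrative defaults.
# These values are assumptions for the example, not measured from the image.
qlen=512        # a common net.unix.max_dgram_qlen default
wmem=212992     # a common net.core.wmem_default
echo "$(( qlen * wmem / 1048576 )) MiB"   # prints "104 MiB"

qlen=64         # after lowering net.unix.max_dgram_qlen, as in the fix below
echo "$(( qlen * wmem / 1048576 )) MiB"   # prints "13 MiB"
```

With hundreds of sockets open at once, the difference between those two ceilings adds up quickly on a memory-constrained host.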
The full set of fixes:
- name: Increase Limits
  run: |
    sudo sysctl -q vm.overcommit_ratio=100
    sudo sysctl -q net.unix.max_dgram_qlen=64
    sudo prlimit --pid $$ --nproc=65536:65536
    sudo prlimit --pid $$ --nofile=500000:500000
    printf "DefaultLimitNPROC=65536:65536\n" | sudo tee -a /etc/systemd/user.conf > /dev/null || exit 1
    printf "DefaultLimitNPROC=65536:65536\n" | sudo tee -a /etc/systemd/system.conf > /dev/null || exit 1
    printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/user.conf > /dev/null || exit 1
    printf "DefaultLimitNOFILE=500000:500000\n" | sudo tee -a /etc/systemd/system.conf > /dev/null || exit 1
    sudo systemctl daemon-reload
    systemctl --user daemon-reload
- name: Increase Swap
  run: |
    sudo dd if=/dev/zero of=/swap bs=1M count=4096 status=none
    sudo chmod 600 /swap
    sudo mkswap /swap
    sudo swapon /swap
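A quick sanity check after the swap step (a sketch; `swapon --show` lists the active swap areas):

```shell
# Confirm the new swap area is active and counted by the kernel.
swapon --show                     # lists active swap areas, e.g. /swap
grep SwapTotal /proc/meminfo      # kernel-reported total swap
```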
I had to rerun the prlimit increase commands for each build step like so:
- name: Demonstrate Limit Increases for a Build Step
  env:
    GOGC: 50
    GOMAXPROCS: 1
  run: |
    date +"%nStarting a sample/example build step at %r on %x%n"
    sudo prlimit --pid $$ --nproc=65536:65536
    sudo prlimit --pid $$ --nofile=500000:500000
@al-cheb I saw you closed the bug. Were you able to figure out what changed between the 20220717 and 20220721 images?
I may be wrong, but I came across this issue while trying to remember how I previously helped a co-worker correct a problem with the open-files soft limit. What I recall is that while the kernel was configured to support a sufficiently large ulimit maximum, the GitHub runner was spawned from a process in some ancestor unit file that was either explicitly setting, or implicitly accepting, a reduced soft limit. The way these limits are inherited, you can always opt into a smaller limit, but you cannot raise your limit higher than what you inherit at process creation time.
In order for our own sufficiently large `ulimit -n` values to take effect, we had to modify that ancestor unit file so it would not restrict itself and its children quite so much.
I don't have that running example environment, or access to one similar, just now. So while I am sure we fixed his problem by raising the soft open-files limit in some ancestor systemd unit file, I cannot recall which one has to be targeted for GitHub Actions to inherit the correct limits all the way back from PID 1.
Anyhow, I see this thread is relatively recent so thought I'd share what I had retained in case it is of some help/value.
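One way to hunt for that ancestor without remembering the unit name is to walk up the parent chain and print each process's soft open-files limit; the process where the value drops points at the culprit. A sketch for Linux (limits of ancestors owned by other users may be unreadable and print blank):

```shell
# Walk from the current shell up toward PID 1, printing each ancestor's
# soft open-files limit. The hop where the value falls is the unit that
# lowered it.
pid=$$
while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
    comm=$(cat "/proc/$pid/comm" 2>/dev/null)
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    echo "pid=$pid comm=$comm soft_nofile=$soft"
    pid=$(awk '/^PPid:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
done
```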
This can be a hard issue to track down, because it runs contrary to how *nix systems historically handled the open file limit (i.e., a hard maximum enforced by the kernel and adjusted via the ulimit interface). It's also an issue that gets buried inside the morass that is systemd, which remains opaque to the average system administrator.
The quickest fix is:
sudo sed -i '/DefaultLimitNOFILE=/d' /etc/systemd/user.conf
printf "DefaultLimitNOFILE=65536\n" | sudo tee -a /etc/systemd/user.conf > /dev/null
sudo sed -i '/DefaultLimitNOFILE=/d' /etc/systemd/system.conf
printf "DefaultLimitNOFILE=65536:524288\n" | sudo tee -a /etc/systemd/system.conf > /dev/null
This will increase the default limits for everything, and save you lots of grief. Of course, this assumes you're not on a resource-constrained system shared by many users. If you are, you'll need a more granular approach.
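If you do need the granular approach, a per-unit drop-in raises the limit for just one service rather than globally (a sketch; `myservice` is a hypothetical unit name, substitute your own):

```shell
# Raise the open-file limit for a single unit via a drop-in file,
# leaving the global DefaultLimitNOFILE untouched.
# "myservice" is a hypothetical unit name used for illustration.
sudo mkdir -p /etc/systemd/system/myservice.service.d
printf '[Service]\nLimitNOFILE=65536\n' |
    sudo tee /etc/systemd/system/myservice.service.d/limits.conf > /dev/null
sudo systemctl daemon-reload
sudo systemctl restart myservice.service
```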
To see where the limits are, run this:
echo 'user' ; systemctl --user show | grep NOFILE ; echo 'system' ; sudo systemctl show | grep NOFILE
And don't forget: the kernel maximum and the ulimit values still apply, so update those as well, if necessary.
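For reference, those kernel-level ceilings can be read straight from /proc (a sketch; the sysctl value shown for raising them is illustrative):

```shell
# Kernel ceilings that still cap whatever systemd and ulimit allow:
cat /proc/sys/fs/file-max     # system-wide limit on open file handles
cat /proc/sys/fs/nr_open      # per-process ceiling for RLIMIT_NOFILE

# To raise them (the value here is illustrative):
# sudo sysctl -w fs.file-max=2097152
```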