wlanslovenija / tunneldigger

L2TPv3 VPN tunneling solution

Home Page: http://tunneldigger.readthedocs.org/

Broker crashes, running out of file descriptors

RalfJung opened this issue

After 5-6h of uptime, the tunneldigger broker quits with the following error:

Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: Traceback (most recent call last):
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: "__main__", fname, loader, pkg_name)
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: exec code in run_globals
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: File "/opt/tunneldigger/lib/python2.7/site-packages/tunneldigger_broker-0.3.0-py2.7-linux-x86_64.egg/tunneldigger_broker/main.py", line 113, in <module>
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: event_loop.start()
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: File "/opt/tunneldigger/local/lib/python2.7/site-packages/tunneldigger_broker-0.3.0-py2.7-linux-x86_64.egg/tunneldigger_broker/eventloop.py", line 59, in start
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: pollable.read(file_object)
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: File "/opt/tunneldigger/local/lib/python2.7/site-packages/tunneldigger_broker-0.3.0-py2.7-linux-x86_64.egg/tunneldigger_broker/network.py", line 98, in read
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: callback()
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: File "/opt/tunneldigger/local/lib/python2.7/site-packages/tunneldigger_broker-0.3.0-py2.7-linux-x86_64.egg/tunneldigger_broker/tunnel.py", line 230, in pmtu_di
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: self.create_timer(self.pmtu_discovery, timeout=random.randrange(2, 5))
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: File "/opt/tunneldigger/local/lib/python2.7/site-packages/tunneldigger_broker-0.3.0-py2.7-linux-x86_64.egg/tunneldigger_broker/network.py", line 83, in create_
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: timer = timerfd.create(timerfd.CLOCK_MONOTONIC)
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: File "/opt/tunneldigger/local/lib/python2.7/site-packages/tunneldigger_broker-0.3.0-py2.7-linux-x86_64.egg/tunneldigger_broker/timerfd.py", line 117, in create
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: ret = libc.timerfd_create(clock_id, flags)
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: File "/opt/tunneldigger/local/lib/python2.7/site-packages/tunneldigger_broker-0.3.0-py2.7-linux-x86_64.egg/tunneldigger_broker/timerfd.py", line 103, in errche
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: raise OSError(errno, os.strerror(errno))
Oct 30 03:24:41 gw3.saar.freifunk.net python[9594]: OSError: [Errno 24] Too many open files

This has now happened twice. The first time, there were 150 tunnels connected constantly. The second time, the number of tunnels slowly went up from 0 to 70 before the crash.
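Errno 24 is EMFILE: the process has hit its per-process limit on open file descriptors. A minimal sketch for checking how close a process is to that limit, assuming Linux with /proc mounted; the helper name `fd_headroom` is made up for illustration, and the code is Python 3 rather than the Python 2.7 the broker runs on:

```python
import errno
import os
import resource


def fd_headroom():
    """Return (open_fds, soft_limit) for the current process."""
    # Every entry in /proc/self/fd is one open descriptor (Linux only).
    open_fds = len(os.listdir("/proc/self/fd"))
    # The soft limit is the one that triggers EMFILE; the hard limit only
    # caps how far the soft limit may be raised.
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_headroom()
    print("errno %d is %s" % (errno.EMFILE, errno.errorcode[errno.EMFILE]))
    print("open descriptors: %d, soft limit: %d" % (used, limit))
```

Raising the limit would only delay a crash caused by a leak, so this is a diagnostic aid rather than a fix.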

Looking at the process (/proc/$PID/fd) right now (with about 70 active connections) shows 1019 file descriptors. 144 are anon_inode:[timerfd], and 800 of them are pipe:[...]. 75 are socket:[...]. So it seems these pipes (whatever they are) are actually much worse than the timerfds.
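The breakdown above can be reproduced with a few lines of Python instead of counting by hand. This is a sketch, assuming Linux with /proc mounted; `count_fd_kinds` is a hypothetical helper, not part of tunneldigger, and it is written for Python 3:

```python
import collections
import os
import re

# Matches link targets such as "pipe:[12345]" or "socket:[678]".
_KIND_RE = re.compile(r"^(\w+):\[\d+\]$")


def count_fd_kinds(pid="self"):
    """Return a Counter mapping descriptor kind to the number of open fds."""
    fd_dir = "/proc/%s/fd" % pid
    kinds = collections.Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        match = _KIND_RE.match(target)
        if target.startswith("anon_inode:"):
            kinds[target] += 1          # e.g. "anon_inode:[timerfd]"
        elif match:
            kinds[match.group(1)] += 1  # "pipe" or "socket"
        else:
            kinds["file"] += 1          # regular files, devices, directories
    return kinds


if __name__ == "__main__":
    for kind, count in count_fd_kinds().most_common():
        print("%6d  %s" % (count, kind))
```

Passing the broker's PID instead of `"self"` (as root) would give the per-kind totals for the running process.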

Grepping the source for pipes shows that the hooks use pipes, and they never seem to close them.

What is the reason not to use subprocess.check_output?

> What is the reason not to use subprocess.check_output?

That would block the event loop. We need non-blocking file descriptors, which we register in the event loop, which uses epoll.
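To illustrate the difference, here is a rough sketch of reading a child's output without blocking. It is not the broker's code: the function names are invented, it uses Python 3's `selectors` module (which picks epoll on Linux) instead of the broker's own event loop, and it only watches stdout.

```python
import os
import selectors
import subprocess
import sys


def start_hook(argv, selector):
    """Start argv and register its stdout pipe with the selector."""
    process = subprocess.Popen(argv, stdout=subprocess.PIPE)
    # Non-blocking mode: a read never stalls the loop waiting for the child.
    os.set_blocking(process.stdout.fileno(), False)
    selector.register(process.stdout, selectors.EVENT_READ, data=process)
    return process


def pump(selector, timeout=10.0):
    """Tiny event loop: collect output until every registered pipe hits EOF."""
    output = {}
    while selector.get_map():
        events = selector.select(timeout)
        if not events:
            break  # nothing became readable within the timeout
        for key, _mask in events:
            process = key.data
            try:
                chunk = os.read(key.fd, 4096)
            except BlockingIOError:
                continue  # spurious wakeup, try again on the next round
            if chunk:
                output[process.pid] = output.get(process.pid, b"") + chunk
            else:
                # EOF: unregister, close the pipe, and reap the child.
                selector.unregister(key.fileobj)
                key.fileobj.close()
                process.wait()
    return output


if __name__ == "__main__":
    sel = selectors.DefaultSelector()
    hook = start_hook([sys.executable, "-c", "print('hook ran')"], sel)
    print(pump(sel)[hook.pid])
```

With `subprocess.check_output`, the call would not return until the child exits, so no other tunnel could be serviced in the meantime.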

It looks like HookProcess.close doesn't close the pipe file descriptors after unregistering them from the event loop. Could you try adding self.process.stdout.close() (and the same for stderr) at the end of close?
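A sketch of what that suggestion amounts to, using a stand-in class. `HookRunner` and `open_fd_count` are hypothetical names, not the broker's actual `HookProcess`; the event-loop unregistration is only indicated by a comment, and the code assumes Python 3 on Linux:

```python
import os
import subprocess
import sys


def open_fd_count():
    """Number of descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))


class HookRunner:
    """Hypothetical stand-in for the broker's HookProcess."""

    def __init__(self, argv):
        self.process = subprocess.Popen(
            argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )

    def close(self):
        # The real close() would first unregister both pipes from the
        # event loop. Unregistering alone does not release the descriptors.
        self.process.stdout.close()
        self.process.stderr.close()
        # Reap the child so it does not linger as a zombie.
        self.process.wait()


if __name__ == "__main__":
    before = open_fd_count()
    hooks = [HookRunner([sys.executable, "-c", "pass"]) for _ in range(5)]
    print("descriptors held by 5 hooks:", open_fd_count() - before)
    for hook in hooks:
        hook.close()
    print("descriptors held after close:", open_fd_count() - before)
```

Each hook holds two pipe read ends for as long as its process object stays referenced and unclosed, which matches the steady growth of `pipe:[...]` entries reported above.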

> Could you try adding self.process.stdout.close() (and the same for stderr) at the end of close?

Already done, rolled out to one server, and confirmed to get rid of the hundreds of pipes. :) See #59.

After 12h of uptime, we now have the following FD usage:

# ls -lah | fgrep socket -c
143
root@gw3:/proc/6638/fd# ls -lah | fgrep timer -c
278

It looks like the big leak did indeed get fixed.