High CPU load due to a single misbehaving client
RalfJung opened this issue · comments
We occasionally see a client misbehave and establish multiple connections at the same time to all our servers. For some reason, even when there are just around 20 connections per 10 minutes, this causes 100% CPU load in tunneldigger. Python is not the most efficient language, but this seems excessive -- I'd like to better understand where in the broker all that CPU time is spent. Unfortunately, so far I have found no good way to do such an analysis for Python (what I am looking for is something like callgrind).
I did a cProfile run of this (on the live system under the problematic load situation), so now I can start analyzing that profiler data. So far I am not sure what to conclude.
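For anyone who wants to reproduce this kind of analysis: the tables below are the output of the stdlib pstats module sorted by cumulative time. A minimal self-contained sketch (profiling a toy workload here rather than the broker itself):

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for the broker's event loop; any CPU-bound code works.
    return sum(i * i for i in range(1000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by "cumtime" and print the top entries, like the table below.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

On the live system one would instead dump the stats to a file (cProfile's -o option) and load them later with pstats.Stats("filename").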
Here are the functions with the highest "cumtime":
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.000 0.000 409.537 204.769 main.py:1(<module>)
51/1 0.000 0.000 204.769 204.769 {built-in method builtins.exec}
71/1 0.000 0.000 204.769 204.769 <frozen importlib._bootstrap>:978(_find_and_load)
71/1 0.000 0.000 204.769 204.769 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
104/2 0.000 0.000 204.769 102.384 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
68/2 0.000 0.000 204.768 102.384 <frozen importlib._bootstrap>:663(_load_unlocked)
47/2 0.000 0.000 204.768 102.384 <frozen importlib._bootstrap_external>:722(exec_module)
1 79.314 79.314 203.672 203.672 eventloop.py:44(start)
60789078 112.203 0.000 112.203 0.000 {method 'poll' of 'select.epoll' objects}
84148432 11.585 0.000 11.585 0.000 {method 'get' of 'dict' objects}
22 0.002 0.000 1.448 0.066 tunnel.py:239(close)
91 0.001 0.000 1.392 0.015 netlink.py:127(send)
91 0.000 0.000 1.392 0.015 netlink.py:152(send)
91 1.391 0.015 1.391 0.015 {method 'send' of '_socket.socket' objects}
22 0.001 0.000 1.388 0.063 l2tp.py:181(session_delete)
1 0.000 0.000 1.048 1.048 broker.py:194(close)
481 0.002 0.000 0.421 0.001 network.py:88(read)
461 0.004 0.000 0.416 0.001 tunnel.py:224(keepalive)
559 0.003 0.000 0.129 0.000 network.py:154(read)
66 0.002 0.000 0.085 0.001 hooks.py:136(run_hook)
71 0.001 0.000 0.080 0.001 broker.py:249(message)
559 0.002 0.000 0.079 0.000 protocol.py:94(message)
23 0.000 0.000 0.073 0.003 broker.py:236(create_tunnel)
23 0.001 0.000 0.073 0.003 broker.py:67(create_tunnel)
44 0.002 0.000 0.072 0.002 hooks.py:18(__init__)
45 0.003 0.000 0.071 0.002 subprocess.py:656(__init__)
22 0.001 0.000 0.066 0.003 tunnel.py:104(setup_tunnel)
45 0.007 0.000 0.064 0.001 subprocess.py:1383(_execute_child)
488 0.003 0.000 0.037 0.000 tunnel.py:310(message)
562 0.036 0.000 0.036 0.000 {built-in method posix.read}
2 0.000 0.000 0.034 0.017 limits.py:24(configure)
8 0.000 0.000 0.033 0.004 traffic_control.py:18(tc)
Does that look as expected? I am not sure. The number of calls to "method 'poll' of 'select.epoll' objects" and "method 'get' of 'dict' objects" seems rather large, in particular the former. Maybe we are just in a too-tight epoll loop? I also have an (older) pcap file with traffic from that client, showing that it averages 440 packets per second, but almost all of these are packets from inside the tunnel; the number of control packets (which tunneldigger would actually interpret) is very low. So maybe tunneldigger has to wake up for data packets as well? I am not sure whether the kernel sends those packets to userspace or not. Cc @kaechele
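To illustrate the too-tight-loop theory: with level-triggered epoll (the default), a registered fd that stays ready is reported again on every single poll() call until the handler actually clears the condition. A minimal sketch, with a pipe standing in for a tunnel socket whose event is never consumed:

```python
import os
import select

# A pipe with one unread byte: the read end stays readable forever.
r, w = os.pipe()
os.write(w, b"x")

ep = select.epoll()
ep.register(r, select.EPOLLIN)

wakeups = 0
for _ in range(1000):
    events = ep.poll(0)  # returns immediately, event is still pending
    if events:
        wakeups += 1  # a real handler would os.read() here to clear it

ep.close()
os.close(r)
os.close(w)
```

Every one of the 1000 poll() calls reports the same pending event, so an event that nothing consumes turns the main loop into a busy loop.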
Here are the most-called functions:
84148432 11.585 0.000 11.585 0.000 {method 'get' of 'dict' objects}
60789078 112.203 0.000 112.203 0.000 {method 'poll' of 'select.epoll' objects}
6169/6069 0.001 0.000 0.001 0.000 {built-in method builtins.len}
3474 0.001 0.000 0.001 0.000 {built-in method builtins.isinstance}
2282 0.001 0.000 0.001 0.000 {built-in method _struct.pack}
2003 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1922 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
1609 0.001 0.000 0.001 0.000 {built-in method builtins.hasattr}
1605 0.001 0.000 0.001 0.000 sre_parse.py:233(__next)
1564 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
1531 0.000 0.000 0.000 0.000 {method 'isupper' of 'str' objects}
1410 0.000 0.000 0.001 0.000 sre_parse.py:254(get)
1251 0.001 0.000 0.001 0.000 {built-in method time.time}
1125 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
963 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:222(_verbose_message)
900 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:58(<listcomp>)
900 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:56(_path_join)
862 0.001 0.000 0.001 0.000 {built-in method _struct.unpack}
735 0.000 0.000 0.000 0.000 {built-in method posix.fspath}
650 0.006 0.000 0.006 0.000 {method 'recvfrom' of '_socket.socket' objects}
607 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
576 0.010 0.000 0.010 0.000 {method 'sendto' of '_socket.socket' objects}
576 0.001 0.000 0.011 0.000 network.py:115(write)
576 0.003 0.000 0.015 0.000 network.py:128(write_message)
562 0.036 0.000 0.036 0.000 {built-in method posix.read}
559 0.003 0.000 0.129 0.000 network.py:154(read)
559 0.002 0.000 0.003 0.000 protocol.py:50(parse_message)
559 0.002 0.000 0.079 0.000 protocol.py:94(message)
The trace file says it was running for 200s, so 3000 calls does not seem excessive at all to me.
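By the same arithmetic, the two outliers from the first table are in a different league entirely (numbers taken straight from the profile, over the reported 200 s):

```python
# Calls per second for the two hot entries in the profile,
# over the 200 s trace duration reported above.
TRACE_SECONDS = 200
calls = {
    "epoll.poll": 60_789_078,
    "dict.get": 84_148_432,
}
rates = {name: n / TRACE_SECONDS for name, n in calls.items()}
# epoll.poll alone works out to roughly 300,000 calls per second.
```

That is hundreds of thousands of poll() calls per second against ~440 packets per second from the client, which fits the busy-loop theory.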
When the high load starts, I am seeing a ton of wakeups in our main epoll loop that all have event flag 0x8 set -- that's select.EPOLLERR. The events are all for tunnels associated with the misbehaving client.
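For reference, 0x8 is indeed select.EPOLLERR, and per epoll(7) EPOLLERR is always reported whether or not it was requested in register(). So unless the handler clears the error condition (e.g. by reading SO_ERROR off the socket) or closes the fd, every subsequent poll() returns immediately with the same event -- which would explain both the 60M poll calls and the 100% CPU. A quick sketch (the decode helper is mine, not tunneldigger code):

```python
import select
import socket

# Sanity check: the 0x8 flag seen in the wakeups is EPOLLERR.
assert select.EPOLLERR == 0x8

def decode(mask):
    """Hypothetical helper: turn an epoll event mask into flag names."""
    names = {
        select.EPOLLIN: "EPOLLIN",
        select.EPOLLOUT: "EPOLLOUT",
        select.EPOLLERR: "EPOLLERR",
        select.EPOLLHUP: "EPOLLHUP",
    }
    return [name for bit, name in names.items() if mask & bit]

print(decode(0x8))

# One way a handler can clear a pending error on a socket:
# reading SO_ERROR consumes it (0 means no error is pending).
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
sock.close()
```

On a UDP tunnel socket, an EPOLLERR like this is typically an asynchronous error (for example an ICMP port-unreachable from the misbehaving client) queued on the socket.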