High CPU load due to a single misbehaving client
RalfJung opened this issue · comments
We occasionally see a client misbehave and establish multiple connections at the same time to all our servers. For some reason, even when there are just around 20 connections per 10 minutes, this causes 100% CPU load in tunneldigger. Python is not the most efficient language, but this seems excessive -- I'd like to better understand where in the broker all that CPU time is spent. Unfortunately, so far I have found no good way to do such an analysis for Python (what I am looking for is something like callgrind).
I did a cProfile run of this (on the live system under the problematic load situation), so now I can start analyzing that profiler data. So far I am not sure what to conclude.
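For anyone who wants to reproduce this kind of analysis: the tables below are the output of the stdlib pstats module sorted by cumulative time. A minimal self-contained sketch (profiling a toy workload here rather than the broker itself):

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for the broker's event loop; any CPU-bound code works.
    return sum(i * i for i in range(1000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by "cumtime" and print the top entries, like the table below.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

On the live system one would instead dump the stats to a file (cProfile's -o option) and load them later with pstats.Stats("filename").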
Here are the functions with the highest "cumtime":
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.000 0.000 409.537 204.769 main.py:1(<module>)
51/1 0.000 0.000 204.769 204.769 {built-in method builtins.exec}
71/1 0.000 0.000 204.769 204.769 <frozen importlib._bootstrap>:978(_find_and_load)
71/1 0.000 0.000 204.769 204.769 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
104/2 0.000 0.000 204.769 102.384 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
68/2 0.000 0.000 204.768 102.384 <frozen importlib._bootstrap>:663(_load_unlocked)
47/2 0.000 0.000 204.768 102.384 <frozen importlib._bootstrap_external>:722(exec_module)
1 79.314 79.314 203.672 203.672 eventloop.py:44(start)
60789078 112.203 0.000 112.203 0.000 {method 'poll' of 'select.epoll' objects}
84148432 11.585 0.000 11.585 0.000 {method 'get' of 'dict' objects}
22 0.002 0.000 1.448 0.066 tunnel.py:239(close)
91 0.001 0.000 1.392 0.015 netlink.py:127(send)
91 0.000 0.000 1.392 0.015 netlink.py:152(send)
91 1.391 0.015 1.391 0.015 {method 'send' of '_socket.socket' objects}
22 0.001 0.000 1.388 0.063 l2tp.py:181(session_delete)
1 0.000 0.000 1.048 1.048 broker.py:194(close)
481 0.002 0.000 0.421 0.001 network.py:88(read)
461 0.004 0.000 0.416 0.001 tunnel.py:224(keepalive)
559 0.003 0.000 0.129 0.000 network.py:154(read)
66 0.002 0.000 0.085 0.001 hooks.py:136(run_hook)
71 0.001 0.000 0.080 0.001 broker.py:249(message)
559 0.002 0.000 0.079 0.000 protocol.py:94(message)
23 0.000 0.000 0.073 0.003 broker.py:236(create_tunnel)
23 0.001 0.000 0.073 0.003 broker.py:67(create_tunnel)
44 0.002 0.000 0.072 0.002 hooks.py:18(__init__)
45 0.003 0.000 0.071 0.002 subprocess.py:656(__init__)
22 0.001 0.000 0.066 0.003 tunnel.py:104(setup_tunnel)
45 0.007 0.000 0.064 0.001 subprocess.py:1383(_execute_child)
488 0.003 0.000 0.037 0.000 tunnel.py:310(message)
562 0.036 0.000 0.036 0.000 {built-in method posix.read}
2 0.000 0.000 0.034 0.017 limits.py:24(configure)
8 0.000 0.000 0.033 0.004 traffic_control.py:18(tc)
Does that look as expected? I am not sure. The number of calls to "method 'poll' of 'select.epoll' objects" and "method 'get' of 'dict' objects" seems rather large, in particular the former. Maybe we are just in a too-tight epoll loop? I also have an (older) pcap file with traffic from that client, showing that it averages 440 packets per second, but almost all of these are packets from inside the tunnel; the number of control packets (which tunneldigger would actually interpret) is very low. So maybe tunneldigger has to wake up for data packets as well? I am not sure whether the kernel sends those packets to userspace or not. Cc @kaechele
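To illustrate the too-tight-loop theory: with level-triggered epoll (the default), a registered fd that stays ready is reported again on every single poll() call until the handler actually clears the condition. A minimal sketch, with a pipe standing in for a tunnel socket whose event is never consumed:

```python
import os
import select

# A pipe with one unread byte: the read end stays readable forever.
r, w = os.pipe()
os.write(w, b"x")

ep = select.epoll()
ep.register(r, select.EPOLLIN)

wakeups = 0
for _ in range(1000):
    events = ep.poll(0)  # returns immediately, event is still pending
    if events:
        wakeups += 1  # a real handler would os.read() here to clear it

ep.close()
os.close(r)
os.close(w)
```

Every one of the 1000 poll() calls reports the same pending event, so an event that nothing consumes turns the main loop into a busy loop.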
Here are the most-called functions:
84148432 11.585 0.000 11.585 0.000 {method 'get' of 'dict' objects}
60789078 112.203 0.000 112.203 0.000 {method 'poll' of 'select.epoll' objects}
6169/6069 0.001 0.000 0.001 0.000 {built-in method builtins.len}
3474 0.001 0.000 0.001 0.000 {built-in method builtins.isinstance}
2282 0.001 0.000 0.001 0.000 {built-in method _struct.pack}
2003 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1922 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
1609 0.001 0.000 0.001 0.000 {built-in method builtins.hasattr}
1605 0.001 0.000 0.001 0.000 sre_parse.py:233(__next)
1564 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
1531 0.000 0.000 0.000 0.000 {method 'isupper' of 'str' objects}
1410 0.000 0.000 0.001 0.000 sre_parse.py:254(get)
1251 0.001 0.000 0.001 0.000 {built-in method time.time}
1125 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
963 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:222(_verbose_message)
900 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:58(<listcomp>)
900 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:56(_path_join)
862 0.001 0.000 0.001 0.000 {built-in method _struct.unpack}
735 0.000 0.000 0.000 0.000 {built-in method posix.fspath}
650 0.006 0.000 0.006 0.000 {method 'recvfrom' of '_socket.socket' objects}
607 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
576 0.010 0.000 0.010 0.000 {method 'sendto' of '_socket.socket' objects}
576 0.001 0.000 0.011 0.000 network.py:115(write)
576 0.003 0.000 0.015 0.000 network.py:128(write_message)
562 0.036 0.000 0.036 0.000 {built-in method posix.read}
559 0.003 0.000 0.129 0.000 network.py:154(read)
559 0.002 0.000 0.003 0.000 protocol.py:50(parse_message)
559 0.002 0.000 0.079 0.000 protocol.py:94(message)
The trace file says it was running for 200s, so 3000 calls does not seem excessive at all to me.
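By the same arithmetic, the two outliers from the first table are in a different league entirely (numbers taken straight from the profile, over the reported 200 s):

```python
# Calls per second for the two hot entries in the profile,
# over the 200 s trace duration reported above.
TRACE_SECONDS = 200
calls = {
    "epoll.poll": 60_789_078,
    "dict.get": 84_148_432,
}
rates = {name: n / TRACE_SECONDS for name, n in calls.items()}
# epoll.poll alone works out to roughly 300,000 calls per second.
```

That is hundreds of thousands of poll() calls per second against ~440 packets per second from the client, which fits the busy-loop theory.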
When the high load starts, I am seeing a ton of wakeups in our main epoll loop that all have event flag 0x8 set -- that's select.EPOLLERR. The events are all for tunnels associated with the misbehaving client.
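For reference, 0x8 is indeed select.EPOLLERR, and per epoll(7) EPOLLERR is always reported whether or not it was requested in register(). So unless the handler clears the error condition (e.g. by reading SO_ERROR off the socket) or closes the fd, every subsequent poll() returns immediately with the same event -- which would explain both the 60M poll calls and the 100% CPU. A quick sketch (the decode helper is mine, not tunneldigger code):

```python
import select
import socket

# Sanity check: the 0x8 flag seen in the wakeups is EPOLLERR.
assert select.EPOLLERR == 0x8

def decode(mask):
    """Hypothetical helper: turn an epoll event mask into flag names."""
    names = {
        select.EPOLLIN: "EPOLLIN",
        select.EPOLLOUT: "EPOLLOUT",
        select.EPOLLERR: "EPOLLERR",
        select.EPOLLHUP: "EPOLLHUP",
    }
    return [name for bit, name in names.items() if mask & bit]

print(decode(0x8))

# One way a handler can clear a pending error on a socket:
# reading SO_ERROR consumes it (0 means no error is pending).
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
sock.close()
```

On a UDP tunnel socket, an EPOLLERR like this is typically an asynchronous error (for example an ICMP port-unreachable from the misbehaving client) queued on the socket.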