JarryShaw / PyPCAPKit

Python-based Comprehensive Network Packet Analysis Library

Home Page:https://jarryshaw.github.io/PyPCAPKit/

InfoClass instantiation calls `os.stat`, ruining performance

59e5aaf4 opened this issue · comments

(thanks for having written the best PCAP parsing library out there !)

The bug

Problem: pcap parsing is slow; it takes 1 s to parse a single .pcap file containing 1 packet.

When parsing packet layers (our use case is TCP reassembly with pcapkit.extract().reassembly.tcp), InfoClass objects (Info) are instantiated quite a large number of times, as expected.

In the initialization code of said classes (https://github.com/JarryShaw/PyPCAPKit/blob/master/src/corekit/infoclass.py#L55), validations.dict_check is usually called.

validations.dict_check itself calls inspect.stack() (here: https://github.com/JarryShaw/PyPCAPKit/blob/master/src/utilities/validations.py#L128), and here lies the problem: inspect.stack() causes os.stat to be called.
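To get a rough feel for that cost, here is a minimal sketch (not PyPCAPKit code) that times the `inspect.stack()[2][3]` lookup against `sys._getframe(2).f_code.co_name`, which returns the same function name on CPython without building source context for every frame. The helper names (`name_via_inspect`, `name_via_getframe`, `middle`, `entry_point`, `measure`) are made up for illustration, and `sys._getframe` is a CPython implementation detail, so treat this as a comparison under those assumptions rather than a proposed fix.

```python
import inspect
import sys
import timeit


def name_via_inspect():
    """Name of the caller's caller, looked up the way the issue describes."""
    return inspect.stack()[2][3]


def name_via_getframe():
    """Same name via sys._getframe (CPython-specific), no source lookups."""
    return sys._getframe(2).f_code.co_name


def middle(lookup):
    # Plays the role of the function that calls the check.
    return lookup()


def entry_point(lookup):
    # Plays the role of the function two frames above the lookup.
    return middle(lookup)


def measure(lookup, number=200):
    """Average seconds per call over `number` calls."""
    return timeit.timeit(lambda: entry_point(lookup), number=number) / number


slow = measure(name_via_inspect)
fast = measure(name_via_getframe)
print(f"inspect.stack(): {slow * 1e6:.1f} us/call")
print(f"sys._getframe(): {fast * 1e6:.1f} us/call")
```

The absolute numbers depend heavily on the platform and stack depth; on a filesystem with slow `stat` calls the gap is likely to be much wider.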

See the bottom of the following trace graph, generated with:

python3 -m cProfile -o /tmp/parse_pcap.pstats ./my-script.py
gprof2dot -f pstats /tmp/parse_pcap.pstats | dot -Tpng -o /tmp/parse_pcap.png

[Image: parse_pcap trace graph]
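The same kind of profile can also be inspected from Python with the standard `pstats` module instead of rendering a graph. The sketch below is self-contained: `workload` is a hypothetical stand-in for the real script, and the temporary stats file replaces `/tmp/parse_pcap.pstats`.

```python
import cProfile
import inspect
import os
import pstats
import tempfile


def workload(n=200):
    """Stand-in for the real script: call inspect.stack() repeatedly."""
    for _ in range(n):
        inspect.stack()


def profile_to_file(path):
    """Profile workload() and dump the stats, like `python3 -m cProfile -o`."""
    profiler = cProfile.Profile()
    profiler.runcall(workload)
    profiler.dump_stats(path)


def cumulative_times(path, name):
    """Return {(file, line, func): cumulative seconds} for functions named `name`."""
    stats = pstats.Stats(path)
    # stats.stats maps (file, line, func) -> (cc, nc, tt, ct, callers)
    return {key: value[3] for key, value in stats.stats.items() if key[2] == name}


fd, stats_path = tempfile.mkstemp(suffix=".pstats")
os.close(fd)
profile_to_file(stats_path)
stack_times = cumulative_times(stats_path, "stack")
for (filename, line, func), seconds in stack_times.items():
    print(f"{func} ({os.path.basename(filename)}:{line}): {seconds:.4f}s cumulative")
os.remove(stats_path)
```

Filtering on `"stack"` here is only an example; any function name that shows up in the graph can be passed instead.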

System information

  • OS Version: CYGWIN_NT-10.0 3.0.7(0.338/5/3) 2019-04-30 18:08 unknown unknown Cygwin (uname -srvpio)
  • Python Version: 3.6.8
  • Python Implementation: CPython [GCC 7.4.0] on cygwin

Traceback stack
Run program again with PCAPKIT_DEVMODE=true set to provide the traceback stack.

It's not a crash, but to reproduce this I just had to press Ctrl-C, since 90% of the execution time is spent there :)

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.6/cProfile.py", line 161, in <module>
    main()
  File "/usr/lib/python3.6/cProfile.py", line 154, in main
    runctx(code, globs, None, options.outfile, options.sort)
  File "/usr/lib/python3.6/cProfile.py", line 20, in runctx
    filename, sort)
  File "/usr/lib/python3.6/profile.py", line 64, in runctx
    prof.runctx(statement, globals, locals)
  File "/usr/lib/python3.6/cProfile.py", line 100, in runctx
    exec(cmd, globals, locals)

  File "redacted.py", line 316, in operate
    p = pcap.PCAP(filename = self.sf.get_pcap_path(query))
  File "redacted/pcap.py", line 90, in __init__
    'tcp': True,
  File "/usr/lib/python3.6/site-packages/pcapkit/interface/__init__.py", line 131, in extract
    trace_byteorder=trace_byteorder, trace_nanosecond=trace_nanosecond)
  File "/usr/lib/python3.6/site-packages/pcapkit/foundation/extraction.py", line 533, in __init__
    self.run()                      # start extraction
  File "/usr/lib/python3.6/site-packages/pcapkit/foundation/extraction.py", line 261, in run
    self.record_frames()            # read frames
  File "/usr/lib/python3.6/site-packages/pcapkit/foundation/extraction.py", line 376, in record_frames
    self._read_frame()
  File "/usr/lib/python3.6/site-packages/pcapkit/foundation/extraction.py", line 612, in _read_frame
    return self._default_read_frame()
  File "/usr/lib/python3.6/site-packages/pcapkit/foundation/extraction.py", line 629, in _default_read_frame
    layer=self._exlyr, protocol=self._exptl, nanosecond=self._nnsec)
  File "/usr/lib/python3.6/site-packages/pcapkit/protocols/pcap/frame.py", line 158, in __init__
    self._info = Info(self.read_frame())
  File "/usr/lib/python3.6/site-packages/pcapkit/corekit/infoclass.py", line 56, in __new__
    self.__dict__.update(__read__(dict_))
  File "/usr/lib/python3.6/site-packages/pcapkit/corekit/infoclass.py", line 38, in __read__
    __dict__[key] = Info(value)
  File "/usr/lib/python3.6/site-packages/pcapkit/corekit/infoclass.py", line 55, in __new__
    dict_check(dict_)
  File "/usr/lib/python3.6/site-packages/pcapkit/utilities/validations.py", line 128, in dict_check
    func = func or inspect.stack()[2][3]
  File "/usr/lib/python3.6/inspect.py", line 1501, in stack
    return getouterframes(sys._getframe(1), context)
  File "/usr/lib/python3.6/inspect.py", line 1478, in getouterframes
    frameinfo = (frame,) + getframeinfo(frame, context)
  File "/usr/lib/python3.6/inspect.py", line 1452, in getframeinfo
    lines, lnum = findsource(frame)
  File "/usr/lib/python3.6/inspect.py", line 768, in findsource
    file = getsourcefile(object)
  File "/usr/lib/python3.6/inspect.py", line 693, in getsourcefile
    if os.path.exists(filename):
  File "/usr/lib/python3.6/genericpath.py", line 19, in exists
    os.stat(path)
KeyboardInterrupt
> /usr/lib/python3.6/genericpath.py(19)exists()
-> os.stat(path)
(Pdb)

Expected behavior

Ideally, pcapkit would not call posix.stat so often, because it would choose not to rely on inspect features. That would make it significantly faster, probably?

But I guess there's a reason for inspect to be used here, as well as for the whole https://github.com/JarryShaw/PyPCAPKit/blob/master/src/utilities/validations.py file. And now that I see https://github.com/JarryShaw/PyPCAPKit/blob/master/src/utilities/validations.py#L10 , I'm starting to think you foresaw this potential problem.

May I suggest moving func = func or inspect.stack()[2][3] after the inner check of each validation, so that it's only called if the check fails? (Otherwise func seems not to be used, but I might be wrong.)

Thanks for reading, and thanks again for the (awesome) lib !

Indeed, moving every func = func or inspect.stack()[2][3] to the end of these xxx_check functions, so it only runs on failure, speeds things up by roughly 10x on my host (I don't have proper metrics for this yet).

While the first entries can simply be edited, the fragmentation-related checks seem to be more complex. inspect.stack()[2][3] seems to resolve to the name of the function two frames up the stack, which is only ever used when raising an Exception.
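As a quick illustration of what that index expression resolves to, here is a small standalone sketch. The function names (`check`, `constructor`, `parser`) are invented for the example and are not taken from PyPCAPKit.

```python
import inspect


def check():
    # Each entry of inspect.stack() is a FrameInfo; index 3 is the function name.
    # Entry 0 is this function, entry 1 its caller, entry 2 the caller's caller.
    return [frame_info[3] for frame_info in inspect.stack()[:3]]


def constructor():
    return check()


def parser():
    return constructor()


names = parser()
print(names)  # ['check', 'constructor', 'parser']
```

So inside a check function, `inspect.stack()[2][3]` names the function that called the check's caller, which is only useful for building an error message.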

Question: since the native Python error handler prints the entire call stack when an Exception is raised, do we even need to collect these func= values?

Additionally, it seems the parsing-related exceptions raised when attempting to decode FTP/HTTP (I guess you're using exceptions as expected signals of inability to decode a protocol) also cause os.stat calls, in utilities/exceptions.py:37:stacklevel():

[Image: parse_pcap_switched trace graph]

Indeed, moving the index = stacklevel() call in BaseError's __init__() into the inner condition (after if not quiet) removes all calls to os.stat!

[Image: parse_pcap_switched_index trace graph]
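The idea of only walking the stack for non-quiet errors can be sketched as follows. This is not the library's code: `SketchError`, `stack_depth`, `try_decode` and the `CALLS` counter are hypothetical names, and `traceback.extract_stack()` is only assumed here as a stand-in for whatever stacklevel() does internally.

```python
import traceback

CALLS = {"stack_depth": 0}  # counts how often the expensive helper runs


def stack_depth():
    """Hypothetical stand-in for stacklevel(): walks the whole stack."""
    CALLS["stack_depth"] += 1
    return len(traceback.extract_stack())


class SketchError(Exception):
    """Hypothetical error class; `quiet` marks exceptions used as signals."""

    def __init__(self, *args, quiet=False):
        super().__init__(*args)
        self.depth = None
        if not quiet:
            # Only pay for the stack walk when the error will be reported.
            self.depth = stack_depth()


def try_decode(payload):
    """Pretend protocol probe: failure is an expected, quiet signal."""
    try:
        if not payload.startswith(b"HTTP"):
            raise SketchError("not HTTP", quiet=True)
        return "http"
    except SketchError:
        return None


for _ in range(1000):
    try_decode(b"\x00\x01")
print(CALLS["stack_depth"])  # 0: quiet errors never walked the stack
```

With this layout, exceptions used as control flow stay cheap, while errors that are actually reported still get their stack information.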

I'll see if I can create a pull request.

OMG! Thanks for your comprehensive issue! I've been working on other projects recently, so I haven't been focusing much here.

Currently, on my TODO list, I need to revise the protocol extraction (pcapkit.protocols.protocol.Protocol and its subclasses) of PyPCAPKit to speed it up a little bit, while I'm also considering removing the runtime validators, since they are just legacy code from my PyNTLib project.

If you only use the reassembly function of PyPCAPKit, it is highly recommended to combine it with DPKT, which is the fastest PCAP extraction solution implemented in Python as far as I know.

Anyway, I'd always appreciate your feedback and help~ 😄

Well, I'm not sure if I'll ever be able to publicly push these small changes, but here are two pieces of information:

  1. speed changes are (as expected) significant: ×57 to ×144!

(Y : time per PCAP , X : PCAP number)

  2. changes are really trivial, as shown below: it's just a matter of moving or removing some calls. Doesn't seem to break anything (?? yolo).

V1: just move the inspect calls in the *_check functions into the failure branch

diff --git a/src/utilities/validations.py b/src/utilities/validations.py
index 4764a77..5972121 100644
--- a/src/utilities/validations.py
+++ b/src/utilities/validations.py
@@ -35,120 +35,120 @@ __all__ = [

 def int_check(*args, func=None):
     """Check if arguments are integrals."""
-    func = func or inspect.stack()[2][3]
     for var in args:
         if not isinstance(var, numbers.Integral):
             name = type(var).__name__
+            func = func or inspect.stack()[2][3]
             raise ComplexError(
                 f'Function {func} expected integral number, {name} got instead.')

V2: also remove the inspect calls in the reassembly checks

diff --git a/src/utilities/validations.py b/src/utilities/validations.py
index 5972121..a0feb89 100644
--- a/src/utilities/validations.py
+++ b/src/utilities/validations.py
@@ -187,7 +187,6 @@ def enum_check(*args, func=None):

 def frag_check(*args, protocol, func=None):
     """Check if arguments are valid fragments."""
-    func = func or inspect.stack()[2][3]
     if 'IP' in protocol:
         _ip_frag_check(*args, func=func)
     elif 'TCP' in protocol:

Have a good day,

Finally, this issue will be fixed in v0.14.3.