miurahr / py7zr

7zip in python3 with ZStandard, PPMd, LZMA2, LZMA1, Delta, BCJ, BZip2, and Deflate compressions, and AES encryption.

Home Page: https://pypi.org/project/py7zr/

Slow gc.collect() on close()

capyvara opened this issue · comments

Especially when reading many small files, the time spent in gc.collect() is greater than the time spent decompressing data.

For example: 500 ~90 KB files (600 KB uncompressed), LZMA; it can only read ~12 files per second.
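The profile below came from a loop along these lines (a minimal sketch, not the actual benchmark script; the glob pattern and tqdm progress bar are assumptions):

    import cProfile
    import glob
    import py7zr
    from tqdm import tqdm

    # Many small single-item .7z archives (hypothetical location).
    paths = glob.glob("data/*.7z")

    def read_archives():
        # One small archive per iteration; each close() on exit from the
        # context manager triggers a full gc.collect().
        for path in tqdm(paths):
            with py7zr.SevenZipFile(path, mode="r") as archive:
                archive.readall()  # returns {filename: BytesIO} for all members

    cProfile.run("read_archives()", sort="cumulative")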

  0%|▏                | 500/472075 [00:41<10:45:57, 12.17it/s]
         2249416 function calls (2217648 primitive calls) in 44.815 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    883/1    0.003    0.000   44.816   44.816 {built-in method builtins.exec}
...
      512    0.001    0.000   36.190    0.071 py7zr.py:408(__exit__)
      512    0.003    0.000   36.189    0.071 py7zr.py:1081(close)
      512    0.005    0.000   36.167    0.071 py7zr.py:816(_var_release)
      512   36.161    0.071   36.161    0.071 {built-in method gc.collect}
      512    0.001    0.000    3.510    0.007 py7zr.py:970(readall)
      512    0.017    0.000    3.508    0.007 py7zr.py:524(_extract)
      512    0.003    0.000    3.363    0.007 py7zr.py:1203(extract)
      512    0.002    0.000    3.283    0.006 py7zr.py:1271(extract_single)
      512    0.042    0.000    3.280    0.006 py7zr.py:1298(_extract_single)
      502    0.063    0.000    3.223    0.006 py7zr.py:1377(decompress)
     1292    0.037    0.000    3.071    0.002 compressor.py:681(decompress)
     1292    0.006    0.000    2.993    0.002 compressor.py:652(_decompress)
     1292    0.002    0.000    2.987    0.002 compressor.py:559(decompress)
     1292    2.985    0.002    2.985    0.002 {method 'decompress' of '_lzma.LZMADecompressor' objects}

Commenting out the gc.collect() in _var_release() (ref) speeds it up to ~138 files per second, over 10x faster (a user-side workaround is sketched after the profile below):

  0%|▏                    | 500/472075 [00:03<56:33, 138.97it/s]
         1988045 function calls (1956277 primitive calls) in 7.341 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    883/1    0.003    0.000    7.341    7.341 {built-in method builtins.exec}
...
      512    0.001    0.000    2.473    0.005 py7zr.py:970(readall)
      512    0.011    0.000    2.472    0.005 py7zr.py:524(_extract)
      512    0.002    0.000    2.356    0.005 py7zr.py:1203(extract)
      5/1    0.034    0.007    2.329    2.329 {method 'to_pandas' of 'pyarrow.lib._PandasConvertible' objects}
        1    0.004    0.004    2.329    2.329 pandas_compat.py:797(table_to_blockmanager)
      512    0.001    0.000    2.300    0.004 py7zr.py:1271(extract_single)
      512    0.011    0.000    2.298    0.004 py7zr.py:1298(_extract_single)
      502    0.014    0.000    2.274    0.005 py7zr.py:1377(decompress)
        1    0.000    0.000    2.210    2.210 pandas_compat.py:1165(_table_to_blocks)
        1    2.209    2.209    2.209    2.209 {pyarrow.lib.table_to_blocks}
     1292    0.024    0.000    2.185    0.002 compressor.py:681(decompress)
     1292    0.004    0.000    2.134    0.002 compressor.py:652(_decompress)
     1292    0.001    0.000    2.130    0.002 compressor.py:559(decompress)
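Until the library changes, one user-side workaround is to stub out gc.collect() around the loop and collect once at the end. This is a hack I'm sketching here, not anything py7zr provides; paths is the same list as in the earlier sketch:

    import gc
    import py7zr

    # Temporarily make gc.collect a no-op so py7zr's close() no longer
    # forces a full collection per archive, then restore it and run a
    # single real collection afterwards.
    _orig_collect = gc.collect
    gc.collect = lambda *args, **kwargs: 0
    try:
        for path in paths:
            with py7zr.SevenZipFile(path, mode="r") as archive:
                archive.readall()
    finally:
        gc.collect = _orig_collect
        gc.collect()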

IMHO it should not be the library's responsibility to force a manual GC; users should manage their own memory, and the library should only ensure it has no memory leaks of its own.

I see a commit from about a year ago that added it; I'm not sure what case it was trying to solve.

Environment (please complete the following information):

  • OS: macOS 12.5
  • Python 3.9.13
  • py7zr version: 0.20.0

Good catch!

Strange... There is no big difference in the benchmark scores...
#297 (comment)

Is anything wrong?

Here is the test code. @capyvara, could you improve the benchmark test code? I think the target data is relatively small compared to your conditions.

https://github.com/miurahr/py7zr/blob/master/tests/test_benchmark.py

@miurahr I'm not fully sure we can test this automatically, because gc.collect() performance depends on the environment. In my case I had a 500k-row pandas dataframe loaded; maybe allocate a huge number of dummy objects?
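A rough sketch of that idea (the object counts are arbitrary): pre-allocate enough gc-tracked container objects that a full collection becomes measurably expensive, mimicking a process with a large DataFrame loaded.

    import gc
    import time

    # Ballast: millions of gc-tracked objects (dicts and lists are tracked),
    # so each full gc.collect() has a large object graph to traverse.
    ballast = [{"row": i, "payload": [i, i + 1]} for i in range(1_000_000)]

    start = time.perf_counter()
    gc.collect()
    print(f"one gc.collect() with ballast: {time.perf_counter() - start:.3f}s")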

Testing inside a Jupyter notebook was even worse; it was reading something like 1 item/s (presumably because the notebook keeps even more objects alive for each collection to scan).