Slow gc.collect() on close()
capyvara opened this issue
Especially when reading many small files, the time spent in gc.collect() is greater than the time spent decompressing the data.
For example, with 500 LZMA-compressed files of ~90 KB each (~600 KB uncompressed), it can only read ~12 files per second:
0%|▏ | 500/472075 [00:41<10:45:57, 12.17it/s]
2249416 function calls (2217648 primitive calls) in 44.815 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
883/1 0.003 0.000 44.816 44.816 {built-in method builtins.exec}
...
512 0.001 0.000 36.190 0.071 py7zr.py:408(__exit__)
512 0.003 0.000 36.189 0.071 py7zr.py:1081(close)
512 0.005 0.000 36.167 0.071 py7zr.py:816(_var_release)
512 36.161 0.071 36.161 0.071 {built-in method gc.collect}
512 0.001 0.000 3.510 0.007 py7zr.py:970(readall)
512 0.017 0.000 3.508 0.007 py7zr.py:524(_extract)
512 0.003 0.000 3.363 0.007 py7zr.py:1203(extract)
512 0.002 0.000 3.283 0.006 py7zr.py:1271(extract_single)
512 0.042 0.000 3.280 0.006 py7zr.py:1298(_extract_single)
502 0.063 0.000 3.223 0.006 py7zr.py:1377(decompress)
1292 0.037 0.000 3.071 0.002 compressor.py:681(decompress)
1292 0.006 0.000 2.993 0.002 compressor.py:652(_decompress)
1292 0.002 0.000 2.987 0.002 compressor.py:559(decompress)
1292 2.985 0.002 2.985 0.002 {method 'decompress' of '_lzma.LZMADecompressor' objects}
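For reference, a minimal sketch of the kind of loop being profiled (the "data/*.7z" path and the 500-file slice are placeholders, not my actual pipeline):

```python
# Repro sketch: read many small .7z archives in a loop under cProfile.
# "data/*.7z" is a placeholder path, not the real dataset.
import cProfile
from pathlib import Path

import py7zr

def read_archives(paths):
    for path in paths:
        # __exit__ -> close() -> _var_release() is where gc.collect() runs
        with py7zr.SevenZipFile(path, mode="r") as archive:
            archive.readall()  # returns {filename: file-like object}

paths = sorted(Path("data").glob("*.7z"))[:500]
cProfile.run("read_archives(paths)", sort="cumulative")
```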
Commenting out the gc.collect() call in _var_release() (ref), it speeds up to ~138 files per second, about 10x faster (a user-side workaround sketch follows the profile below):
0%|▏ | 500/472075 [00:03<56:33, 138.97it/s]
1988045 function calls (1956277 primitive calls) in 7.341 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
883/1 0.003 0.000 7.341 7.341 {built-in method builtins.exec}
...
512 0.001 0.000 2.473 0.005 py7zr.py:970(readall)
512 0.011 0.000 2.472 0.005 py7zr.py:524(_extract)
512 0.002 0.000 2.356 0.005 py7zr.py:1203(extract)
5/1 0.034 0.007 2.329 2.329 {method 'to_pandas' of 'pyarrow.lib._PandasConvertible' objects}
1 0.004 0.004 2.329 2.329 pandas_compat.py:797(table_to_blockmanager)
512 0.001 0.000 2.300 0.004 py7zr.py:1271(extract_single)
512 0.011 0.000 2.298 0.004 py7zr.py:1298(_extract_single)
502 0.014 0.000 2.274 0.005 py7zr.py:1377(decompress)
1 0.000 0.000 2.210 2.210 pandas_compat.py:1165(_table_to_blocks)
1 2.209 2.209 2.209 2.209 {pyarrow.lib.table_to_blocks}
1292 0.024 0.000 2.185 0.002 compressor.py:681(decompress)
1292 0.004 0.000 2.134 0.002 compressor.py:652(_decompress)
1292 0.001 0.000 2.130 0.002 compressor.py:559(decompress)
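For anyone hitting this before it is fixed, a blunt workaround sketch (the no_forced_gc helper is mine, not py7zr API): temporarily replace gc.collect with a no-op for the duration of the loop. Note this disables forced collection process-wide, so it only makes sense around a tight read loop.

```python
# Workaround sketch (assumes py7zr 0.20.0, where close() runs
# _var_release() -> gc.collect()). Temporarily no-ops gc.collect
# process-wide and restores it afterwards. For illustration only.
import contextlib
import gc

import py7zr

@contextlib.contextmanager
def no_forced_gc():
    original = gc.collect
    gc.collect = lambda *args, **kwargs: 0  # report nothing collected
    try:
        yield
    finally:
        gc.collect = original

with no_forced_gc():
    for path in paths:  # paths as in the repro sketch above
        with py7zr.SevenZipFile(path, mode="r") as archive:
            archive.readall()
```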
IMHO it should not be the library's responsibility to force a manual GC run; the user should manage their own memory, and the library should only ensure it doesn't leak memory of its own.
I see a commit from about a year ago that added it, but I'm not sure what case it was trying to solve.
Environment:
- OS: macOS 12.5
- Python: 3.9.13
- py7zr version: 0.20.0
Good catch!
Strange... there is no big difference in the benchmark scores...
#297 (comment)
Is anything wrong?
Here is the test code. @capyvara, could you improve the benchmark test code? I think the target data is small relative to your conditions.
https://github.com/miurahr/py7zr/blob/master/tests/test_benchmark.py
@miurahr I'm not sure we can test this automatically, because gc.collect() performance depends on the environment. In my case I had a 500k-row pandas DataFrame loaded; maybe the benchmark could allocate a huge number of dummy objects?
Testing inside a Jupyter notebook was even worse; it was reading something like 1 item/s.
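To illustrate that suggestion, here is a minimal sketch showing that the cost of gc.collect() scales with the number of tracked objects on the heap (the 5 million ballast size is an arbitrary assumption, not from my setup):

```python
# Sketch: gc.collect() cost grows with the number of GC-tracked objects.
# 5_000_000 is an arbitrary ballast size chosen for illustration.
import gc
import time

ballast = [[i] for i in range(5_000_000)]  # many small tracked objects

start = time.perf_counter()
gc.collect()
print(f"one full collection with ballast: {time.perf_counter() - start:.3f}s")
```

With a heap like that loaded, the ~70 ms per close() seen in the first profile (36.161 s over 512 calls) becomes plausible.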