[Feature Request] memory issue / make Run more efficient

PaulLerner opened this issue a year ago

Hi Elias,

Is your feature request related to a problem? Please describe.
I've noticed that Run (and I guess also Qrels) consumes a lot of memory (RAM) compared to a standard Python dict, e.g. a few GB instead of a few hundred MB. This becomes a problem for fairly large datasets (e.g. 1M queries).

Describe the solution you'd like
I guess it's related to the Numba representation? I have no clue how to make it more efficient, sorry :)

Reproduce
Just open your system monitor and see how the memory grows.

In [1]: import ranx
# 100,000 queries x 100 documents each: this weighs only a few hundred MB
In [2]: run_d = {str(i): {str(j): 0.0 for j in range(100)} for i in range(100000)}
# converting the same dict to a Run makes memory grow to a few GB
In [3]: run_r = ranx.Run(run_d)
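
To get a number instead of eyeballing the system monitor, here is a minimal sketch that reports how much the process's peak memory grows while each object is built. It uses only the standard-library `resource` module, so it is Unix only. The helper names (`peak_rss_mb`, `peak_growth_mb`, `make_run_dict`, `compare`) are made up for illustration and are not part of ranx.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def peak_growth_mb(build):
    """Call build() and return (result, growth of peak RSS in MB)."""
    before = peak_rss_mb()
    result = build()
    return result, peak_rss_mb() - before


def make_run_dict(n_queries, n_docs=100):
    """Same shape as the dict above: query id -> {doc id -> score}."""
    return {str(i): {str(j): 0.0 for j in range(n_docs)} for i in range(n_queries)}


def compare(n_queries=100_000):
    """Print peak-memory growth for the dict and, if installed, for ranx.Run."""
    try:
        import ranx  # imported first so the import itself is not counted below
    except ImportError:
        ranx = None
    run_d, dict_mb = peak_growth_mb(lambda: make_run_dict(n_queries))
    print(f"dict:     +{dict_mb:.0f} MB")
    if ranx is not None:
        _, run_mb = peak_growth_mb(lambda: ranx.Run(run_d))
        print(f"ranx.Run: +{run_mb:.0f} MB")
```

Calling `compare()` in a fresh interpreter should print one line per object. Because this tracks the peak, it only sees growth beyond the previous high-water mark, and the `ranx.Run` figure may also include some one-off compilation overhead, so treat the numbers as rough.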

Best,

Paul

Elias Bassani · Answer 1 · Mon Jul 10 2023 20:09:59 GMT+0800 (China Standard Time)

Dear Paul,

The issue is probably due to two things: numba itself (which I cannot do anything about) and a forced conversion of the string ids to numba.types.unicode_type, which I introduced to avoid errors when I implemented the fusion algorithms.

I have tested a snippet similar to yours (without keeping the Python dict in memory), with and without the conversion.
Without the conversion, memory usage went down from 2.42 GB to 1.59 GB (including the ranx import).
For comparison, the Python dict alone is around 1.15 GB.
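
As a rough cross-check of the dict-only figure, the sketch below uses the standard-library `tracemalloc` module to measure a smaller dict of the same shape, which can then be scaled up. This is not necessarily how the numbers above were obtained, and `traced_size_mb` is a hypothetical helper name. Note that `tracemalloc` only tracks allocations made through Python's own allocator, so it may not account for memory held by numba-backed structures such as Run; for those, a process-level measurement is the safer choice.

```python
import tracemalloc


def traced_size_mb(build):
    """Call build() and return (object, MB still allocated once it is built)."""
    tracemalloc.start()
    try:
        obj = build()
        # get_traced_memory() returns (current, peak) in bytes.
        current, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return obj, current / (1024 * 1024)


def small_run_dict(n_queries=1_000, n_docs=100):
    """Scaled-down dict with the same layout as in the report."""
    return {str(i): {str(j): 0.0 for j in range(n_docs)} for i in range(n_queries)}
```

For example, `_, mb = traced_size_mb(small_run_dict)` gives the footprint for 1,000 queries; multiplying by 100 gives an estimate for 100,000 queries, assuming the size grows roughly linearly with the number of queries.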

I'll try to remove the forced conversion without breaking the fusion algorithms and get back to you.

Thanks for pointing it out!

Elias Bassani · Answer 2 · Wed Jul 19 2023 01:11:37 GMT+0800 (China Standard Time)

Fixed in v0.3.15.