High memory usage with multiple threads
guarin opened this issue
Hi, thanks for the great library!
I noticed that memory keeps growing when using multiple threads, and I am not sure I fully understand why that is happening.
I used the following code for testing:
```python
from diskcache import Cache
import tempfile
import psutil
import os
from multiprocessing.pool import ThreadPool

proc = psutil.Process(os.getpid())

with tempfile.TemporaryDirectory() as tempdir:
    cache = Cache(
        directory=tempdir,
        size_limit=10 * (1024**3),
    )

    @cache.memoize()
    def fun(key: int):
        return [key] * 1000

    # Create keys and repeat them a few times
    keys = list(range(10_000)) * 10

    max_memory = 0
    with ThreadPool(processes=10) as pool:
        # Iterate over the keys and call the function
        for i, _ in enumerate(pool.imap(fun, keys)):
            if i % 10 == 0:
                # Log memory usage
                memory = proc.memory_info().rss / (1024**3)
                max_memory = max(memory, max_memory)
                volume = cache.volume() / (1024**3)
                print(
                    f"{i:>6} mem={memory:>6.2f} GB maxmem={max_memory:>6.2f} GB cache={volume:>6.2f} GB"
                )
```
With these settings, memory usage peaks at 2.15 GB even though the cache itself stays very small:
```
# output
99990 mem=  0.63 GB maxmem=  2.15 GB cache=  0.04 GB
```
Using a single thread requires only 0.13 GB:
```
# output
99990 mem=  0.11 GB maxmem=  0.13 GB cache=  0.04 GB
```
Is this due to keys/values being cached in memory? And is there a way to limit the amount of used memory?
Keys/values are not cached in memory.
Each thread opens its own connection to the cache (a SQLite connection). These connections memory-map chunks of the underlying database and track a fair amount of other state. Ten threads appear to use roughly 10x the memory of a single thread, so I would guess that is the cause. 🤷‍♂️
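For illustration, here is a minimal `sqlite3` sketch of the per-connection settings involved (the file name and values are made up; diskcache configures these pragmas itself):

```python
import sqlite3

# Each sqlite3 connection gets its own memory map and page cache,
# so N thread-local connections can hold roughly N times the memory.
con = sqlite3.connect("example.db")
con.execute("PRAGMA mmap_size = %d" % 2**26)  # 64 MB memory map for this connection
con.execute("PRAGMA cache_size = -8192")      # ~8 MB page cache (negative = KiB)
```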
Python has memory profilers; you might try one to narrow down where the memory is being held. I don't believe the cache `memoize()` method holds any long references to memory.
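For example, the standard library's `tracemalloc` can attribute allocations to source lines (a minimal sketch, not specific to diskcache):

```python
import tracemalloc

tracemalloc.start(10)  # record up to 10 stack frames per allocation

# ... run the cache workload here ...

snapshot = tracemalloc.take_snapshot()
# Print the ten source lines responsible for the most allocated memory
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```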
If you want to memory-map less, then check the settings.
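DiskCache exposes the SQLite pragmas as `sqlite_*` keyword arguments on the `Cache` constructor; a sketch of lowering the per-connection memory map and page cache (the directory and values below are arbitrary examples, not recommendations):

```python
from diskcache import Cache

cache = Cache(
    directory="/tmp/example-cache",  # example path
    sqlite_mmap_size=2**24,          # 16 MB memory map per connection
    sqlite_cache_size=2**10,         # 1024 pages of page cache per connection
)
```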
Thanks for the pointers!
I'll try the different SQLite settings. One thing I noticed is that calling `cache.volume()` too often can result in a considerable memory increase.
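A simple workaround is to rate-limit the volume check instead of calling it inside a tight loop (a sketch; the helper name and interval are made up):

```python
import time

_last_volume_check = 0.0

def log_volume_throttled(cache, interval=5.0):
    """Hypothetical helper: query cache.volume() at most once per `interval` seconds."""
    global _last_volume_check
    now = time.monotonic()
    if now - _last_volume_check >= interval:
        _last_volume_check = now
        print(f"cache={cache.volume() / (1024**3):6.2f} GB")
```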
Memory profiling shows that most memory comes from `pickle.load`, `sql(update % update_column.format(now=now), (rowid,))`, and `rows = self._sql(select, (db_key, raw, time.time())).fetchall()`. I'll investigate some more :)
The multiple DB connections were indeed the reason for the high memory consumption. Reducing `sqlite_cache_size` and `sqlite_mmap_size` helped reduce memory 👍🏼