grantjenks / python-diskcache

Python disk-backed cache (Django-compatible). Faster than Redis and Memcached. Pure-Python.

Home Page: http://www.grantjenks.com/docs/diskcache/


High memory usage with multiple threads

guarin opened this issue · comments

commented

Hi, thanks for the great library!

I noticed that memory keeps growing when using multiple threads, and I'm not sure I fully understand why that is happening.

I used the following code for testing:

from diskcache import Cache
import tempfile
import psutil
import os
from multiprocessing.pool import ThreadPool

proc = psutil.Process(os.getpid())

with tempfile.TemporaryDirectory() as tempdir:
    cache = Cache(
        directory=tempdir,
        size_limit=10 * (1024**3),
    )

    @cache.memoize()
    def fun(key: int):
        return [key] * 1000

    # Create keys and repeat them a few times
    keys = list(range(10_000)) * 10
    max_memory = 0
    with ThreadPool(processes=10) as pool:
        # Iterate over the keys and call the function
        for i, _ in enumerate(pool.imap(fun, keys)):
            if i % 10 == 0:
                # Log memory usage
                memory = proc.memory_info().rss / (1024**3)
                max_memory = max(memory, max_memory)
                volume = cache.volume() / (1024**3)
                print(
                    f"{i:>6} mem={memory:>6.2f} GB maxmem={max_memory:>6.2f} GB cache={volume:>6.2f} GB"
                )

With these settings, peak memory usage reaches 2.15 GB even though the cache itself stays very small:

# output
 99990 mem=  0.63 GB maxmem=  2.15 GB cache=  0.04 GB

Using a single thread requires only 0.13 GB:

# output
 99990 mem=  0.11 GB maxmem=  0.13 GB cache=  0.04 GB

Is this due to keys/values being cached in memory? And is there a way to limit the amount of used memory?

Keys/values are not cached in memory.

Each thread opens its own connection to the cache (a SQLite connection). These connections memory-map chunks of the underlying database and track a lot of other state. The ten threads seem to use 10x the memory of a single thread, so I would guess it is that. 🤷‍♂️
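As a back-of-envelope check (assuming diskcache's documented defaults of sqlite_mmap_size=2**26 and sqlite_cache_size=2**13 pages, plus SQLite's usual 4 KiB page size), the per-connection ceiling multiplies out roughly like this:

# Rough upper bound on SQLite memory per connection; the defaults and
# page size here are assumptions, not measured values.
MMAP_SIZE = 2**26        # sqlite_mmap_size default: 64 MiB memory-mapped
CACHE_PAGES = 2**13      # sqlite_cache_size default: 8,192 pages
PAGE_SIZE = 4096         # typical SQLite page size

per_connection = MMAP_SIZE + CACHE_PAGES * PAGE_SIZE
print(f"per connection: {per_connection / 1024**2:.0f} MiB")       # ~96 MiB
print(f"10 connections: {10 * per_connection / 1024**3:.2f} GiB")  # ~0.94 GiB

That is in the same ballpark as the gap between the single-thread and ten-thread peaks reported above.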

Python has memory profilers. You might try that to narrow down where the memory is being held. I don’t believe the cache memoize() method holds any long references to memory.
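For example, the standard-library tracemalloc module can report which call sites hold the most memory (a minimal sketch to wrap around the test loop above):

# Snapshot Python-level allocations after the workload and print the
# top allocation sites. Standard library only.
import tracemalloc

tracemalloc.start()
# ... run the ThreadPool workload from the original report here ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)

Note that tracemalloc only sees allocations made through Python's allocator, so memory held by SQLite itself (mmap regions, page cache) will not show up there.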

If you want to memory-map less, check the settings.
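For reference, diskcache forwards sqlite_-prefixed settings to the underlying connection, so both knobs can be lowered when constructing the cache (a sketch; the values below are illustrative, not recommendations):

from diskcache import Cache

# Shrink SQLite's per-connection memory: map at most 8 MiB of the
# database file and cache at most 1,024 pages (illustrative values).
cache = Cache(
    directory="/tmp/example-cache",
    size_limit=10 * (1024**3),
    sqlite_mmap_size=2**23,   # default is 2**26 (64 MiB)
    sqlite_cache_size=2**10,  # default is 2**13 pages
)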

commented

Thanks for the pointers!

I'll try the different SQLite settings. One thing I noticed is that calling cache.volume() too often can result in a considerable memory increase.

Memory profiling shows that most memory comes from pickle.load, sql(update % update_column.format(now=now), (rowid,)) and rows = self._sql(select, (db_key, raw, time.time())).fetchall(). I'll investigate some more :)

commented

The multiple DB connections were indeed the reason for the high memory consumption. Reducing sqlite_cache_size and sqlite_mmap_size helped reduce memory 👍🏼