grantjenks / python-diskcache

Python disk-backed cache (Django-compatible). Faster than Redis and Memcached. Pure-Python.

Home Page: http://www.grantjenks.com/docs/diskcache/

Cache.get() occasionally returns None when writing from multiple threads

avin3sh opened this issue

I have been facing an issue where both Cache and FanoutCache randomly return None (as if the entry had been evicted) when writing from multiple threads and reading at the same time.

Here is a basic reproducer where four threads are writing to the cache and one thread is reading from it:

from time import sleep
from threading import Thread
from queue import Queue
from diskcache import Cache

CACHE_DIR = "/home/avinesh2/diskcache-test/cache"
STATUS_CACHE_KEY = "status_cache"
IS_POPULATED = False


def clear_cache() -> None:
    cache = Cache(CACHE_DIR)
    cache.clear()
    cache.close()
    print("Cleared cache")


def get_status_cache() -> dict:
    global IS_POPULATED
    try:
        cache = Cache(CACHE_DIR)

        result = cache.get(STATUS_CACHE_KEY, retry=True)

        if result is None:
            # if cache does not exist, create one
            if IS_POPULATED is False:
                print("Populating status cache")
                cache.set(STATUS_CACHE_KEY, {}, retry=True)
                IS_POPULATED = True
                result = {}
            else:
                raise Exception("Cache is none, unexpected!")

        cache.close()

        return result
    except Exception as ex:
        print(f"Failed to retrieve status cache: {ex}")
        raise


def update_status_cache_entry(status_cache) -> None:
    try:
        cache = Cache(CACHE_DIR)

        all_data = get_status_cache()
        all_data[status_cache.get("username")] = status_cache

        cache[STATUS_CACHE_KEY] = all_data

        cache.close()
    except Exception as ex:
        print(
            f"Failed to update status cache for user {status_cache.get('username')}: {ex}"
        )
        raise


def write_worker(write_queue: Queue):
    """
    Update cache
    """
    while True:
        status = write_queue.get()
        print(f"Writing {status.get('username')}")
        try:
            update_status_cache_entry(status_cache=status)
        except Exception as ex:
            print(
                f"An exception encounted when processing write queue for ({status.get('username')}): {ex}"
            )
            raise

        write_queue.task_done()


def read_worker(read_queue: Queue):
    """
    Read cache
    """
    while True:
        username = (read_queue.get()).get("username")
        print("Reading")
        try:
            get_status_cache()
        except Exception as ex:
            print(
                f"An exception encounted when processing read queue for ({username}): {ex}"
            )
            raise

        read_queue.task_done()


if __name__ == "__main__":
    # Making sure we start from clean state
    clear_cache()

    # Hydrating our queues
    write_q = Queue()
    read_q = Queue()
    for i in range(10000):
        data = {"username": f"user{i}", "status": "active"}
        write_q.put(data)
        read_q.put(data)
    print("Hydrated queues")

    # Start writer workers
    for _ in range(4):
        wo = Thread(target=write_worker, args=(write_q,))
        wo.daemon = True
        wo.start()
    print("Started write workers")

    sleep(2)

    # Start reader worker (only one)
    for _ in range(1):
        wo3 = Thread(target=read_worker, args=(read_q,))
        wo3.daemon = True
        wo3.start()
    print("Started read workers")

    write_q.join()
    print("Writes done")
    read_q.join()
    print("reads done")

When you run the script, you will see output like this:

...
Writing user409
Writing user410
Reading
Writing user411
Writing user412
Reading
Writing user413
Reading
...

However, after some time (in my case after around 1,000 entries have been written by the writer threads), you will get this exception, which is raised when cache.get() returns None:

...
Writing user1011
Reading
Writing user1012
Failed to retrieve status cache: Cache is none, unexpected!
Failed to update status cache for user user1012: Cache is none, unexpected!
An exception encountered when processing write queue for (user1012): Cache is none, unexpected!
Exception in thread Thread-3 (write_worker):
Traceback (most recent call last):
  File "/usr/lib64/python3.10/threading.py", line 1009, in _bootstrap_inner
Reading
    self.run()
  File "/usr/lib64/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
Writing user1013
  File "/home/avinesh2/diskcache-test/test.py", line 74, in write_worker
    update_status_cache_entry(status_cache=status)
  File "/home/avinesh2/diskcache-test/test.py", line 55, in update_status_cache_entry
    all_data = get_status_cache()
  File "/home/avinesh2/diskcache-test/test.py", line 41, in get_status_cache
    raise Exception("Cache is none, unexpected!")
Exception: Cache is none, unexpected!
...

My original understanding was that I should use FanoutCache in such scenarios; however, this happens even when you replace all instances of Cache with FanoutCache. The code is intentionally not optimized, to better demonstrate the possible problem.

The interesting part is that the data isn't lost: if you retry after getting a None response, you get the entire value back.
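
A minimal sketch of that observation (the get_with_retry helper and attempt count are hypothetical, not part of the script above):

from diskcache import Cache


def get_with_retry(cache: Cache, key, attempts: int = 5):
    # Retry a get() that may transiently return None under concurrent writes.
    for _ in range(attempts):
        value = cache.get(key, retry=True)
        if value is not None:
            return value
    return None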

Reduced repro:

from threading import Thread
from queue import Queue
from diskcache import Cache

ERROR = False
CACHE_DIR = "/tmp/cache"
STATUS_CACHE_KEY = "status_cache"


def update_status_cache_entry(status_cache) -> None:
    try:
        all_data = cache.get(STATUS_CACHE_KEY)
        all_data[status_cache.get("username")] = status_cache
        success = cache.set(STATUS_CACHE_KEY, all_data, retry=True)
        if not success:
            raise Exception('Failed to set status key!')
    except Exception as ex:
        print(f"Failed to update status cache for {status_cache}: {ex}")
        raise


def write_worker(write_queue: Queue):
    """
    Update cache
    """
    global ERROR
    while not ERROR and not write_queue.empty():
        status = write_queue.get()
        print(f"Writing {status}")
        try:
            update_status_cache_entry(status_cache=status)
        except Exception as ex:
            ERROR = True
            raise
        write_queue.task_done()


if __name__ == "__main__":
    # Note: disk_min_file_size=2**32 is the workaround discussed below; with
    # the default setting, the growing value spills to a file and the race occurs.
    cache = Cache(CACHE_DIR, disk_min_file_size=2**32)
    cache.clear()
    cache[STATUS_CACHE_KEY] = {}

    # Hydrating our queues
    write_q = Queue()

    for i in range(10000):
        data = {"username": f"user{i}", "status": "active"}
        write_q.put(data)

    writers = []

    for _ in range(4):
        writer = Thread(target=write_worker, args=(write_q,))
        writer.start()
        writers.append(writer)

    for writer in writers:
        writer.join()

    if not write_q.empty():
        print('ERROR!')

    cache.close()

The issue here is with the value that is written back to the cache. In the repro, the cache value is getting larger and larger. After it crosses a threshold, the cache value (when serialized to bytes with Pickle) is spilled to a file. Once the value is spilled to disk, the update process becomes: read value from file, delete file, write value to new file. The update process now creates a race condition between threads as one thread may attempt to read a file that has just been deleted by another thread. When the file cannot be read, the get() operation returns None.

One workaround is to force the value to be stored in the cache by increasing the disk_min_file_size like cache = Cache(CACHE_DIR, disk_min_file_size=2**32).
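
A minimal sketch of that workaround (the /tmp/cache path and sample value are just for illustration); raising disk_min_file_size to 2**32 keeps pickled values inside the SQLite database, so updates never go through the file read/delete/write path described above:

from diskcache import Cache

# With a 4 GiB threshold, values of any realistic size stay in SQLite rather
# than being spilled to separate files on disk.
cache = Cache("/tmp/cache", disk_min_file_size=2**32)
cache.set("status_cache", {"user0": {"status": "active"}}, retry=True)
print(cache.get("status_cache", retry=True))
cache.close()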

Does the repro really work the way you want? The code repeatedly reads a key/value from the cache, makes the value a bit larger, and then writes the key/value back to the cache. Since the value is itself a dictionary, it would be much more efficient to break the dictionary into individual items stored directly in the cache. Something like:

from diskcache import Cache

# One cache entry per user instead of a single large dict value.
statuses = Cache('/tmp/cache/status')

usernames = (f'user{i}' for i in range(10000))  # e.g., the users from the repro
for username in usernames:
    statuses[username] = 'active'

Also notice that in the reduced repro the cache is opened once and closed once. That's a lot more efficient! All threads can share the same cache reference.
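
For example, a single module-level instance can serve every worker; a rough sketch (the worker body is illustrative only):

from threading import Thread

from diskcache import Cache

cache = Cache("/tmp/cache")  # opened once, shared by all threads


def worker(n: int) -> None:
    # Cache objects are thread-safe, so no per-thread instance is needed
    # for simple set/get calls.
    cache.set(f"user{n}", "active")


threads = [Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
cache.close()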

Once the value is spilled to disk, the update process becomes: read value from file, delete file, write value to new file. The update process now creates a race condition between threads as one thread may attempt to read a file that has just been deleted by another thread. When the file cannot be read, the get() operation returns None.

This makes sense. Thank you for explaining this.

Overriding the disk_min_file_size constant did help in my case; I don't run into this issue anymore.

I agree the dict in my repro isn't the best approach since it can be done more efficiently, but there are legitimate cases of reading a value, adding something to it, and writing it back. There is not a lot in the docs about disk_min_file_size; do you want me to create a PR to add a note to the docs so that this behavior is more obvious?

there are legitimate cases of reading the value, adding something to it and writing it back

Sure, I can imagine wanting to do that, but the read/edit/write steps are not atomic. If they were wrapped in a transaction, then the sequence would be atomic, like:

with cache.transact():
    value = cache.get('foo')
    value += 1
    cache.set('foo', value)

Then, regardless of the value's size, it would be atomic. Note though that if the value gets really large and reading/writing take a long time, then performance will be poor (but it will still be atomic/correct). Atomic & fast is difficult but can sometimes be done, possibly with a change to the data type.
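
For instance, for the integer counter in the snippet above, diskcache's incr() already provides an atomic increment without an explicit transaction (a sketch; the key and values are arbitrary):

from diskcache import Cache

cache = Cache("/tmp/cache")
cache.set("foo", 0)
cache.incr("foo")        # atomic, returns 1
cache.incr("foo", 10)    # atomic, returns 11
cache.close()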

There is not a lot in the docs about this disk_min_file_size, do you want me to create a PR to include some kind of the note in the doc so that this behavior is more obvious to someone ?

It's in the Tutorial under Settings: https://grantjenks.com/docs/diskcache/tutorial.html#settings If you want to add more details to the disk_min_file_size description then that's fine.

And there's a whole section on Transactions: https://grantjenks.com/docs/diskcache/tutorial.html#transactions