Cache.get() occasionally returns None when writing from multiple threads
avin3sh opened this issue · comments
I have been facing an issue where Cache and FanoutCache both randomly return None / appear to have been evicted when writing through multiple threads and reading at the same time.
Here is a basic reproducer where four threads are writing to the Cache and one thread is reading from it:
from time import sleep
from threading import Thread
from queue import Queue
from diskcache import Cache

CACHE_DIR = "/home/avinesh2/diskcache-test/cache"
STATUS_CACHE_KEY = "status_cache"
IS_POPULATED = False


def clear_cache() -> None:
    cache = Cache(CACHE_DIR)
    cache.clear()
    cache.close()
    print("Cleared cache")


def get_status_cache() -> dict:
    global IS_POPULATED
    try:
        cache = Cache(CACHE_DIR)
        result = cache.get(STATUS_CACHE_KEY, retry=True)
        if result is None:
            # if cache does not exist, create one
            if IS_POPULATED is False:
                print("Populating status cache")
                cache.set(STATUS_CACHE_KEY, {}, retry=True)
                IS_POPULATED = True
                result = {}
            else:
                raise Exception("Cache is none, unexpected!")
        cache.close()
        return result
    except Exception as ex:
        print(f"Failed to retrieve status cache: {ex}")
        raise


def update_status_cache_entry(status_cache) -> None:
    try:
        cache = Cache(CACHE_DIR)
        all_data = get_status_cache()
        all_data[status_cache.get("username")] = status_cache
        cache[STATUS_CACHE_KEY] = all_data
        cache.close()
    except Exception as ex:
        print(
            f"Failed to update status cache for user {status_cache.get('username')}: {ex}"
        )
        raise


def write_worker(write_queue: Queue):
    """
    Update cache
    """
    while True:
        status = write_queue.get()
        print(f"Writing {status.get('username')}")
        try:
            update_status_cache_entry(status_cache=status)
        except Exception as ex:
            print(
                f"An exception was encountered when processing the write queue for ({status.get('username')}): {ex}"
            )
            raise
        write_queue.task_done()


def read_worker(read_queue: Queue):
    """
    Read cache
    """
    while True:
        username = (read_queue.get()).get("username")
        print("Reading")
        try:
            get_status_cache()
        except Exception as ex:
            print(
                f"An exception was encountered when processing the read queue for ({username}): {ex}"
            )
            raise
        read_queue.task_done()


if __name__ == "__main__":
    # Making sure we start from a clean state
    clear_cache()

    # Hydrating our queues
    write_q = Queue()
    read_q = Queue()
    for i in range(10000):
        data = {"username": f"user{i}", "status": "active"}
        write_q.put(data)
        read_q.put(data)
    print("Hydrated queues")

    # Start writer workers
    for _ in range(4):
        wo = Thread(target=write_worker, args=(write_q,), daemon=True)
        wo.start()
    print("Started write workers")
    sleep(2)

    # Start reader worker (only one)
    for _ in range(1):
        wo3 = Thread(target=read_worker, args=(read_q,), daemon=True)
        wo3.start()
    print("Started read workers")

    write_q.join()
    print("Writes done")
    read_q.join()
    print("Reads done")
When you run the script, you will see output like this:
...
Writing user409
Writing user410
Reading
Writing user411
Writing user412
Reading
Writing user413
Reading
...
However, after some time (in my case, after around 1000 entries have been written by the writer threads), you will get the following exception, which is raised when cache.get() returns None:
...
Writing user1011
Reading
Writing user1012
Failed to retrieve status cache: Cache is none, unexpected!
Failed to update status cache for user user1012: Cache is none, unexpected!
An exception was encountered when processing the write queue for (user1012): Cache is none, unexpected!
Exception in thread Thread-3 (write_worker):
Traceback (most recent call last):
File "/usr/lib64/python3.10/threading.py", line 1009, in _bootstrap_inner
Reading
self.run()
File "/usr/lib64/python3.10/threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
Writing user1013
File "/home/avinesh2/diskcache-test/test.py", line 74, in write_worker
update_status_cache_entry(status_cache=status)
File "/home/avinesh2/diskcache-test/test.py", line 55, in update_status_cache_entry
all_data = get_status_cache()
File "/home/avinesh2/diskcache-test/test.py", line 41, in get_status_cache
raise Exception("Cache is none, unexpected!")
Exception: Cache is none, unexpected!
...
My original understanding was that I should use FanoutCache in such scenarios; however, this happens even when you replace all instances of Cache with FanoutCache. The code is intentionally not optimized, so as to demonstrate the possible problem.
The interesting part is that the data isn't lost: if you retry after getting a None response, you get the entire data back.
Reduced repro:
from threading import Thread
from queue import Queue
from diskcache import Cache

ERROR = False
CACHE_DIR = "/tmp/cache"
STATUS_CACHE_KEY = "status_cache"


def update_status_cache_entry(status_cache) -> None:
    try:
        all_data = cache.get(STATUS_CACHE_KEY)
        all_data[status_cache.get("username")] = status_cache
        success = cache.set(STATUS_CACHE_KEY, all_data, retry=True)
        if not success:
            raise Exception('Failed to set status key!')
    except Exception as ex:
        print(f"Failed to update status cache for {status_cache}: {ex}")
        raise


def write_worker(write_queue: Queue):
    """
    Update cache
    """
    global ERROR
    while not ERROR and not write_queue.empty():
        status = write_queue.get()
        print(f"Writing {status}")
        try:
            update_status_cache_entry(status_cache=status)
        except Exception:
            ERROR = True
            raise
        write_queue.task_done()


if __name__ == "__main__":
    cache = Cache(CACHE_DIR, disk_min_file_size=2**32)
    cache.clear()
    cache[STATUS_CACHE_KEY] = {}

    # Hydrating our queues
    write_q = Queue()
    for i in range(10000):
        data = {"username": f"user{i}", "status": "active"}
        write_q.put(data)

    writers = []
    for _ in range(4):
        writer = Thread(target=write_worker, args=(write_q,))
        writer.start()
        writers.append(writer)
    for writer in writers:
        writer.join()

    if not write_q.empty():
        print('ERROR!')
    cache.close()
The issue here is with the value that is written back to the cache. In the repro, the cache value is getting larger and larger. After it crosses a threshold, the cache value (when serialized to bytes with Pickle) is spilled to a file. Once the value is spilled to disk, the update process becomes: read value from file, delete file, write value to new file. The update process now creates a race condition between threads, as one thread may attempt to read a file that has just been deleted by another thread. When the file cannot be read, the get() operation returns None.
One workaround is to force the value to be stored inline in the cache database (rather than spilled to a file) by increasing disk_min_file_size, like cache = Cache(CACHE_DIR, disk_min_file_size=2**32).
Does the repro really work the way you want? The code repeatedly reads a key/value from the cache, makes the value a bit larger, and then writes the key/value back to the cache. Since the value is itself a dictionary, it would be much more efficient to break the dictionary into individual items stored directly in the cache. Something like:
statuses = Cache('/tmp/cache/status')
for username in usernames:
    statuses[username] = 'active'
Also notice in the reduced repro -- the cache is opened once and closed once. That's a lot more efficient! All threads can share the same cache reference.
Once the value is spilled to disk, the update process becomes: read value from file, delete file, write value to new file. The update process now creates a race condition between threads as one thread may attempt to read a file that has just been deleted by another thread. When the file cannot be read, the get() operation returns None.
This makes sense. Thank you for explaining this.
Overriding the min_file_size disk constant did help in my case; I don't run into this issue anymore.
I agree the dict in my repro isn't the best, since it can be done more efficiently anyway, but there are legitimate cases of reading the value, adding something to it, and writing it back. There is not a lot in the docs about disk_min_file_size; do you want me to create a PR to add some kind of note to the docs so that this behavior is more obvious?
there are legitimate cases of reading the value, adding something to it and writing it back
Sure, I can imagine wanting to do that, but the read/edit/write steps are not atomic. If they were wrapped in a transaction, then they would be atomic, like:
with cache.transact():
    value = cache.get('foo')
    value += 1
    cache.set('foo', value)
Then, regardless of the value's size, it would be atomic. Note though that if the value gets really large and reading/writing take a long time, then performance will be poor (but it will still be atomic/correct). Atomic & fast is difficult but can sometimes be done, possibly with a change to the data type.
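As one example of trading the data type for atomic-and-fast updates, counters can use Cache.incr, which is atomic on its own, so no explicit transaction is needed. A hedged sketch (the temporary directory and key name are demo assumptions):

```python
# Sketch: Cache.incr performs an atomic increment, so concurrent
# bumps need no surrounding transaction.
import tempfile
from threading import Thread

from diskcache import Cache

cache = Cache(tempfile.mkdtemp())  # temp dir is an assumption for this demo
cache.set("hits", 0)


def bump():
    for _ in range(1000):
        cache.incr("hits", retry=True)  # atomic: no read/edit/write race


threads = [Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(cache.get("hits"))  # 4000: no updates lost
cache.close()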
There is not a lot in the docs about disk_min_file_size; do you want me to create a PR to add some kind of note to the docs so that this behavior is more obvious?
It's in the Tutorial under Settings: https://grantjenks.com/docs/diskcache/tutorial.html#settings
If you want to add more details to the disk_min_file_size description, then that's fine.
And there's a whole section on Transactions: https://grantjenks.com/docs/diskcache/tutorial.html#transactions