Cachy can't handle concurrent access, should use atomic cache creation
mjpieters opened this issue · comments
We use nitpick, a project that lints linter and formatter configurations (a meta-linter, if you will). This project in turn uses cachy to handle caching of remote style definitions.
Because the cache is shared among multiple CI jobs, we occasionally see issues with cache files not being found:
File "/.../lib/python3.7/site-packages/cachy/repository.py", line 47, in get
val = self._store.get(key)
File "/.../lib/python3.7/site-packages/cachy/stores/file_store.py", line 43, in get
return self._get_payload(key).get('data')
File "/.../lib/python3.7/site-packages/cachy/stores/file_store.py", line 62, in _get_payload
with open(path, 'rb') as fh:
FileNotFoundError: [Errno 2] No such file or directory: '/.../.cache/nitpick/e1/dd/73/42/06/f6/97/29/e1dd734206f69729420b28a80c585edda296012ef36e3ac981749fa076c41975'
because another process has unlinked the file between the os.path.exists() test and the open() call, or with the cache file being empty:
File "/.../lib/python3.7/site-packages/cachy/repository.py", line 47, in get
val = self._store.get(key)
File "/.../lib/python3.7/site-packages/cachy/stores/file_store.py", line 43, in get
return self._get_payload(key).get('data')
File "/.../lib/python3.7/site-packages/cachy/stores/file_store.py", line 65, in _get_payload
expire = int(contents[:10])
ValueError: invalid literal for int() with base 10: b''
contents is the data read from the cache file, and contents[:10] is an empty bytes string.
cachy should at the very least be resilient to cache files going missing or being empty here, and not fail with an exception.
However, these errors only happen because another process is concurrently writing to the cache file (in cachy.stores.file_store.FileStore.put()). If cachy wrote the cache data to a temporary file and then moved that file into place, the operation would be atomic (provided the temporary file and the cache store location are on the same filesystem), and ._get_payload() would never have to deal with an empty or partially written file.
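The write-temp-then-rename approach could look like the following sketch. This is not a patch against cachy; atomic_put is a hypothetical helper, but os.replace is guaranteed atomic when source and destination are on the same filesystem:

```python
import os
import tempfile


def atomic_put(path, contents):
    """Write cache contents to a temporary file in the destination
    directory, then atomically rename it into place. Readers see either
    the old file or the complete new file, never a partial write."""
    directory = os.path.dirname(path)
    os.makedirs(directory, exist_ok=True)
    # mkstemp in the same directory keeps the rename on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(contents)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```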
It probably should use some kind of locking mechanism to coordinate between multiple processes trying to write to the cache.
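If coordination beyond the atomic rename is wanted, one option is an exclusive advisory lock on a sidecar file. A POSIX-only sketch (locked_update is a hypothetical helper; fcntl is unavailable on Windows, which would need msvcrt.locking or a portable library instead):

```python
import fcntl  # POSIX-only advisory file locking
import os


def locked_update(path, new_contents):
    """Serialise writers on a sidecar lock file so that two processes
    never write the same cache entry at the same time."""
    lock_path = path + ".lock"
    with open(lock_path, "wb") as lock_fh:
        fcntl.flock(lock_fh, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            with open(path, "wb") as fh:
                fh.write(new_contents)
        finally:
            fcntl.flock(lock_fh, fcntl.LOCK_UN)
```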
The following script forces the issue by using multiple concurrent subprocesses to write to the cache:
#!/usr/bin/env python3
"""Cachy file store stress test
Usage: cachy_stresstest [process-count]
Starts process-count subprocesses (defaults to 5), all storing a random blob of
data on cache miss, all with the expiration set to half a second. Prints out
worker names (capital ASCII letters), in green if a cached object was
successfully fetched or updated, red if an exception occurred. Processes exit
after 1 minute.
Exceptions are printed out as they happen. Concurrent writing to the terminal
_may_ cause colours to be mixed up.
"""
import random
import string
import subprocess
import sys
import time
from tempfile import TemporaryDirectory
from typing import NoReturn
from cachy import CacheManager
size = 51200  # 50 KiB, larger than io.DEFAULT_BUFFER_SIZE
# Cachy accepts a value in minutes but doesn't prohibit float values
expiration = 0.5 / 60  # half a second, expressed in minutes
def worker(name: str, cache_dir: str) -> NoReturn:
store = CacheManager(
{"stores": {"file": {"driver": "file", "path": cache_dir}}}
).store()
blob = bytes(random.choices(range(256), k=size))
key = "cache_key"
while True:
cached, status = None, 92 # bright green
try:
cached = store.get(key)
except Exception as e:
status = 91 # bright red
print(f"\n\x1b[1m{name}: {e.__class__.__name__}: {e}\x1b[m", flush=True)
if cached is None:
store.put(key, blob, expiration)
print(f"\x1b[{status}m{name}\x1b[m", end="", flush=True)
time.sleep(random.uniform(0.01, 0.25))
def main(count):
print(f"Starting {count} processes to stress-test cache access")
with TemporaryDirectory() as tempdir:
procs = [
subprocess.Popen([sys.executable, __file__, "worker", name, tempdir])
for name in string.ascii_uppercase[:count]
]
try:
time.sleep(60)
except KeyboardInterrupt:
pass
for proc in procs:
proc.kill()
if __name__ == "__main__":
if len(sys.argv) > 3 and sys.argv[1] == "worker":
worker(*sys.argv[2:4])
main(int(sys.argv[1]) if len(sys.argv) > 1 else 5)
It'll hit both issues quite quickly, and more so if you use more than the default 5 processes.