CAVaccineInventory / vial

The Django application powering calltheshots.us

Home Page: https://vial.calltheshots.us

api.vaccinatethestates.com data inconsistent while the export is running

simonw opened this issue

I spotted a nasty edge-case here: while the script was running, curl https://api.vaccinatethestates.com/v0/locations.json returned an empty JSON file, and it continued to do so after the script had completed, presumably due to caching. It returned the full file if I added a ? to the end: curl https://api.vaccinatethestates.com/v0/locations.json?

Having done that, a hit to the regular URL does now return the 30MB JSON file as expected.

Originally posted by @simonw in #658 (comment)

Here's the code that writes to the bucket:

    @beeline.traced(name="core.exporter.storage.GoogleStorageWriter.write")
    def write(self, path: str, content_stream: Iterator[bytes]) -> None:
        if self.prefix:
            path = self.prefix + "/" + path
        blob = self.get_bucket().blob(path)
        blob.cache_control = "public,max-age=120"
        blob.content_encoding = "gzip"
        with blob.open("wb") as f:
            with gzip.open(f, "w") as gzip_f:
                for chunk in content_stream:
                    gzip_f.write(chunk)

If this is the cause, it should be affecting our other Google Cloud Storage exports too.

And here's that .write() method:

    @cache
    @beeline.traced(name="core.exporter.storage.GoogleStorageWriter.get_bucket")
    def get_bucket(self) -> storage.Bucket:
        storage_client = storage.Client()
        return storage_client.bucket(self.bucket_name)

    @beeline.traced(name="core.exporter.storage.GoogleStorageWriter.write")
    def write(self, path: str, content_stream: Iterator[bytes]) -> None:
        if self.prefix:
            path = self.prefix + "/" + path
        blob = self.get_bucket().blob(path)
        blob.cache_control = "public,max-age=120"
        blob.content_encoding = "gzip"
        with blob.open("wb") as f:
            with gzip.open(f, "w") as gzip_f:
                for chunk in content_stream:
                    gzip_f.write(chunk)

Here's the API documentation for the Google Storage Python library method that it uses: https://googleapis.dev/python/storage/latest/blobs.html#google.cloud.storage.blob.Blob.open

Looking at the code for the Google Python client library:

https://github.com/googleapis/python-storage/blob/78b2eba81003b437cd24f2b8d269ea2455682507/google/cloud/storage/fileio.py#L287-L319

When the file-like object is closed this code runs:

    def close(self):
        self._checkClosed()  # Raises ValueError if closed.
        self._upload_chunks_from_buffer(1)
        self._buffer.close()

self._buffer is an io.BytesIO() - so it really does look like there's no attempt to say "Hey Google, this upload is now complete, you should make it visible" - which reinforces my hunch that the data becomes partially available as the upload runs.

Maybe the blob.upload_from_file() method (which accepts a file-like object) would handle this atomic-ness of updates for us? https://github.com/googleapis/python-storage/blob/78b2eba81003b437cd24f2b8d269ea2455682507/google/cloud/storage/blob.py#L2175
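
A minimal sketch of what that could look like, assuming the gzipped export is small enough to spool to a temporary file first. The function name write_via_upload_from_file is hypothetical, and blob is expected to be a google.cloud.storage Blob (it is left untyped here so the snippet has no third-party imports). This has not been run against a real bucket, and whether it changes when the new object becomes visible is still an open question:

```python
import gzip
import tempfile
from typing import Iterable


def write_via_upload_from_file(blob, content_stream: Iterable[bytes]) -> None:
    """Gzip the whole stream locally, then hand it over in a single call.

    Hypothetical helper: `blob` is assumed to be a google.cloud.storage Blob.
    """
    # Same metadata as the existing write() method.
    blob.cache_control = "public,max-age=120"
    blob.content_encoding = "gzip"
    with tempfile.TemporaryFile() as tmp:
        # Closing the GzipFile writes the gzip trailer but leaves tmp open.
        with gzip.GzipFile(fileobj=tmp, mode="wb") as gzip_f:
            for chunk in content_stream:
                gzip_f.write(chunk)
        # rewind=True asks the client to seek back to the start before reading.
        blob.upload_from_file(tmp, rewind=True)
```

The trade-off is that the full compressed file has to exist locally before any bytes are sent, instead of being streamed chunk by chunk as blob.open("wb") does.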

I tried to prove to myself that there really is a situation where GCP can return a half-written file.

To do that, I ran the following code in a Jupyter notebook:

import httpx, random, time, datetime

def checkit():
    url = "https://staging-api.vaccinatethestates.com/v0/locations.json"
    url += "?" + str(random.random())
    response = httpx.head(url)
    return response.headers["content-length"], response.headers["etag"]

while True:
    print(datetime.datetime.now(), checkit())
    time.sleep(2)

Then I ran that code while hitting the POST https://vial-staging.calltheshots.us/api/exportVaccinateTheStates endpoint to see if I could spot a moment when the file was being partially written.

Here's the output from the interesting moment when the file changed:

2021-06-10 15:58:25.696015 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:27.784681 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:29.868349 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:31.968002 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:34.217241 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:36.609763 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:38.692306 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:40.772465 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:42.858199 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:44.971442 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:47.054310 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:49.152108 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:51.235609 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:53.333125 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:55.403930 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:57.559205 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:59.639862 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:59:01.707824 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')

Each of these hits is to a cache-busting https://staging-api.vaccinatethestates.com/v0/locations.json?random URL.

Note that the new version of the file (which is a good size, so it's not partially written) first becomes available at 2021-06-10 15:58:34.217241. But then we continue to randomly get back either the old file or the new file until the last appearance of the old file at 2021-06-10 15:58:53.333125, nearly 20 seconds later.
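
To put a number on that window, here is a small sketch that parses a handful of the log lines above (copied verbatim) and measures the gap between the first sighting of the new ETag and the last sighting of the old one. The helper names parse_line and mixed_window_seconds are hypothetical, and the sketch assumes the first line in the log always shows the old file:

```python
import ast
from datetime import datetime

# A subset of the output above, including the first new and last old responses.
LOG = """\
2021-06-10 15:58:31.968002 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:34.217241 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:36.609763 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:53.333125 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:55.403930 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
"""


def parse_line(line):
    """Split one printed line into (timestamp, content_length, etag)."""
    # The timestamp is always 26 characters; the printed tuple follows a space.
    when = datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")
    length, etag = ast.literal_eval(line[27:])
    return when, int(length), etag.strip('"')


def mixed_window_seconds(log):
    """Seconds between the first new-ETag hit and the last old-ETag hit."""
    rows = [parse_line(line) for line in log.splitlines() if line.strip()]
    old_etag = rows[0][2]  # assumption: the log starts with the old file
    first_new = next(when for when, _, etag in rows if etag != old_etag)
    last_old = max(when for when, _, etag in rows if etag == old_etag)
    return (last_old - first_new).total_seconds()


print(mixed_window_seconds(LOG))  # about 19.1 seconds
```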

But... I couldn't replicate the 0-byte JSON file bug that caused me to open this ticket - which makes me suspect that it was actually the result of the byte encoding bug from #658 (comment), which I fixed in 6cdbf73.

So I'm going to assume this issue is invalid, unless we spot it again in the future.