api.vaccinatethestates.com data inconsistent while the export is running
simonw opened this issue · comments
I spotted a nasty edge-case here: while the script was running,
curl https://api.vaccinatethestates.com/v0/locations.json
returned an empty JSON file, and continued to do so after the script had completed (presumably due to caching), but it returned the full file if I added a ? to the end:
curl https://api.vaccinatethestates.com/v0/locations.json?
Having done that, a hit to the regular URL does now return the 30MB JSON file as expected.
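The same plain-versus-cache-busted comparison can be scripted. This is only a sketch: the helper names `head_content_length` and `compare_plain_and_cache_busted` are made up for illustration, and it uses the standard library's `urllib` rather than anything from the VIAL codebase.

```python
import random
import urllib.request


def head_content_length(url):
    """Return the Content-Length header from a HEAD request, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Length")


def compare_plain_and_cache_busted(url, fetch=head_content_length):
    """Fetch the URL as-is and again with a throwaway query string appended.

    `fetch` is injectable (hypothetical design choice) so the comparison can
    be exercised without touching the network.
    """
    busted = url + "?" + str(random.random())
    return {"plain": fetch(url), "cache_busted": fetch(busted)}
```

If the two values differ, a cache in front of the bucket is a plausible suspect, though that alone does not prove it.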
Originally posted by @simonw in #658 (comment)
Here's the code that writes to the bucket:
vial/vaccinate/core/exporter/storage.py
Lines 61 to 72 in 93fe234
And here's that .write() method:
vial/vaccinate/core/exporter/storage.py
Lines 55 to 72 in 93fe234
Here's the API documentation for the Google Storage Python library method that this uses: https://googleapis.dev/python/storage/latest/blobs.html#google.cloud.storage.blob.Blob.open
Looking at the code for the Google Python client library:
When the file-like object is closed this code runs:
def close(self):
    self._checkClosed()  # Raises ValueError if closed.
    self._upload_chunks_from_buffer(1)
    self._buffer.close()
self._buffer is an io.BytesIO(), so it really does look like there's no attempt to say "Hey Google, this upload is now complete, you should make it visible". That reinforces my hunch that the data becomes partially available as the upload runs.
Maybe the blob.upload_from_file() method (which accepts a file-like object) would handle the atomicity of updates for us? https://github.com/googleapis/python-storage/blob/78b2eba81003b437cd24f2b8d269ea2455682507/google/cloud/storage/blob.py#L2175
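A minimal sketch of what switching to `upload_from_file()` might look like: buffer the whole export in memory, then hand it over in a single call. The function name `upload_in_one_call` and the `chunks` iterable are hypothetical, and `blob` is assumed to be a `google.cloud.storage.Blob` (or anything with a compatible `upload_from_file` method). Whether this actually changes when the new object becomes visible is not established here.

```python
import io


def upload_in_one_call(blob, chunks, content_type="application/json"):
    """Collect all chunks in memory, then upload them with one call.

    Returns the number of bytes handed to the blob.
    """
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)
    size = buffer.tell()
    # rewind=True asks the client to seek back to the start before reading.
    blob.upload_from_file(
        buffer, rewind=True, size=size, content_type=content_type
    )
    return size
```

The trade-off is memory: the full export (about 30MB for the production file) sits in the buffer before anything is sent.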
I tried to prove to myself that there really is a situation where GCP can return a half-written file.
To do that, I ran the following code in a Jupyter notebook:
import httpx, random, time, datetime

def checkit():
    url = "https://staging-api.vaccinatethestates.com/v0/locations.json"
    url += "?" + str(random.random())
    response = httpx.head(url)
    return response.headers["content-length"], response.headers["etag"]

while True:
    print(datetime.datetime.now(), checkit())
    time.sleep(2)
Then I ran that code while hitting the POST https://vial-staging.calltheshots.us/api/exportVaccinateTheStates endpoint to see if I could spot a moment when the file was being partially written.
Here's the output from the interesting moment when the file changed:
2021-06-10 15:58:25.696015 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:27.784681 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:29.868349 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:31.968002 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:34.217241 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:36.609763 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:38.692306 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:40.772465 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:42.858199 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:44.971442 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:47.054310 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:49.152108 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:51.235609 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:53.333125 ('5876138', '"7a6865abbe3ead48bcade38ce3dd8b2f"')
2021-06-10 15:58:55.403930 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:57.559205 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:58:59.639862 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
2021-06-10 15:59:01.707824 ('5876148', '"adc2ec7442af24ab279540464d2e5ddf"')
Each of these hits is to a cache-busting https://staging-api.vaccinatethestates.com/v0/locations.json?random URL.
Note that the new version of the file (which is a good size, so it's not partially written) first becomes available at 2021-06-10 15:58:34.217241, but then we continue to randomly get back either the old file or the new file until the last appearance of the old file at 2021-06-10 15:58:53.333125, nearly 20 seconds later.
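The length of that mixed window can be checked directly from the log. The sketch below copies the timestamp and etag pairs from the output above (trimmed to the span around the changeover) and measures the gap between the first sighting of the new etag and the last sighting of the old one. The helper name `mixed_window` is just for illustration.

```python
from datetime import datetime

OLD = "7a6865abbe3ead48bcade38ce3dd8b2f"
NEW = "adc2ec7442af24ab279540464d2e5ddf"

# (timestamp, etag) pairs copied from the polling output above.
OBSERVATIONS = [
    ("2021-06-10 15:58:31.968002", OLD),
    ("2021-06-10 15:58:34.217241", NEW),
    ("2021-06-10 15:58:36.609763", OLD),
    ("2021-06-10 15:58:38.692306", NEW),
    ("2021-06-10 15:58:40.772465", OLD),
    ("2021-06-10 15:58:42.858199", NEW),
    ("2021-06-10 15:58:44.971442", NEW),
    ("2021-06-10 15:58:47.054310", NEW),
    ("2021-06-10 15:58:49.152108", NEW),
    ("2021-06-10 15:58:51.235609", OLD),
    ("2021-06-10 15:58:53.333125", OLD),
    ("2021-06-10 15:58:55.403930", NEW),
]


def mixed_window(observations, old_etag, new_etag):
    """Return (first sighting of new, last sighting of old, gap between them)."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f"), etag)
        for ts, etag in observations
    ]
    first_new = min(t for t, etag in parsed if etag == new_etag)
    last_old = max(t for t, etag in parsed if etag == old_etag)
    return first_new, last_old, last_old - first_new


first_new, last_old, window = mixed_window(OBSERVATIONS, OLD, NEW)
print(f"old and new versions overlapped for {window.total_seconds():.1f}s")
```

Since the poll runs every two seconds or so, the true window could be a little longer on either side than the log shows.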
But... I couldn't replicate the 0-byte JSON file bug that caused me to open that ticket, which makes me suspect that it was actually the result of the byte encoding bug from #658 (comment), which I fixed in 6cdbf73.
So I'm going to assume this issue is invalid, unless we spot it again in the future.