counties.csv hasn't updated in 3 days

Question

mjwebster opened this issue 2 years ago

Describe the issue:

Last update on the counties.csv file seems to be 5/13/22, which is 3 days ago. Other files appear to have been updated earlier this morning.
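
For anyone wanting to check this from a local clone rather than the web view, a minimal sketch follows. The helper name last_updated is made up for illustration, and it reports the committer date of the most recent commit touching the file, which may differ slightly from when the data was generated.

```shell
# Hypothetical helper: print the date (YYYY-MM-DD) of the last commit
# that touched a given path. Run from inside a clone of the repo.
last_updated() {
    git log -1 --date=short --format=%cd -- "$1"
}
```

Calling `last_updated us-counties.csv` next to `last_updated us-counties-2022.csv` would show whether one file has fallen behind the other.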

Tiff Fehr · Answer 1 · Tue May 17 2022 06:37:12 GMT+0800 (China Standard Time)

Howdy, @mjwebster! Fan of your work. 😄
We've run into a limit with GitHub's raw file uploads.

This file is now updating. For anyone following along, the original us-counties.csv file is now almost at or over the GitHub file limit and will soon stop updating.
Originally posted by @albertsun in #674 (comment)

We're recommending people use the year-based county files from now on.

MaryJo Webster · Answer 2 · Tue May 17 2022 08:06:44 GMT+0800 (China Standard Time)

Thank you @tiffehr ! I was wondering if you guys might hit that at some point. It certainly is a lot of data.

Brian Klahn · Answer 3 · Fri May 20 2022 21:20:34 GMT+0800 (China Standard Time)

Thank you to @tiffehr , and everyone here, for maintaining this data source for people!!!

Yes . . . I wondered, when that file size got to exactly 100 MB . . . :-)
-and I did see the clear explanation in the README, about this, before chiming in here.

I know that for a lot of folks it breaks their "API" to have to switch to, e.g., us-counties-2022.csv.
Ideally, there are better ways to store snapshot-like data, than in text files (e.g. csv).
I don't think people can assume there is any promise of a consistent "API", here. But the change sneaks up on people, where the exact same file name, now kinda means a different data asset/expectation.

I wonder if it would mitigate things for folks if, say, us-counties.csv became a symlink to the latest year csv.
(Probably doing a git pull, on a local repo copy, is less bandwidth-intensive than always streaming down the full, uncompressed, GitHub raw version of that file. Git compacts things before any transfer operations, and only changes should be sent.)
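
On the git pull point, a rough sketch of the clone-once, pull-afterwards pattern. The helper name sync_repo is made up, and the first clone still downloads the whole history, so the saving only shows up on later runs.

```shell
# Hypothetical helper: clone on the first run, fetch only new commits afterwards.
# Usage: sync_repo <repo-url> <local-dir>
sync_repo() {
    url=$1
    dir=$2
    if [ -d "$dir/.git" ]; then
        # Existing clone: only objects missing locally are transferred.
        git -C "$dir" pull --quiet --ff-only
    else
        # No clone yet: fetch the full repository once.
        git clone --quiet "$url" "$dir"
    fi
}
```

Usage would look like `sync_repo https://github.com/nytimes/covid-19-data.git covid-19-data`, where the URL is an assumption about where this repo lives. That is the end of the git pull aside; the symlink idea continues below.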

Something like . . .

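# keep the full-history file under a new name
git mv us-counties.csv us-counties-full-legacy.csv
# point the old name at the current year's file
ln -s us-counties-2022.csv us-counties.csv
git add us-counties.csv
git commit -m "Make us-counties.csv a symlink to us-counties-2022.csv"
git push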

(might need/want two commit steps, due to same name for tracked old csv and new symlink)

Even if the old file is "deleted" with a git rm (to make way for a symlink of the same name), it is still easily accessible via the git history, if needed.
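
A rough sketch of pulling the old file back out of history, run from the root of a local clone. The helper name and output file name are placeholders, and it assumes the most recent commit touching the path is the one that removed or replaced it; if the file was simply updated since, this returns the version before that update instead.

```shell
# Hypothetical helper: write the previous version of a path to a file.
# Usage: restore_from_history <path-in-repo> <output-file>
restore_from_history() {
    path=$1
    out=$2
    # Most recent commit that touched the path (assumed: the removal or replacement).
    rev=$(git rev-list -n 1 HEAD -- "$path") || return 1
    [ -n "$rev" ] || return 1
    # The parent of that commit still holds the earlier contents.
    git show "$rev^:$path" > "$out"
}
```

For example, `restore_from_history us-counties.csv us-counties-full.csv` would write the last full-history copy next to the current files.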

Albert Sun · Answer 4 · Fri May 20 2022 22:45:59 GMT+0800 (China Standard Time)

@bdklahn we wouldn't do that because the new file is not of the same format as it does not contain the whole history of the file.

Unfortunately we think in this case it's best for people to manually see the change and update any processes they are running using the data to use the new format.

Brian Klahn · Answer 5 · Fri May 20 2022 23:22:09 GMT+0800 (China Standard Time)

I understand.
But the update already changes the fundamental format of that file to no longer contain the whole history to date.
So, I wondered, since the fundamental format was already changed, if another version of fundamental change might be less disruptive for folks.
If folks need older data, they can always go back to a previous git snapshot to pull that big file, or whatever.

Anyway, I can pretty easily adjust local scripts, etc., to reconstruct what us-counties.csv used to be, if necessary. I just wondered . . .
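
For anyone in the same spot, a minimal sketch of that reconstruction. It assumes the year-based files sit in one directory, are named us-counties-YYYY.csv, share an identical header row, and each end with a newline; the function name rebuild_counties is made up for illustration.

```shell
# Hypothetical helper: concatenate year-based files into one combined CSV.
# Usage: rebuild_counties <dir-with-yearly-files> <output-file>
rebuild_counties() {
    src_dir=$1
    out_file=$2
    first=1
    # The glob expands in sorted order, so years are appended chronologically.
    for f in "$src_dir"/us-counties-[0-9][0-9][0-9][0-9].csv; do
        [ -e "$f" ] || continue              # no matches: skip the literal pattern
        if [ "$first" -eq 1 ]; then
            cat "$f" > "$out_file"           # first file: keep its header row
            first=0
        else
            tail -n +2 "$f" >> "$out_file"   # later files: drop the header row
        fi
    done
    [ "$first" -eq 0 ]                       # non-zero status if nothing was found
}
```

Something like `rebuild_counties . us-counties.csv` in a fresh clone would then produce a combined file; whether it matches the old file row for row is not something this sketch checks.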

Simple file names like us-counties.csv, in a regularly updated git repo, (vs., say, something like us-counties-YYYY-MM-DD.csv) are often inferred by people to mean "current data".
Anyway all this COVID time series data stuff . . . is hard to snapshot, anyway, given "backfill" updates which change "history", and similar.

So I appreciate any of this data wrangling effort you folks are already doing.

We'll deal with it.

Thank you!

Glenn Willen · Answer 6 · Sun May 29 2022 11:02:00 GMT+0800 (China Standard Time)

It looks like this caused the Google Cloud "New York Times US Coronavirus Database" dataset to stop updating at the same time: https://console.cloud.google.com/marketplace/product/the-new-york-times/covid19_us_cases

That page says it's based on the data in this repo, but I don't know who maintains it -- is it possible to get it updating again? A tool I use, covid-19.direct, has stopped updating in turn from the Google Cloud dataset.

Thanks!

Tiff Fehr · Answer 7 · Tue May 31 2022 23:01:56 GMT+0800 (China Standard Time)

@gwillen We don't know who owns that (but it's fun to see it listed and use it in BigQuery's SQL workspace).

Update: based on the Marketplace page, looks like Google runs it in order to promote BQ.

https://mail.google.com/mail/u/0/?view=cm&fs=1&to=public-data-help@google.com&su=Public%20Datasets%20Issue:%20[INSERT%20ISSUE%20SUBJECT%20HERE]....

I'll ping that email and see who responds.

Glenn Willen · Answer 8 · Wed Jun 01 2022 00:50:38 GMT+0800 (China Standard Time)

Thanks very much!! I greatly appreciate your help.