[Feature] Additional data sources

rudigiesler opened this issue 2 years ago

Is your feature request related to a problem? Please describe.
For the ContactNDoH WhatsApp line, the COVID-19 case data is currently updated manually every day, and we want to automate it. We've tried to get access to an official API from the NDoH, but haven't succeeded yet, so we're investigating scraping the data from official sources. For that we need: total cases, new/latest cases (past day) both in total and per province, total full recoveries, total deaths, total vaccines administered, and the timestamp of the last update.

For total cases and new/latest cases, the data is stored in covid19za_provincial_cumulative_timeline_confirmed.csv, which seems to be pulled either from the HTML of the NICD's website or by gis_nicd_scraper, but after some brief searching I cannot find where that is defined in the repo. The same information is shown at https://sacoronavirus.co.za/covid-19-daily-cases/ , which has an embedded dashboard that gets its data from https://gis.nicd.ac.za/hosting/rest/services/WARDS_MN/MapServer/0/query , a JSON API from which we can easily pull data. It provides data all the way down to ward level, but it only supplies current totals and the latest totals for the last day, not historical data. I didn't see this data source being used in this repo, and I'm wondering if there's a reason for that.
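To make the GIS option concrete, here is a minimal sketch of how a request URL for that endpoint could be built. It assumes the endpoint accepts the standard ArcGIS REST query parameters (where, outFields, returnGeometry, f); the function name build_query_url is made up for illustration, and the field names returned by the layer are not known here, so the sketch asks for all fields.

```python
from urllib.parse import urlencode

# Endpoint quoted in the issue text above.
BASE_URL = "https://gis.nicd.ac.za/hosting/rest/services/WARDS_MN/MapServer/0/query"


def build_query_url(where="1=1", out_fields="*"):
    """Build a query URL for the layer (hypothetical helper).

    Assumes the server honours the usual ArcGIS REST query parameters.
    "1=1" matches every record and "*" requests every field.
    """
    params = {
        "where": where,
        "outFields": out_fields,
        "returnGeometry": "false",  # attributes only, no ward polygons
        "f": "json",                # ask for a JSON response
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url())
```

The resulting URL could then be fetched with any HTTP client. Since the endpoint only exposes current and latest values, a history would have to be built by polling and storing the responses.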

For full recoveries, deaths, and vaccines administered, there are totals at the bottom of the homepage https://sacoronavirus.co.za/ , which are quite easily scraped. These are unfortunately just served as HTML, so I couldn't find a source for where these numbers are being pulled. In this repo, it seems like that is being fetched from the daily images using OCR.

Describe the solution you'd like
Initially I was going to create a background task that would scrape the data from the above mentioned sources, store it in a database, and expose an API for the data, along with some basic checks of the data to ensure accuracy (ensure that totals are always increasing, poll every hour and if the data has changed then append with a timestamp, etc).
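As a rough illustration of the checks described above (totals never decrease, and a new row is stored only when the polled data has changed), something like the following could work. This is a sketch under assumptions: the history is an in-memory list rather than a database, and the names append_if_changed, "cases", and "deaths" are hypothetical.

```python
from datetime import datetime, timezone


def append_if_changed(history, snapshot, keys, now=None):
    """Append a polled snapshot to history if it is new and plausible.

    history:  list of dicts already stored (oldest first)
    snapshot: dict of cumulative counters from the latest poll
    keys:     names of the counters to compare
    Returns "appended", "unchanged", or "rejected".
    """
    now = now or datetime.now(timezone.utc)
    if history:
        last = history[-1]
        # Nothing changed since the last poll: do not store a duplicate.
        if all(snapshot[k] == last[k] for k in keys):
            return "unchanged"
        # Cumulative totals should never go down; treat that as bad data.
        if any(snapshot[k] < last[k] for k in keys):
            return "rejected"
    history.append({**snapshot, "timestamp": now})
    return "appended"


if __name__ == "__main__":
    rows = []
    fields = ["cases", "deaths"]
    print(append_if_changed(rows, {"cases": 10, "deaths": 1}, fields))
    print(append_if_changed(rows, {"cases": 10, "deaths": 1}, fields))
    print(append_if_changed(rows, {"cases": 9, "deaths": 1}, fields))
```

A rejected snapshot would probably warrant an alert rather than being silently dropped, since official sources do occasionally publish corrections.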

If there's a way that we could not duplicate efforts, then that would be great. I'd like to understand if there are any reasons for scraping from the sources in this repo, vs the sources I have listed above.

We'll probably want to host the scraper and database on our servers, to ensure that we can fix things quickly if they break, but it will be open source code, and an open API that this repo could scrape.

Describe alternatives you've considered

Getting an official API from the NDoH (not happening any time soon).
Expanding the HTTP API in this repo to expose the information that we need (concerns around uptime and speed of fixes, also running a large production system off of it doesn't seem right if we're not paying for the cost of it)
Building an API off of the CSV data in this repo (CSVs on GitHub are a bit difficult for us to deploy; we would preferably want this data in a database served by an API)

Additional context
You can find the current manually updated content on the WhatsApp line here: https://wa.me/27600123456?text=cases

Rudi · Answer 1 · Wed Dec 22 2021 17:12:39 GMT+0800 (China Standard Time)

I've put together scraping and APIs for the 3 sources:

For the images, https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/sacoronavirus_images/

For the counters on the homepage: https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/sacoronavirus_counters/

For the NICD GIS: after we started scraping and built up a history of this source, it turned out that it's not updated very regularly and is not very reliable (the latest field is often just 0). So we won't be using it to get more detailed breakdowns, but it will continue to be scraped and stored. It's available at https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/wardcase/ (along with /province, /district, /subdistrict, and /ward), but there's also a flat/denormalized view at https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/wardcase/flat

https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/contactndoh/ will give you the latest image and counter data, and if we have the day before that, the daily counts. This is what we use to generate the message on ContactNDoH.
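The daily counts mentioned here are the difference between two consecutive days of cumulative totals. A minimal sketch of that calculation follows; it is not the actual implementation behind the endpoint, and the record shape and field names ("date", "cases", "deaths") are assumptions for illustration only.

```python
from datetime import date, timedelta


def daily_counts(latest, previous):
    """Return per-day increases, or None if the previous day is missing.

    Both arguments are dicts of cumulative totals plus a "date" key
    (hypothetical shape). A result is only produced when `previous`
    is exactly one day before `latest`.
    """
    if previous is None:
        return None
    if latest["date"] - previous["date"] != timedelta(days=1):
        return None
    return {k: latest[k] - previous[k] for k in latest if k != "date"}


if __name__ == "__main__":
    today = {"date": date(2021, 12, 22), "cases": 120, "deaths": 5}
    yesterday = {"date": date(2021, 12, 21), "cases": 100, "deaths": 4}
    print(daily_counts(today, yesterday))
```

Returning None when the previous day is absent mirrors the behaviour described above, where daily counts are only given if the day before is available.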

Autogenerated docs are available at https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/docs .

Rudi · Answer 2 · Tue Jan 30 2024 22:30:35 GMT+0800 (China Standard Time)

Just an update: this scraping has broken, but since we're no longer using it for any of our services, we don't want to commit resources to fixing it. The data at this API will therefore no longer receive updates.