CAVaccineInventory / vial

The Django application powering calltheshots.us

Home Page:https://vial.calltheshots.us

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Populate Location fields: vaccines_offered, accepts_appointments and accepts_walkins

simonw opened this issue · comments

Replaces #504 - needed by #649.

From https://docs.google.com/document/d/17svyCVXcloArj1wbUgu7QwEb6xeIDa-p7C3U0_yLgKg/edit

BASIC RULES

  • For durable fields (more-or-less: info about the location itself), we should not update existing information with each run of the ingestor / each new data source
  • Open question: If a new source has data in a currently-empty field, should we import it?
  • For frequently-changing fields (more-or-less: info about vaccine distribution at the site), we should update information with each report, each run of the ingestor, and each new data source

FIELDS THAT INGESTORS SHOULD BE ABLE TO OVERWRITE

  • Which vaccines are available
  • The "Appointments available" tag
  • The "Walk-ins accepted" / "Appointment required" / "Appointments or walk-ins accepted" tags
  • Plain-text "notes about this location" (if the feed has them; this is equivalent to the "Public notes" fields on locations)

FIELDS THAT INGESTORS SHOULD NEVER OVERWRITE

  • Address, phone #, or hours
  • Could be: okay to overwrite unless they've been manually confirmed or edited in some way?
  • Website
  • Phone #
  • Hours

SPECIFIC ONE-OFF RULES:

  • We trust VF more than anything else for which vaccines are available. If the location exists in VF, we should show VF's versions of which vaccines are available.
  • We trust Vaccine Spotter more than anything else for whether appointments are available
  • If we have public notes from a manual report that are less than a week old, we trust those more than any ingested notes field

The fields in question, added in #494:

vaccines_offered = models.JSONField(
null=True,
blank=True,
help_text="JSON array of strings representing vaccines on offer here - enter 'null' if we do not know",
)
accepts_appointments = models.BooleanField(
null=True, blank=True, help_text="Does this location accept appointments"
)
accepts_walkins = models.BooleanField(
null=True, blank=True, help_text="Does this location accept walkins"
)
public_notes = models.TextField(blank=True, null=True)

These were editable in the VIAL interface for a while - they are no longer editable (or at least they are hidden by default). The number of locations with these fields populated is:

https://vial.calltheshots.us/dashboard/?sql=select+%27vaccines_offered%27+as+label%2C+count%28%2A%29+from+location+where+vaccines_offered+is+not+null%0D%0Aunion+all%0D%0Aselect+%27accepts_appointments%27+as+label%2C+count%28%2A%29+from+location+where+accepts_appointments+is+not+null%0D%0Aunion+all%0D%0Aselect+%27accepts_walkins%27+as+label%2C+count%28%2A%29+from+location+where+accepts_walkins+is+not+null%0D%0Aunion+all%0D%0Aselect+%27public_notes%27+as+label%2C+count%28%2A%29+from+location+where+public_notes+is+not+null+and+public_notes+%21%3D+%27%27%3A8j--LJ1Eyo7eKpaJLN7jOVQGdUI_Y3MBPeygfoytBJ4&sql=select%3AfGunWs486BWXY9jrd1QIMrBUTKRwu9OMHinDrR7m2Rg

label	count
vaccines_offered	109
accepts_walkins	133
accepts_appointments	158
public_notes	45

This is from the short period of time when these were editable. I'm going to export that data and otherwise pretend it didn't exist.

select
  id, public_id, name, full_address,
  vaccines_offered, accepts_walkins, accepts_appointments, public_notes
from
  location
where (
  vaccines_offered is not null
    or
  accepts_appointments is not null
    or 
  accepts_walkins is not null
    or
  (public_notes is not null and public_notes != '')
)

Exported data is here: https://gist.github.com/simonw/d7644d1f444bc4221b3b284f73468360

There are two steps here: backfill the existing data, and ensure that when new reports or source locations are ingested the data is updated to reflect our best available versions.

with source_location_info as (
  select
    id,
    source_uid,
    json_extract_path(import_json::json, 'availability') as availability,
    json_extract_path(import_json::json, 'inventory') as inventory
  from
    source_location
)
select
  count(*)
from
  source_location_info
where
  availability is not null
  or inventory is not null

Too long to run through the dashboard, so I used an unlimited local connection - returned 155,304 - select count(*) from source_location returns 238,815

with source_location_info as (
  select
    id,
    source_uid,
    matched_location_id,
    json_extract_path(import_json::json, 'availability') as availability,
    json_extract_path(import_json::json, 'inventory') as inventory
  from
    source_location
)
select
  count(distinct matched_location_id)
from
  source_location_info
where
  availability is not null
  or inventory is not null

Takes 17.8s and returns 63,730 - and since select count(*) from location where soft_deleted = false returns 75,478 the VAST majority of our locations have useful availability or inventory information in their matched source locations.

I changed that where to:

where
  availability is not null
  and inventory is not null

And it returned 50,228 locations that have a matched source location with BOTH of those fields.

As for reports...

with skip_reports as (
  select report_id from call_report_availability_tag where availabilitytag_id = (
    select id from availability_tag where "group" = 'skip'
  )
)
select count(distinct location_id) from report where id not in (select report_id from skip_reports) and vaccines_offered is not null

https://vial.calltheshots.us/dashboard/?sql=with+skip_reports+as+%28%0D%0A++select+report_id+from+call_report_availability_tag+where+availabilitytag_id+%3D+%28select+id+from+availability_tag+where+%22group%22+%3D+%27skip%27%29%0D%0A%29%0D%0Aselect+count%28distinct+location_id%29+from+report+where+id+not+in+%28select+report_id+from+skip_reports%29+and+vaccines_offered+is+not+null%3AKJargSIxiGOHI9fxg5CQLIfe03j2yIH-2ZP2LwRLlHY

Returns 15,760 - there are 15,760 locations for which we have at least one non-skip report which has populated vaccines_offered data.

One way to look at this is that we have a sequence of opinions about which vaccines are offered - from imported source locations and from reports.

Slight hitch is that since we over-write source locations when we import them we don't have the full history of those opinions stored in our PostgreSQL database - though we likely have them in a git history somewhere.

Also interesting: which of our locations have the most matched source locations?

https://vial.calltheshots.us/dashboard/?sql=select%20%22public_id%22%2C%20count%28%2A%29%20as%20n%20from%20%28select%20source_uid%2C%20location.public_id%20from%20source_location%20join%20location%20on%20source_location.matched_location_id%20%3D%20location.id%29%20as%20results%20group%20by%20%22public_id%22%20order%20by%20n%20desc%3AJZLqOzIQWCMC6HTTdvtjaR0YPMDy89R9V_9x3ngSxSQ

lpptz	63
lxfpc	50
lkqqg	43
ltgbp	33
lxfqk	31
ltfrw	26
lytcz	26
ltchz	25
ltbwh	24
ldmgq	24
lzhfr	23
ldmpt	23
ldmbg	22
lxdxp	22
lxtrg	21
ldkyw	21
lrcqm	21
ldktw	21
lxdxm	21
ldqyp	20
lrbxc	20
ltghm	20
ldwhh	20
lhydw	20
ldwzy	20
ldmfh	20
lcdfch	20
lcdffy	20
ldkyz	20
ldqxr	20
ldkwx	20
ldkyr	20
ldkzr	20
ldmzt	20
ldkwy	20
ldkwh	20
ldkyh	20
ldxqy	20
lyhch	20
ldkzz	20
ldkym	19

https://vial-staging.calltheshots.us/location/lpptz is the top one - that's Walgreens Co. #19134 in CT - because of the CT scrapers: https://vial.calltheshots.us/dashboard/?sql=select+source_uid%2C+source_name%2C+name+from+source_location+where+matched_location_id+%3D+35007%3AncVcHDs5Axbk6K_g7YTGp8BW-m7RT3HWkoABzBgzpgE

Actually that looks bad - I think a bunch of different CT Walgreens may have been incorrectly matched:

source_uid	source_name	name
ct_covidvaccinefinder_gov:605a3748494866ed91e08065	ct_covidvaccinefinder_gov	Walgreens Pharmacy (Glastonbury)
ct_covidvaccinefinder_gov:605a374a494866ed91e0807c	ct_covidvaccinefinder_gov	Walgreens Pharmacy (Hartford, Albany Ave)
ct_covidvaccinefinder_gov:605a374a494866ed91e0807f	ct_covidvaccinefinder_gov	Walgreens Pharmacy (Bristol, Main St)
ct_covidvaccinefinder_gov:605a374c494866ed91e0808a	ct_covidvaccinefinder_gov	Walgreens Pharmacy (Terryville)
ct_covidvaccinefinder_gov:605a374d494866ed91e08095	ct_covidvaccinefinder_gov	Walgreens Pharmacy (Thomaston)
ct_covidvaccinefinder_gov:605a3750494866ed91e080ba	ct_covidvaccinefinder_gov	Walgreens Pharmacy (Litchfield)

I'm going to exclude ct_covidvaccinefinder_gov from this project for the moment.

Sample of the values of availability and inventory taken by sampling the most recent 10,000 source locations that have them:

Availability: https://vial.calltheshots.us/dashboard/?sql=select+availability%3A%3Atext%2C+count%28%2A%29+as+n+from+%28with+source_location_info+as+%28%0D%0A++select%0D%0A++++json_extract_path%28import_json%3A%3Ajson%2C+%27availability%27%29+as+availability%2C%0D%0A++++json_extract_path%28import_json%3A%3Ajson%2C+%27inventory%27%29+as+inventory%2C%0D%0A++++id%2C+source_uid%2C+source_name%2C+name%2C+created_at%2C+matched_location_id%2C+last_imported_at%0D%0A++from%0D%0A++++source_location%0D%0A%29%0D%0Aselect%0D%0A++%2A%0D%0Afrom%0D%0A++source_location_info%0D%0Awhere%0D%0A++availability+is+not+null%0D%0A++or+inventory+is+not+null%0D%0Aorder+by+id+desc+limit+2000%0D%0A%29+as+results+group+by+availability%3A%3Atext+order+by+n+desc%3AeRwjzoJfg5JqhIgsoZyxzQ_vlFClqtZ4i0g84tza1_s

SQL__select_availability__text__count____as_n_from__with_source_location_info_as___select_json_extract_path_import_json__json___availability___as_availability__json_extract_path_import_json__json___inventory___as_inventory__id__source_uid__

Inventory: https://vial.calltheshots.us/dashboard/?sql=select+inventory%3A%3Atext%2C+count(*)+as+n+from+(with+source_location_info+as+(%0D%0A++select%0D%0A++++json_extract_path(import_json%3A%3Ajson%2C+%27availability%27)+as+availability%2C%0D%0A++++json_extract_path(import_json%3A%3Ajson%2C+%27inventory%27)+as+inventory%2C%0D%0A++++id%2C+source_uid%2C+source_name%2C+name%2C+created_at%2C+matched_location_id%2C+last_imported_at%0D%0A++from%0D%0A++++source_location%0D%0A)%0D%0Aselect%0D%0A++*%0D%0Afrom%0D%0A++source_location_info%0D%0Awhere%0D%0A++availability+is+not+null%0D%0A++or+inventory+is+not+null%0D%0Aorder+by+id+desc+limit+10000%0D%0A)+as+results+group+by+inventory%3A%3Atext+order+by+n+desc%3Aiy4Iupdi-OBx0VTi0hN349c7JoqWNwjaR3OBi0LSEbo

SQL__select_inventory__text__count____as_n_from__with_source_location_info_as___select_json_extract_path_import_json__json___availability___as_availability__json_extract_path_import_json__json___inventory___as_inventory__id__source_uid__sou

So those don't always have supply levels.

I ran this (took 20s so not through Django SQL Dashboard) to see which sources have both drop_in and appointment records:

select "source_name", count(*) as n from (with source_location_info as (
  select
    json_extract_path(import_json::json, 'availability', 'drop_in') as drop_in,
    json_extract_path(import_json::json, 'availability', 'appointments') as appointments,
import_json,
    id, source_uid, source_name, name, created_at, matched_location_id, last_imported_at
  from
    source_location
)
select
  *
from
  source_location_info
where
  drop_in is not null and appointments is not null) as results group by "source_name" order by n desc
vaccinate_nj	1088
nyc_arcgis	590
sf_gov	74
mn_gov	22
dc_district	9
ct_state	5

Looks like vaccinefinder_org scrapes only ever have the drop_in true or false key, never the appointments key.

select json_extract_path(import_json::json, 'availability')::text as availability, count(*)
from source_location where source_name = 'vaccinefinder_org'
group by json_extract_path(import_json::json, 'availability')::text
{"drop_in": false}	41252
{"drop_in": true}	8259
	4584

I'm adding the following Location columns, in order to better understand the source of data that we show to our users:

+    vaccines_offered_provenance_report = models.ForeignKey(
+        "Report",
+        null=True,
+        blank=True,
+        related_name="+",
+        help_text="The report that last populated vaccines_offered",
+        on_delete=models.PROTECT,
+    )
+    vaccines_offered_provenance_source_location = models.ForeignKey(
+        "SourceLocation",
+        null=True,
+        blank=True,
+        related_name="+",
+        help_text="The source location that last populated vaccines_offered",
+        on_delete=models.PROTECT,
+    )
+    vaccines_offered_last_updated_at = models.DateTimeField(
+        help_text="When vaccines_offered was last updated",
+        blank=True,
+        null=True,
+    )
+
+    appointments_walkins_provenance_report = models.ForeignKey(
+        "Report",
+        null=True,
+        blank=True,
+        related_name="+",
+        help_text="The report that last populated accepts_walkins and accepts_appointments",
+        on_delete=models.PROTECT,
+    )
+    appointments_walkins_provenance_source_location = models.ForeignKey(
+        "SourceLocation",
+        null=True,
+        blank=True,
+        related_name="+",
+        help_text="The source location that last populated accepts_walkins and accepts_appointments",
+        on_delete=models.PROTECT,
+    )
+    appointments_walkins_last_updated_at = models.DateTimeField(
+        help_text="When accepts_walkins and accepts_appointments were last updated",
+        blank=True,
+        null=True,
+    )

My code doesn't (yet) populate those new columns - that's what the save=False parameter is going to do.

I'm pushing this live to staging and then I'll track down a bunch of interesting examples - locations with multiple reports and source locations - that I can use to demonstrate what the derive_availability_and_inventory() method is currently doing.

Once I'm comfortable with the behaviour of that method, I'll do the following:

  • Hook up the save=True parameter to populate those new database columns
  • Call that method any time a new report is filed, an updated source location is ingested or a new source location is matched
  • Add an API method we can call to apply that method to one (or maybe multiple) locations
  • Write a script that uses that new API method to backfill all of the data
  • Start returning the data from our various APIs

I want to find good example locations for this - locations that have both source locations AND reports against them which cover vaccines offered and availability.

Problem: we don't seem to have any on staging. Here's a query:

 with last_1000_vaccine_source_locations as (
  select * from source_location where json_extract_path(import_json::json, 'inventory') is not null
  order by id desc limit 1000
),
all_reports_with_vaccines as (
  select * from report where vaccines_offered is not null
)
select * from location where id in (
  select matched_location_id from last_1000_vaccine_source_locations
) and id in (
  select location_id from all_reports_with_vaccines
) limit 100

On staging, 0 results: https://vial-staging.calltheshots.us/dashboard/?sql=with+last_1000_vaccine_source_locations+as+(%0D%0A++select+*+from+source_location+where+json_extract_path(import_json%3A%3Ajson%2C+%27inventory%27)+is+not+null%0D%0A++order+by+id+desc+limit+1000%0D%0A)%2C%0D%0Aall_reports_with_vaccines+as+(%0D%0A++select+*+from+report+where+vaccines_offered+is+not+null%0D%0A)%0D%0Aselect+*+from+location+where+id+in+(%0D%0A++select+matched_location_id+from+last_1000_vaccine_source_locations%0D%0A)+and+id+in+(%0D%0A++select+location_id+from+all_reports_with_vaccines%0D%0A)+limit+10%3AkILYiM3iSoJ-PPLcMdgyz3RPr2xnMMh1UWzgBnvXRcg

On production, 65: https://vial.calltheshots.us/dashboard/?sql=with+last_1000_vaccine_source_locations+as+%28%0D%0A++select+%2A+from+source_location+where+json_extract_path%28import_json%3A%3Ajson%2C+%27inventory%27%29+is+not+null%0D%0A++order+by+id+desc+limit+1000%0D%0A%29%2C%0D%0Aall_reports_with_vaccines+as+%28%0D%0A++select+%2A+from+report+where+vaccines_offered+is+not+null%0D%0A%29%0D%0Aselect+%2A+from+location+where+id+in+%28%0D%0A++select+matched_location_id+from+last_1000_vaccine_source_locations%0D%0A%29+and+id+in+%28%0D%0A++select+location_id+from+all_reports_with_vaccines%0D%0A%29+limit+100%3AzRfr7zN9tlDxD50JIB_mKYd5XEzMQ3BinNCBS3RMBNc

Since the code I've written so far is completely safe - it shows things on a debug page but doesn't update any database records - I'm going to ship it to production in order to see more examples.

My biggest question here is around the trustworthiness of our scrapers. We don't want to discard information from a high-
trustworthy source because some other scraper submitted a more recent source location containing less useful data.

I think the fix for that is going to be allow-listing the scrapers - maybe even start with only vaccinefinder_gov scrapers contributing to the result here. That's a small code change that can happen here:

most_recent_source_location_on_vaccines_offered = (
self.matched_source_locations.all()
.order_by("-last_imported_at")
.exclude(import_json__inventory=None)
.first()
)

most_recent_source_location_on_availability = (
self.matched_source_locations.all()
.order_by("-last_imported_at")
.exclude(import_json__availability=None)
.first()
)

I can use the new debug information on /location/x to confirm if this is a good idea or not, by first identifying good examples of locations that have different source location source names with different opinions on vaccines and availability.

https://vial.calltheshots.us/dashboard/?sql=select+location.public_id%2C+count%28%2A%29%2C+array_agg%28source_location.source_name%29%2C+count%28distinct+source_location.source_name%29+as+source_name_count%0D%0Afrom+source_location+join+location+on+source_location.matched_location_id+%3D+location.id%0D%0Agroup+by+location.public_id%0D%0Ahaving+count%28%2A%29+%3E+1%0D%0Aorder+by+source_name_count+desc%3AyJHHgxP6WhS5hZOPmDJxJznetZsjW3n4W6rXXvgiYHs

select location.public_id, count(*), array_agg(source_location.source_name), count(distinct source_location.source_name) as source_name_count
from source_location join location on source_location.matched_location_id = location.id
group by location.public_id
having count(*) > 1
order by source_name_count desc

Here's a top result from that: https://vial.calltheshots.us/location/lqwzd

Carrs_Pharmacy__1809_-_VIAL

This variant of that query returns only locations that also have at least one non-skip report:

select
  location.public_id,
  count(*) as num_source_locations,
  array_agg(distinct source_location.source_name),
  count(distinct source_location.source_name) as num_distinct_source_names
from
  source_location
  join location on source_location.matched_location_id = location.id
where
  -- Only locations that have at least one non-skip report
  location.id in (
    select location_id from report where report.id not in (select report_id from call_report_availability_tag where availabilitytag_id = 20)
  )
group by
  location.public_id
having count(*) > 1
order by num_distinct_source_names desc

https://vial.calltheshots.us/dashboard/?sql=select%0D%0A++location.public_id%2C%0D%0A++count%28%2A%29+as+num_source_locations%2C%0D%0A++array_agg%28distinct+source_location.source_name%29%2C%0D%0A++count%28distinct+source_location.source_name%29+as+num_distinct_source_names%0D%0Afrom%0D%0A++source_location%0D%0A++join+location+on+source_location.matched_location_id+%3D+location.id%0D%0Awhere%0D%0A++--+Only+locations+that+have+at+least+one+non-skip+report%0D%0A++location.id+in+%28%0D%0A++++select+location_id+from+report+where+report.id+not+in+%28select+report_id+from+call_report_availability_tag+where+availabilitytag_id+%3D+20%29%0D%0A++%29%0D%0Agroup+by%0D%0A++location.public_id%0D%0Ahaving+count%28%2A%29+%3E+1%0D%0Aorder+by+num_distinct_source_names+desc%3ABvQBhHy5ELiTQIN_jTAcxtYxRjQfYn8EtUUJUMuru-I

https://vial.calltheshots.us/dashboard/?sql=select%20%22json_extract_path%22%2C%20count%28%2A%29%20as%20n%20from%20%28select%20json_extract_path%28import_json%3A%3Ajson%2C%20%27availability%27%29%3A%3Atext%20from%20source_location%20where%20source_name%20%3D%20%27getmyvax_org%27%29%20as%20results%20group%20by%20%22json_extract_path%22%20order%20by%20n%20desc%3APE6S3PfuMROXLq15I658NslW4tlc2PMasbC3HeyDqBI confirms that getmyvax_org has NEVER returned a "dropins" key, it only ever populates appointments:

SQL__select__json_extract_path___count____as_n_from__select_json_extract_path_import_json__json___availability____text_from_source_location_where_source_name____getmyvax_org___as_results_group_by__json_extract_path__order_by_n_desc

Running this against the DB:

select "json_extract_path", count(*) as n from (select json_extract_path(import_json::json, 'availability')::text from source_location ) as results group by "json_extract_path" order by n desc

Returns this:

NULL	94952
{"appointments": true}	75839
{"drop_in": false}	41698
{"appointments": false}	12170
{"drop_in": true}	8684
{}	5143
{"drop_in": false, "appointments": true}	1288
{"drop_in": false, "appointments": false}	384
{"drop_in": true, "appointments": true}	102
{"drop_in": true, "appointments": false}	16

So the drop_in key is set to true for 8,000+ locations.

Maybe if the most recent source location has no explicit opinion on drop-ins we should fall back to the most recent report, if one exists?

I'm going to upgrade the display of that derived data on the /location page to make it much easier to interpret.

Examples I need to find:

  • A location with many reports and no source locations
  • A location with many source locations and no reports
  • A location with a mix of source locations and reports where a report is most recent
  • A location with a mix of source locations and reports where a source location is most recent

And some reports examples using the query from #650 (comment)

Variant of above query:

select
  location.public_id,
  count(*) as num_source_locations,
  array_agg(distinct source_location.source_name),
  count(distinct source_location.source_name) as num_distinct_source_names
from
  source_location
  join location on source_location.matched_location_id = location.id
where
  -- Only locations that have at least one non-skip report
  location.id in (
    select location_id from report where report.id not in (select report_id from call_report_availability_tag where availabilitytag_id = 20)
  )
group by
  location.public_id
having count(*) > 1
and 'vaccinefinder_org' = any(array_agg(distinct source_location.source_name))
order by num_distinct_source_names desc

The and 'vaccinefinder_org' = any(array_agg(distinct source_location.source_name)) bit in the HAVING clause returns locations that have at least one vaccinefinder_org.

I'm going to wrap up this work up by:

The fact that vaccinespotter_org doesn't populate inventory does not mean we need to exclude it - this code here only considers source locations that DID populate inventory:

most_recent_source_location_on_vaccines_offered = (
self.matched_source_locations.all()
.order_by("-last_imported_at")
.exclude(import_json__inventory=None)
.first()
)

Just realized I need to exclude reports with is_pending_review=True as well. (Already excluding soft_deleted.)

I need to do another in-depth review of places that might add/remove/edit source locations and reports to make sure they all update the derived data correctly.

Here's a progress report on how population is going based on imported source locations and reports:

https://vial.calltheshots.us/dashboard/?sql=select+count%28%2A%29+from+location+where+vaccines_offered_last_updated_at+is+not+null%3Arp-wF28ywmDjfXh-Ie0-SSYdjkFpFOIr79xGimH7BTU&sql=select+count%28%2A%29+from+location+where+appointments_walkins_last_updated_at+is+not+null%3A8e3BMBVMaNJWTi34eaHVsk1uTJB_jPJI0Hge3VJE4kM

select count(*) from location where vaccines_offered_last_updated_at is not null says 13387

select count(*) from location where appointments_walkins_last_updated_at is not null says 13759

That's out of 77,067 not-soft-deleted locations.

I still need to run a back-fill mechanism for locations that haven't had a report or a source location import in the past week.