opencivicdata / scrapers-us-municipal

Scrapers for US municipal governments.

Handle missing histories

fgregg opened this issue

Some LA Metro Board Reports have missing histories, but we still need to show their last action in the councilmatic application.

Currently, this is handled by some complicated, view-like code in the councilmatic app, but it should be handled in the data layer if practicable.

Ideally, we would handle this at the data level, similar to the workaround for minutes in the scraper.

To replicate this behavior in the scraper, we need to be able to query the Legistar API for events where a particular matter appears on the agenda (i.e., in an associated event item).

Some reading, specifically this, has led me to believe something like this should work: http://webapi.legistar.com/v1/metro/events/?$filter=EventItems/any(item:%20item/EventItemMatterId%20eq%206276)

Matter 6276 appears on this agenda, so I know there should be at least one result: http://webapi.legistar.com/v1/metro/events/1603/eventitems
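For reference, here is a minimal sketch of how that filter URL could be assembled without hand-encoding the spaces. It uses only the standard library; the function name events_for_matter_url is hypothetical, and whether the Legistar API actually honors this any() filter is exactly the open question here.

```python
from urllib.parse import quote

# Endpoint taken from the URL quoted above.
BASE_URL = "http://webapi.legistar.com/v1/metro/events/"


def events_for_matter_url(matter_id: int) -> str:
    """Build the events URL with an OData-style any() filter on event items.

    Hypothetical helper: it only constructs the URL, it does not send a request.
    """
    odata_filter = f"EventItems/any(item: item/EventItemMatterId eq {matter_id})"
    # Leave '/', '(', ')' and ':' unescaped so the result matches the URL above;
    # spaces become %20.
    return f"{BASE_URL}?$filter={quote(odata_filter, safe='/():')}"


print(events_for_matter_url(6276))
```

This only reproduces the request; it does not explain why the responses come back empty.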

The requests are going through ok, but the responses are coming back empty. I emailed Metro to see if they have any insight!

If we can't do this querying, I don't know if it's practical to do this in the scraper. Will revisit when it's not 5 p.m.

Another thing is: How often are bills added to Legistar without a history? If we add an artificial history, do we want to clear it when an actual history is added? Or should it be added to the extras dict? (That's probably a better idea.)

If this isn't practical at the scraper level (i.e., we can't query the Legistar API the way we need to), it might be something we can create during the post save hook for bills, when we'll have full access to the database via the ORM.

We're going to proceed on the assumption that the Legistar API won't change to allow the kind of query we'd need to calculate this value in the scraper.

Apart from perhaps occurring at the wrong level of the code base, the big issue with our current approach is that it runs a heavy query every time a bill's last action date is needed, either in the UI or when updating or rebuilding the search index. Caching this value would lead to faster page load time and indexing operations.

I propose replacing the last_action_date property on the Councilmatic Bill model with a last_action_date attribute and populating the attribute during the post-save signal for OCD bills. This calculation would add some overhead to the first scrape into a bare database. However, since we scrape bills at such a high frequency, there are generally fewer than 10 new or updated bills per scrape, which would keep the ongoing overhead small.

ubuntu@ip-10-0-0-80:~$ grep "bill:" /tmp/lametro.log | grep 'noop$'
  bill: 0 new 0 updated 2920 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 2 noop
  bill: 1 new 0 updated 5 noop
  bill: 0 new 0 updated 6 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 8 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 2 noop
ubuntu@ip-10-0-0-80:~$ grep "bill:" /tmp/lametro.log.1 | grep 'noop$'
  bill: 0 new 5 updated 2908 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 6 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 6 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 6 new 0 updated 6 noop
  bill: 0 new 0 updated 11 noop
  bill: 1 new 0 updated 12 noop
  bill: 0 new 0 updated 14 noop
  bill: 0 new 0 updated 12 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 5 updated 2915 noop
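To check the "new or updated" counts in output like the above without eyeballing it, something along these lines could tally them. This is a rough sketch: the function name changed_per_scrape is hypothetical, and the sample is just four of the lines shown above.

```python
import re

# Matches summary lines such as "bill: 1 new 0 updated 5 noop".
LINE_RE = re.compile(r"bill: (\d+) new (\d+) updated (\d+) noop")


def changed_per_scrape(log_text: str) -> list[int]:
    """Return new + updated bill counts, one entry per summary line."""
    return [int(m.group(1)) + int(m.group(2)) for m in LINE_RE.finditer(log_text)]


sample = """\
  bill: 0 new 0 updated 2920 noop
  bill: 1 new 0 updated 5 noop
  bill: 6 new 0 updated 6 noop
  bill: 0 new 5 updated 2908 noop
"""

counts = changed_per_scrape(sample)
print(counts, max(counts))  # [0, 1, 6, 5] 6
```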

On the plus side, re-calculating this attribute every time a bill is saved ensures that bills for which we'd previously spoofed an action date via agendas would be updated appropriately when a history item is added.
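A rough sketch of the calculation I have in mind, written as plain Python rather than ORM code. The function signature is hypothetical, and the rule "real history always wins over agenda dates" is my assumption about the behavior we want, not a description of the current councilmatic code.

```python
from datetime import date
from typing import Iterable, Optional


def last_action_date(
    action_dates: Iterable[date], agenda_dates: Iterable[date]
) -> Optional[date]:
    """Prefer the bill's real history; fall back to agenda appearances.

    Assumption: if any history exists, agenda dates are ignored entirely.
    """
    action_dates = list(action_dates)
    if action_dates:
        return max(action_dates)
    # No history: spoof the date from the latest agenda the bill appears on.
    return max(agenda_dates, default=None)


# No history yet, so the date is spoofed from the agendas.
print(last_action_date([], [date(2018, 1, 18), date(2018, 2, 15)]))
# Once a history item exists, recalculating replaces the spoofed date.
print(last_action_date([date(2018, 3, 1)], [date(2018, 2, 15)]))
```

The dates in the usage lines are made up for illustration.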

Thoughts, @fgregg?

Update: Hm, looks like the signal approach won't work after all. get_last_action_date depends on bill actions being present, but related objects aren't inserted until after the bill is created in pupa's import process. This makes sense, but it's bad news for us: it means we aren't getting the last action date for new bills, and we might be setting the wrong one for existing bills, because we don't yet know about new actions.
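To make the ordering problem concrete, here is a toy model in plain Python. It is not pupa or Django code: the Bill class, on_bill_saved, and import_bill are all hypothetical stand-ins that only mimic "the handler runs when the bill is created, related actions arrive afterward."

```python
from datetime import date


class Bill:
    """Toy stand-in for a bill with related actions (not the real model)."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.actions = []  # action dates, inserted after the bill itself
        self.last_action_date = None


def on_bill_saved(bill):
    """Stand-in for the post-save handler: cache the latest action date."""
    bill.last_action_date = max(bill.actions, default=None)


def import_bill(identifier, action_dates):
    """Mimic the import order: create the bill first, then add related objects."""
    bill = Bill(identifier)
    on_bill_saved(bill)  # fires before any actions exist
    bill.actions.extend(action_dates)  # related objects arrive afterward
    return bill


bill = import_bill("hypothetical-bill", [date(2018, 1, 25)])
print(bill.last_action_date)  # None, even though the bill now has an action
```

Running the handler again after the import finishes would pick up the action, which is why the calculation seems to belong outside the import cycle.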

With this in mind, it seems like we need a few things:

  • Access to the ORM
  • All the data in the database
  • Periodic updates

A signals-based approach gets us the first and third things, but not the second. Setting the attribute as bills are accessed, like we do with packets, might also seem attractive, but it doesn't get us periodic updates.

So it's starting to seem like we need to set this outside of the import cycle, e.g., in a management command, or skip caching and calculate it on the fly. A management command could be ok, but since we aren't running scrapes and downstream ETL in concert, there's still the potential for incomplete data.

Any thoughts, @fgregg?

Related to Metro-Records/la-metro-councilmatic#553, Metro-Records/la-metro-councilmatic#555.