opencivicdata / scrapers-us-municipal

Scrapers for US municipal governments.

nyc scrapes malformed web links

hancush opened this issue

there are nearly 2k bills in the ocd api with web links formed like "http://legistar.council.nyc.govhttps://legistar.council.nyc.gov/gateway.aspx".

opencivicdata=# select count(*) from opencivicdata_billsource where url like 'http://legistar.council.nyc.govhttps://legistar.council.nyc.gov/gateway.aspx%';
 count
-------
  1888
(1 row)

these are not valid urls. looks like they started rolling in in sept. 2015.

opencivicdata=# select min(created_at) from opencivicdata_bill as b join opencivicdata_billsource as bs on b.id = bs.bill_id where url like 'http://legistar.council.nyc.govhttps://legistar.council.nyc.gov/gateway.aspx%';
             min
------------------------------
 2015-09-24 15:05:16.53366+00
(1 row)

looks like this comes from the legislation_detail_url on the LegistarAPIBillScraper upstream in python-legistar.

specifically, the Location header of the response we check already contains the full url, so it is not necessary to prepend the base url. this is true of new bills (i.e., those with malformed links) as well as old ones (i.e., those that predate the malformed links). this also seems unique to new york.
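To make the failure mode concrete, here is a minimal sketch of what happens when a base url is prepended to a Location value that is already absolute. The function name `naive_detail_url` and the example paths are hypothetical, chosen only to mirror the shape of the urls in the query above; this is not the upstream code.

```python
# Base url as it appears at the front of the malformed links.
BASE_WEB_URL = "http://legistar.council.nyc.gov"


def naive_detail_url(location):
    # Always prepend the base url, regardless of what Location holds.
    return BASE_WEB_URL + location


# Hypothetical relative Location value: concatenation gives a valid url.
relative = "/LegislationDetail.aspx?ID=1"

# Absolute Location value, shaped like the NYC links: concatenation
# glues two full urls together.
absolute = "https://legistar.council.nyc.gov/gateway.aspx?m=l&id=1"

print(naive_detail_url(relative))
print(naive_detail_url(absolute))
```

The second call yields a string that starts with "http://legistar.council.nyc.govhttps://", which is the pattern matched by the LIKE clause above.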

overriding the entire method on the nyc bill scraper would repeat a lot of code from upstream.

@fgregg – would you be amenable to an approach like adding a _format_legislation_detail_url method to the upstream object for overriding here?

def _format_legislation_detail_url(self, route):
    return self.BASE_WEB_URL + route

def legislation_detail_url(self, matter_id):
    gateway_url = self.BASE_WEB_URL + '/gateway.aspx?m=l&id={0}'

    legislation_detail_route = requests.head(
        gateway_url.format(matter_id)).headers['Location']

    return self._format_legislation_detail_url(legislation_detail_route)
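As a rough sketch of how the nyc scraper could then use that hook: both classes below are hypothetical stand-ins (the real upstream class has far more to it, and the base class url is a placeholder), and only the proposed `_format_legislation_detail_url` method is modeled.

```python
class UpstreamBillScraperSketch:
    """Hypothetical stand-in for the upstream scraper; only the hook is modeled."""

    BASE_WEB_URL = "http://example.invalid"  # placeholder, not a real Legistar site

    def _format_legislation_detail_url(self, route):
        # Default behavior: treat the Location value as a path on the base url.
        return self.BASE_WEB_URL + route


class NYCBillScraperSketch(UpstreamBillScraperSketch):
    """Hypothetical nyc subclass overriding only the formatting step."""

    BASE_WEB_URL = "http://legistar.council.nyc.gov"

    def _format_legislation_detail_url(self, route):
        # Assumes NYC's Location value is already a full url; return it as-is.
        return route


nyc = NYCBillScraperSketch()
location = "https://legistar.council.nyc.gov/gateway.aspx?m=l&id=1"
print(nyc._format_legislation_detail_url(location))
```

This keeps the HEAD request logic in one place upstream, so the subclass only carries the one-line difference.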

my feeling is that if it happens for one city it's probably going to happen in another city.

What about this solution: https://stackoverflow.com/a/8357262/98080
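One generic approach that needs no per-city override is to check whether the Location value is already absolute before prepending anything. The sketch below assumes that idea and is not a quotation of the linked answer; `format_detail_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def format_detail_url(base_web_url, location):
    # A non-empty netloc means the value already carries a host,
    # i.e. it is an absolute url and should be returned unchanged.
    if urlparse(location).netloc:
        return location
    # Otherwise treat it as a path relative to the site's base url.
    return base_web_url + location


base = "http://legistar.council.nyc.gov"
print(format_detail_url(base, "/LegislationDetail.aspx?ID=1"))
print(format_detail_url(base, "https://legistar.council.nyc.gov/gateway.aspx?m=l&id=1"))
```

The standard library's `urllib.parse.urljoin(base, location)` is another option, since it also returns an absolute `location` unchanged and resolves a relative one against `base`.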

makes sense to me!