nyc scrapes malformed web links

hancush opened this issue 6 years ago

hannah cushman garland commented 6 years ago

there are nearly 2k bills in the ocd api with web links formed like "http://legistar.council.nyc.govhttps://legistar.council.nyc.gov/gateway.aspx".

opencivicdata=# select count(*) from opencivicdata_billsource where url like 'http://legistar.council.nyc.govhttps://legistar.council.nyc.gov/gateway.aspx%';
 count
-------
  1888
(1 row)

these are not valid urls. looks like they started rolling in in sept. 2018.

opencivicdata=# select min(created_at) from opencivicdata_bill as b join opencivicdata_billsource as bs on b.id = bs.bill_id where url like 'http://legistar.council.nyc.govhttps://legistar.council.nyc.gov/gateway.aspx%';
             min
------------------------------
 2018-09-24 15:05:16.53366+00
(1 row)

hannah cushman garland · Answer 1 · Sat Jan 12 2019 05:46:25 GMT+0800 (China Standard Time)

looks like this comes from the legislation_detail_url on the LegistarAPIBillScraper upstream in python-legistar.

specifically, the Location header in the response to the request we make already contains the full url, so it is not necessary to prepend the base url. this is true of new bills (i.e., those with malformed links) as well as old ones (i.e., those that predate the malformed links). this also seems unique to new york.
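
to illustrate the failure mode, here is a minimal sketch of what unconditional prepending does. the base url is inferred from the prefix of the malformed links, and the matter id and the relative route are made up for illustration; no request is made.

```python
# Assumption: base url inferred from the prefix of the malformed links.
BASE_WEB_URL = 'http://legistar.council.nyc.gov'


def naive_detail_url(location):
    """Prepend the base url unconditionally to a Location value."""
    return BASE_WEB_URL + location


# Hypothetical Location values; the matter id 123 is made up.
absolute_location = 'https://legistar.council.nyc.gov/gateway.aspx?m=l&id=123'
relative_location = '/LegislationDetail.aspx?ID=123'

# A relative route concatenates cleanly...
print(naive_detail_url(relative_location))
# ...but a full url yields the doubled-host link seen in the database.
print(naive_detail_url(absolute_location))
```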

overriding the entire method on the nyc bill scraper would repeat a lot of code from upstream.

@fgregg – would you be amenable to an approach like adding a _format_legislation_detail_url method to the upstream object for overriding here?

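def _format_legislation_detail_url(self, route):
    # default: treat the Location value as a route relative to the base url.
    # scrapers whose Location is already a full url (e.g., nyc) can override this.
    return self.BASE_WEB_URL + route

def legislation_detail_url(self, matter_id):
    gateway_url = self.BASE_WEB_URL + '/gateway.aspx?m=l&id={0}'

    # the gateway redirects to the detail page; read the target from the Location header
    legislation_detail_route = requests.head(
        gateway_url.format(matter_id)).headers['Location']

    return self._format_legislation_detail_url(legislation_detail_route)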
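
with a hook like that in place, the nyc override could be as small as the sketch below. both class names are hypothetical stand-ins (the upstream stand-in only mimics the proposed hook, not the real LegistarAPIBillScraper), and the base urls are assumptions.

```python
class UpstreamBillScraperSketch:
    """Hypothetical stand-in that only mimics the proposed hook."""
    BASE_WEB_URL = 'https://example.legistar.com'  # assumed placeholder

    def _format_legislation_detail_url(self, route):
        # Default behavior: the Location value is a route relative to the base url.
        return self.BASE_WEB_URL + route


class NYCBillScraperSketch(UpstreamBillScraperSketch):
    """Hypothetical nyc subclass that overrides only the formatting hook."""
    BASE_WEB_URL = 'http://legistar.council.nyc.gov'  # inferred from the malformed links

    def _format_legislation_detail_url(self, route):
        # The Location value is already a full url, so return it unchanged.
        return route
```

this keeps the request logic upstream and limits the city-specific code to a single method.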

Forest Gregg · Answer 2 · Sat Jan 12 2019 05:52:42 GMT+0800 (China Standard Time)

my feeling is that if it happens for one city it's probably going to happen in another city.

What about this solution: https://stackoverflow.com/a/8357262/98080
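
for a fix that does not depend on the city, one standard-library option is urllib.parse.urljoin, which resolves a relative route against a base and leaves an absolute url alone. this is a sketch under that assumption, not necessarily what the linked answer recommends; the base url is inferred from the malformed links and the Location values are made up.

```python
from urllib.parse import urljoin

# Assumption: base url inferred from the prefix of the malformed links.
BASE_WEB_URL = 'http://legistar.council.nyc.gov'


def format_detail_url(location):
    """Resolve a Location value against the base url.

    A relative route is joined onto the base; a value that already
    carries its own scheme and host is returned as-is.
    """
    return urljoin(BASE_WEB_URL, location)


# Hypothetical Location values; the matter id 123 is made up.
print(format_detail_url('/LegislationDetail.aspx?ID=123'))
print(format_detail_url('https://legistar.council.nyc.gov/gateway.aspx?m=l&id=123'))
```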

hannah cushman garland · Answer 3 · Sat Jan 12 2019 05:53:51 GMT+0800 (China Standard Time)

makes sense to me!