nyc scrapes malformed web links
hancush opened this issue
there are nearly 2k bills in the ocd api with web links formed like "http://legistar.council.nyc.govhttps://legistar.council.nyc.gov/gateway.aspx".
opencivicdata=# select count(*) from opencivicdata_billsource where url like 'http://legistar.council.nyc.govhttps://legistar.council.nyc.gov/gateway.aspx%';
count
-------
1888
(1 row)
these are not valid urls. looks like they started rolling in in sept. 2018.
opencivicdata=# select min(created_at) from opencivicdata_bill as b join opencivicdata_billsource as bs on b.id = bs.bill_id where url like 'http://legistar.council.nyc.govhttps://legistar.council.nyc.gov/gateway.aspx%';
min
------------------------------
2015-09-24 15:05:16.53366+00
(1 row)
looks like this comes from the legislation_detail_url method on the LegistarAPIBillScraper upstream in python-legistar.
specifically, the Location header in the response to the head request we make already contains the full url, so it is not necessary to prepend the base url. this is true of new bills (i.e., those with malformed links) as well as old ones (i.e., those that predate the malformed links). this also seems unique to new york.
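for illustration, here is a minimal sketch of how the malformed link arises, assuming the Location header holds an absolute url. the header value and the matter id 12345 are made-up placeholders, not real response data, and naive_detail_url is a hypothetical stand-in for the upstream concatenation:

```python
BASE_WEB_URL = 'http://legistar.council.nyc.gov'

# Hypothetical Location header value: already an absolute URL.
location = 'https://legistar.council.nyc.gov/gateway.aspx?m=l&id=12345'


def naive_detail_url(base, route):
    # Stand-in for the upstream behaviour: always prepend the base URL.
    return base + route


malformed = naive_detail_url(BASE_WEB_URL, location)
print(malformed)
```

the result starts with the base url glued directly onto the absolute url, which matches the pattern in the query above.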
overriding the entire method on the nyc bill scraper would repeat a lot of code from upstream.
@fgregg – would you be amenable to an approach like adding a _format_legislation_detail_url method to the upstream object for overriding here?
def _format_legislation_detail_url(self, route):
    return self.BASE_WEB_URL + route

def legislation_detail_url(self, matter_id):
    gateway_url = self.BASE_WEB_URL + '/gateway.aspx?m=l&id={0}'
    legislation_detail_route = requests.head(
        gateway_url.format(matter_id)).headers['Location']
    return self._format_legislation_detail_url(legislation_detail_route)
my feeling is that if it happens for one city it's probably going to happen in another city.
What about this solution: https://stackoverflow.com/a/8357262/98080
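A rough sketch of that idea, assuming the linked answer means joining with the standard library's urljoin rather than plain concatenation. The helper name format_detail_url, the matter id, and both route values are hypothetical:

```python
from urllib.parse import urljoin

BASE_WEB_URL = 'http://legistar.council.nyc.gov'


def format_detail_url(base, route):
    # urljoin returns `route` unchanged when it is already an absolute URL
    # and resolves it against `base` when it is a relative path.
    return urljoin(base, route)


# Hypothetical Location values: one absolute, one relative.
absolute = 'https://legistar.council.nyc.gov/gateway.aspx?m=l&id=12345'
relative = '/LegislationDetail.aspx?ID=12345'

print(format_detail_url(BASE_WEB_URL, absolute))
print(format_detail_url(BASE_WEB_URL, relative))
```

This would cover both cases without a per-city override: cities that return a relative route still get the base url prepended, and cities that return a full url get it back untouched.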
makes sense to me!