hack4impact-calpoly / ecologistics-web-scraper

Home Page: https://ecologistics-web-scraper.vercel.app

Only Scrape unique hearings

oli-lane opened this issue

Right now, scrape_hearings() scrapes every meeting on the calendar, even if it has already been scraped. To fix this, we'll add a check so that scrape_hearings() returns a given meeting only if it has not been scraped before.

In the scrape_hearings() function in utils/slo_county/scrape_hearings.py, modify the block of code on lines 27-47. Instead of printing the unique id (as the code does right now), check whether it is already in the hearings collection in the database. If it is not, run the upcoming_hearings.append( {"link": meeting_link, "date": date_string} ) code and add the id to the database so the hearing won't be scraped again in the future. If it is already in the database, continue on to the next meeting link.

Note: there is some code commented out in scrape_hearings() that may be useful for this task. There's also an example endpoint that accesses the mongo instance in api/app.py.

Requirements:

  • scrape_hearings.py only returns the links to hearings that have not been scraped before
  • scrape_hearings.py updates the hearingsScraped collection with the unique ids of the hearings it returns
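
The check described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: filter_unscraped, InMemoryCollection, and the "id" field name are all hypothetical, and the sketch assumes each meeting has already been parsed into a (unique_id, meeting_link, date_string) tuple. The stand-in class mimics only the find_one and insert_one methods of a pymongo collection so the example runs without a database.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


class InMemoryCollection:
    """Hypothetical stand-in with the two pymongo Collection methods used below."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, Any]] = []

    def find_one(self, query: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        # Return the first stored document whose fields match the query.
        for doc in self._docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc: Dict[str, Any]) -> None:
        self._docs.append(dict(doc))


def filter_unscraped(
    meetings: Iterable[Tuple[str, str, str]], scraped: Any
) -> List[Dict[str, str]]:
    """Return only meetings whose unique id is not yet recorded in `scraped`.

    The "id" field name is an assumption; use whatever key the real
    collection stores the unique id under.
    """
    upcoming_hearings: List[Dict[str, str]] = []
    for unique_id, meeting_link, date_string in meetings:
        # Already scraped: continue on to the next meeting link.
        if scraped.find_one({"id": unique_id}) is not None:
            continue
        upcoming_hearings.append({"link": meeting_link, "date": date_string})
        # Record the id so this hearing is skipped on future runs.
        scraped.insert_one({"id": unique_id})
    return upcoming_hearings
```

With a real MongoDB connection, the pymongo collection object (for example db["hearingsScraped"], assuming db is the database handle used in api/app.py) could be passed in place of InMemoryCollection, since it provides the same two methods.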