
Data-Con-Scrape

These are the companion materials for my Python web scraping workshop from Boston Data-Con 2014, an excellent conference put on by the good people of the Boston Data Community.

The session was advertised as: "A tutorial on python for web scraping, covering BeautifulSoup, and when and how to use Selenium for dynamic pages and comment loading."

What actually happened was: a lively workshop with great questions, interactive bloopers, and very very little Selenium. Thank you to everyone who stayed to the very end to scrape some data with me. Here are the materials and some extras. ~Enjoy

(video of the session is posted here)

It's like trespassing and organizing your desk. @laurieskelly explains joy of web scraping. #bdc14 #python Scrape http://t.co/CRfkv5sGgk

— Mike Combs (@mike3d0g) September 14, 2014

Contents:

For all notebooks:

  • Links here in the README will take you to the published, non-interactive versions of the notebooks.
  • Clone this repo and run the .ipynb files with ipython notebook to play with them interactively.

The notebooks:

  • Data-Con-metacritic.ipynb: Scraping metacritic.com for information about new releases, and building links to visit for detailed info on each one.

  • Data-Con-metacritic2.ipynb: Digging into the details on the movie links created in the first metacritic scraper.

  • Data-Con_Selenium.ipynb: I tried to find more cool examples for Selenium, ended up finding another workaround, and got totally distracted making a scraper for datatau.

Web Scraping Tips

  • If possible for your data collecting project, use an API instead of scraping. It is kinder to the nice people creating the data you are collecting, more resistant to breaking, and usually more efficient to code. Scraping is "for" cases when APIs are not provided. (A minimal API example follows this list.)

  • If possible for your web scraping project, avoid using Selenium. Scraper code that uses Selenium is more complex to develop and slower to run. If you can get what you need without Selenium, it is usually better (see the requests + BeautifulSoup sketch after this list).

  • When you're poking through a website to scrape it, it's a great idea to open the page in Incognito Mode, so that your active sessions, plug-ins, etc., do not make the content that you see differ systematically from the content "seen" by Requests.

  • Websites change. When they do, scrapers typically break. There are ways to write your selectors or build your scraping logic to be robust to minor changes, but broken scrapers are part of the game. You can't go around them, you can't go under them, so to live through them:

    1. Make your code noisy. Include tests and checks that can detect page changes, and have your scraper notify you when something breaks (see the last sketch after this list).
    2. Save raw html. "Space is cheap," as they say, so saving raw html allows you to retroactively patch the holes in your longitudinal scraping data after you have adjusted to a change in page format.
    3. Relatedly, ugly sites make great scraping targets. If a page looks like it hasn't been updated since 1998, you might infer that it is less likely to be re-styled and re-structured every 3-6 months.
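
To make the API-first tip concrete, here is a minimal sketch using the public GitHub API to look up this very repo. The endpoint is real; the point is the shape of the workflow: one request, structured JSON back, nothing to parse.

```python
import requests

# One call to a documented JSON API: no HTML parsing, and no selectors
# to break the next time the site gets a redesign.
response = requests.get("https://api.github.com/repos/laurieskelly/Data-Con-Scrape")
response.raise_for_status()

repo = response.json()
print(repo["full_name"], "-", repo["language"])
```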
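
For the no-Selenium tip, here is a sketch of the plain Requests + BeautifulSoup workflow. The URL is a placeholder and the User-Agent string is just an example; swap in your real target.

```python
import requests
from bs4 import BeautifulSoup

# Plain Requests + BeautifulSoup -- no browser automation needed for
# content that is present in the page's initial HTML.
# An explicit User-Agent keeps the response from depending on Requests'
# default headers (some sites serve stripped-down pages to unknown bots).
headers = {"User-Agent": "Mozilla/5.0 (data-con-scrape demo)"}
response = requests.get("http://example.com/", headers=headers)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Grab every link on the page -- the usual first step before following
# each one for details, as in the metacritic notebooks.
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```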
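
And a sketch of robustness tips 1 and 2 together; the URL and the div.review selector are hypothetical stand-ins for your real target.

```python
import time

import requests
from bs4 import BeautifulSoup

url = "http://example.com/"  # stand-in for your real scraping target
response = requests.get(url)
response.raise_for_status()

# Tip 2: save the raw html first, so a format change never costs you data.
snapshot = "raw_%s.html" % time.strftime("%Y%m%d-%H%M%S")
with open(snapshot, "w", encoding="utf-8") as f:
    f.write(response.text)

soup = BeautifulSoup(response.text, "html.parser")
reviews = soup.find_all("div", class_="review")  # hypothetical selector

# Tip 1: be noisy -- fail loudly when a selector comes up empty,
# instead of quietly writing out an empty dataset.
if not reviews:
    raise RuntimeError("No div.review elements at %s; the page format may have changed" % url)
```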

Random Tips

  • urlparse.urljoin() (urllib.parse.urljoin() in Python 3) is a handy way to stick parts of a URL together without messing it up and having too many or too few slashes up in there. module docs
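
A quick sketch, using the function's Python 3 home (urllib.parse); the metacritic URLs are just for illustration.

```python
from urllib.parse import urljoin  # in Python 2: from urlparse import urljoin

# urljoin sorts out the slashes however the pieces are written.
print(urljoin("http://www.metacritic.com/browse/movies/", "release-date"))
# -> http://www.metacritic.com/browse/movies/release-date

print(urljoin("http://www.metacritic.com", "/movie/boyhood"))
# -> http://www.metacritic.com/movie/boyhood
```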

Resources

What did I forget?

Remind me on Twitter or make a pull request : )
