
Data-Con-Scrape

These are the companion materials for my Python web scraping workshop from Boston Data-Con 2014, an excellent conference put on by the good people of the Boston Data Community.

The session was advertised as: "A tutorial on python for web scraping, covering BeautifulSoup, and when and how to use Selenium for dynamic pages and comment loading."

What actually happened was: a lively workshop with great questions, interactive bloopers, and very very little Selenium. Thank you to everyone who stayed to the very end to scrape some data with me. Here are the materials and some extras. ~Enjoy

(video of the session is posted here)

It's like trespassing and organizing your desk. @laurieskelly explains joy of web scraping. #bdc14 #python Scrape http://t.co/CRfkv5sGgk

— Mike Combs (@mike3d0g) September 14, 2014

Contents:

For all notebooks:

  • Links here in the README will take you to the published, non-interactive versions of the notebooks.
  • Clone this repo and run the .ipynb files with ipython notebook to play with them interactively.

The notebooks:

  • Data-Con-metacritic.ipynb: Scraping metacritic.com for information about new releases, and building links to visit for detailed info on each one.

  • Data-Con-metacritic2.ipynb: Digging into the details on the movie links created in the first metacritic scraper.

  • Data-Con_Selenium.ipynb: I tried to find more cool examples for Selenium, ended up finding another workaround, and got totally distracted making a scraper for datatau.

Web Scraping Tips

  • If possible for your data collecting project, use an API instead of scraping. It is kinder to the nice people creating the data you are collecting, more resistant to breaking, and usually more efficient to code. Scraping is "for" cases when APIs are not provided. (A minimal API example follows this list.)

  • If possible for your web scraping project, avoid using Selenium. Scraper code that uses Selenium is more complex to develop and slower to run. If you can get what you need without Selenium, it is usually better (see the requests + BeautifulSoup sketch after this list).

  • When you're poking through a website to scrape it, it's a great idea to open the page in Incognito Mode, so that your active sessions, plug-ins, etc., do not make the content that you see differ systematically from the content "seen" by Requests.

  • Websites change. When they do, scrapers typically break. There are ways to write your selectors or build your scraping logic to be robust to minor changes, but broken scrapers are part of the game. You can't go around them, you can't go under them, so to live through them:

    1. Make your code noisy. Include tests and checks that can detect page changes, and have your scraper notify you when something breaks (see the last sketch after this list).
    2. Save raw html. "Space is cheap," as they say, so saving raw html allows you to retroactively patch the holes in your longitudinal scraping data after you have adjusted to a change in page format.
    3. Relatedly, ugly sites make great scraping targets. If a page looks like it hasn't been updated since 1998, you might infer that it is less likely to be re-styled and re-structured every 3-6 months.
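
To make the API-first tip concrete, here is a minimal sketch using the public GitHub API to look up this very repo. The endpoint is real; the point is the shape of the workflow: one request, structured JSON back, nothing to parse.

```python
import requests

# One call to a documented JSON API: no HTML parsing, and no selectors
# to break the next time the site gets a redesign.
response = requests.get("https://api.github.com/repos/laurieskelly/Data-Con-Scrape")
response.raise_for_status()

repo = response.json()
print(repo["full_name"], "-", repo["language"])
```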
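
For the no-Selenium tip, here is a sketch of the plain Requests + BeautifulSoup workflow. The URL is a placeholder and the User-Agent string is just an example; swap in your real target.

```python
import requests
from bs4 import BeautifulSoup

# Plain Requests + BeautifulSoup -- no browser automation needed for
# content that is present in the page's initial HTML.
# An explicit User-Agent keeps the response from depending on Requests'
# default headers (some sites serve stripped-down pages to unknown bots).
headers = {"User-Agent": "Mozilla/5.0 (data-con-scrape demo)"}
response = requests.get("http://example.com/", headers=headers)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Grab every link on the page -- the usual first step before following
# each one for details, as in the metacritic notebooks.
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```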
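
And a sketch of robustness tips 1 and 2 together; the URL and the div.review selector are hypothetical stand-ins for your real target.

```python
import time

import requests
from bs4 import BeautifulSoup

url = "http://example.com/"  # stand-in for your real scraping target
response = requests.get(url)
response.raise_for_status()

# Tip 2: save the raw html first, so a format change never costs you data.
snapshot = "raw_%s.html" % time.strftime("%Y%m%d-%H%M%S")
with open(snapshot, "w", encoding="utf-8") as f:
    f.write(response.text)

soup = BeautifulSoup(response.text, "html.parser")
reviews = soup.find_all("div", class_="review")  # hypothetical selector

# Tip 1: be noisy -- fail loudly when a selector comes up empty,
# instead of quietly writing out an empty dataset.
if not reviews:
    raise RuntimeError("No div.review elements at %s; the page format may have changed" % url)
```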

Random Tips

  • urlparse.urljoin() (urllib.parse.urljoin() in Python 3) is a handy way to stick parts of a URL together without messing it up and having too many or too few slashes up in there. module docs
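
A quick sketch, using the function's Python 3 home (urllib.parse); the metacritic URLs are just for illustration.

```python
from urllib.parse import urljoin  # in Python 2: from urlparse import urljoin

# urljoin sorts out the slashes however the pieces are written.
print(urljoin("http://www.metacritic.com/browse/movies/", "release-date"))
# -> http://www.metacritic.com/browse/movies/release-date

print(urljoin("http://www.metacritic.com", "/movie/boyhood"))
# -> http://www.metacritic.com/movie/boyhood
```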

Resources

What did I forget?

Remind me on Twitter or make a pull request : )
