A Primer to Web Scraping with R
Summary
The internet offers a wealth of opportunities to learn about public opinion and social behavior. Data from social networks, search engines, or web services open avenues for new ways of measuring human behavior and preferences at previously unknown velocity and variety. Fortunately, the open source programming language R provides advanced functionality to gather data from virtually any imaginable data source on the Web, whether via classical screen scraping approaches, automated browsing, or by tapping APIs. This allows researchers to stay in one programming environment throughout data collection, tidying, analysis, and publication. The talk gives an overview of the web technologies that are fundamental to gathering data from internet resources. Further, we will learn about state-of-the-art tools and packages for web scraping with R. Finally, we will discuss subtleties of the web scraping workflow, such as how to ensure reproducibility and how to stay friendly on the web.
Event
AAPOR Webinar
Date and Venue
Wednesday, April 12, 2017, 12:00 - 1:30 PM CDT. Register here.
Instructor
Simon Munzert (website, Twitter)
About this repository
This repository provides supplementary materials to the talk. (Almost) all examples introduced on the slides can be reproduced using the R code documented here.
Accompanying book
Together with Christian Rubba, Peter Meissner, and Dominic Nyhuis, I've written a book on Automated Data Collection with R. Participants might find it useful to consult it as further reading after the webinar.
Technical setup to get the R code to run
Please make sure that the current version of R is installed. If not, update from here: https://cran.r-project.org/
Obviously, feel free to choose the coding environment you feel most comfortable with. I'll use RStudio in the course. You might want to use it, too: https://www.rstudio.com/products/rstudio/download/
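A quick way to check the setup from within R is sketched below. The minimum version "3.3.0" is only a placeholder assumption, not a requirement stated for the course, and stringr is checked because the breweries example in this README calls `str_extract`; other packages used in the course code are not listed here.

```r
# Show which R version is currently installed.
print(R.version.string)

# Compare against a minimum version. "3.3.0" is a placeholder assumption;
# substitute whatever the current release on CRAN is.
r_ok <- getRversion() >= "3.3.0"

# Check whether stringr is available without attaching it.
has_stringr <- requireNamespace("stringr", quietly = TRUE)

if (!r_ok) {
  message("Consider updating R from https://cran.r-project.org/")
}
if (!has_stringr) {
  message('stringr is missing; install it with install.packages("stringr")')
}
```

If both checks pass silently, the basic setup should be in place.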
Erratum for Windows users
The breweries example can fail on Windows machines because of an encoding issue with the en dash (U+2013) in the regular expression. Stas Kolenikov suggested the following fix:
Replace this line
locations <- str_extract(breweries, "[[:digit:]].+?–")
with the following:
locations <- str_extract(gsub(intToUtf8(0x2013), "-", breweries), "[[:digit:]].+?-")
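To see what the replacement does, here is a minimal sketch that applies it to a made-up string. The brewery entry is hypothetical and not taken from the course data. The en dash is built from its code point so that the script itself contains only ASCII characters, which is the idea behind the fix.

```r
library(stringr)

# Build the en dash (U+2013) from its code point instead of typing it.
en_dash <- intToUtf8(0x2013)

# Hypothetical brewery entry, for illustration only.
breweries <- paste0("Example Brewery 123 Main Street, Springfield ",
                    en_dash, " 555-0100")

# Swap every en dash for a plain hyphen, then extract lazily from the
# first digit up to and including the first hyphen.
locations <- str_extract(gsub(en_dash, "-", breweries), "[[:digit:]].+?-")
print(locations)
```

In this sketch the match starts at the street number and stops at the hyphen that replaced the en dash, so the trailing phone number is not included.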
Online resources
| Area | URL | Short description |
|---|---|---|
| Web technologies, general | http://www.w3.org/ | Home of the World Wide Web Consortium (W3C); also provides access to standards and drafts of web technologies |
| | http://w3schools.com | Great tutorial playground to learn web technologies interactively |
| | https://w3techs.com/technologies | Overview of all kinds of web technologies |
| XML and XPath | http://selectorgadget.com/ | Probably the most useful tool for generating CSS selectors and XPath expressions with a simple point-and-click approach |
| | http://www.xmlvalidation.com/ | Online XML validator |
| | http://www.rssboard.org/ | Information about the Really Simple Syndication standard |
| CSS selectors | http://www.w3schools.com/cssref/css_selectors.asp | W3Schools CSS reference |
| | http://flukeout.github.io/ | Interactive CSS selectors tutorial |
| JSON | http://www.json.org/ | Home of the JSON data interchange standard |
| | http://jsonformatter.curiousconcept.com | Formatting tool for JSON content |
| HTTP | http://httpbin.org | HTTP Request and Response Service; useful to debug HTTP queries |
| | http://useragentstring.com | Tool to figure out what's behind a User-agent string |
| | http://curl.haxx.se/libcurl/ | Documentation of the libcurl library |
| | http://www.robotstxt.org/ | Information about robots.txt |
| OAuth | http://oauth.net | Information about the OAuth authorization standard |
| | http://hueniverse.com/oauth | Great overview of OAuth 1.0 |
| Database technologies | http://db-engines.com | Compendium of existing database management systems |
| | https://www.thoughtworks.com/insights/blog/nosql-databases-overview | Intro to NoSQL databases |
| Regular expressions | http://www.pcre.org/ | Description of Perl Compatible Regular Expressions |
| | https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html | Regular Expressions as used in base R |
| | http://regexone.com/ | Online regex tutorial |
| | http://regex101.com | Regex testing environment |
| | http://www.regexplanet.com/ | Another regex testing environment |
| | http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 | The truth about HTML parsing with regular expressions |
| | https://www.youtube.com/watch?v=Cv2DpwSCgRw | Yes, there's a regex song |
| Selenium | http://docs.seleniumhq.org | Selenium documentation |
| APIs | http://www.programmableweb.com/apis | Overview of many existing web APIs |
| | http://ropensci.org/ | Platform for R packages that provide access to science data repositories |
| R | http://cran.r-project.org/web/views/WebTechnologies.html | CRAN Task View on Web Technologies and Services; useful to stay in the loop of what's possible with R |
| | http://tryr.codeschool.com/ | An excellent interactive primer for learning R |
| | http://www.r-bloggers.com/ | Blog aggregator which collects entries from many R-related blogs |
| | http://planetr.stderr.org | Blog aggregator providing information about new R packages and scientific work related to R |
| | http://dirk.eddelbuettel.com/cranberries/ | Dirk Eddelbuettel's CRANberries blog keeps you up-to-date on new and updated R packages |
| | http://www.omegahat.org/ | Home of the "Omega Project for Statistical Computing"; documentation of many important R packages dealing with web-based data |
| | https://github.com/ropensci/user2016-tutorial#extracting-data-from-the-web-apis-and-beyond | Web API tutorial from the useR 2016 conference by Scott Chamberlain, Karthik Ram, and Garrett Grolemund |
| General web scraping | http://r-datacollection.com | Probably the most useful resource of all |
| | http://www.stata-datacollection.com | Now let's see if that works… |
| Legal issues | http://www.eff.org/ | Electronic Frontier Foundation, a non-profit organisation which advocates digital rights |
| | http://blawgsearch.justia.com/ | Search engine for law blogs; useful if you want to stay informed about recent court decisions on digital issues |
| | http://en.wikipedia.org/wiki/Web_scraping | See the section on "Legal issues" |