BrianDiggs / webservices

CRAN Task View for interacting with data on the web via web services, and parsing data from the web

Home Page: http://cran.r-project.org/web/views/WebTechnologies.html


CRAN Task View: Web Technologies and Services


Maintainer: Scott Chamberlain, Karthik Ram, Christopher Gandrud, Patrick Mair
Contact: scott at ropensci.org
Version: 2014-04-15


This task view contains information about using R to obtain and parse data from the web. Base R ships with few tools for interacting with the web, but a growing number of contributed packages fill that gap. The available packages and functions are listed below, grouped by type of activity. If you have comments or suggestions for additions or improvements to this task view, go to Github and submit an issue, or make some changes and submit a pull request. If you can't contribute on Github, send Scott an email. If you have an issue with one of the packages discussed below, please contact the maintainer of that package.

Tools for Working with the Web from R

Parsing Data from the Web

  • txt, csv, etc.: Use read.csv() after fetching the file from the web with, e.g., getURL() from RCurl. read.csv() can read directly from http URLs but not from https URLs, i.e., read.csv("http://...") works but read.csv("https://...") does not.
  • The repmis package contains a source_data() command to load plain-text data from a URL (either http or https).
  • The package XML contains functions for parsing XML and HTML, and supports xpath for searching XML (think regex for strings). A helpful function to read data from one or more HTML tables is readHTMLTable().
  • XML2R: The XML2R package is a collection of convenient functions for coercing XML into data frames. The development version is on GitHub here.
  • An alternative to writing XPath by hand is selectr, which parses CSS3 selectors and translates them to XPath 1.0 expressions. The XML package is often used for parsing XML and HTML; because selectr translates CSS selectors to XPath, you can use CSS selectors instead of XPath. The selectorgadget browser extension can be used to identify page elements.
  • The rjson package converts R objects into JavaScript Object Notation (JSON) objects and vice versa.
  • An alternative to rjson is RJSONIO, which also converts to and from data in JSON format (it is fast for parsing).
  • An alternative to rjson and RJSONIO is jsonlite, a fork of RJSONIO. It includes the parser from RJSONIO, but implements a different mapping between R objects and JSON strings.
  • Custom formats: Some web APIs provide custom data formats which are usually modified xml or json, and handled by XML and rjson or RJSONIO, respectively.
  • The RHTMLForms package reads HTML documents and returns a description of each form they contain, along with the forms' elements and hidden fields.
  • scrapeR provides additional tools for scraping data from HTML and XML documents.
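
As a minimal sketch of the parsing tools listed above, the following reads an HTML table with XML and round-trips the result through JSON with jsonlite. The inline HTML string and the table id "species" are made-up stand-ins for a downloaded page; in practice the text would come from getURL() or a similar call.

```r
library(XML)       # HTML/XML parsing and XPath queries
library(jsonlite)  # conversion between JSON and R objects

# Hypothetical page content, inlined so the example runs offline.
html <- '<html><body>
<table id="species">
  <tr><th>name</th><th>count</th></tr>
  <tr><td>robin</td><td>3</td></tr>
  <tr><td>wren</td><td>5</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# readHTMLTable() returns one data frame per <table> in the document.
tables  <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
species <- tables[[1]]
species$count <- as.integer(species$count)

# XPath query: text of the first cell in every data row of the table.
first_cells <- xpathSApply(doc, "//table[@id='species']//tr/td[1]", xmlValue)

# Convert the data frame to JSON text and back again.
json <- toJSON(species)
back <- fromJSON(json)
```

Here readHTMLTable() handles the common case of tabular data, while xpathSApply() shows how XPath can pull out arbitrary nodes when the data are not in a table.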

Curl, HTTP, FTP, HTML, XML, SOAP

  • RCurl: A low-level curl wrapper that allows one to compose general HTTP requests and provides convenient functions to fetch URIs, get/post forms, etc. and process the results returned by the Web server. This provides a great deal of control over the HTTP/FTP connection and the form of the request while providing a higher-level interface than is available just using R socket connections. It also provides tools for Web authentication.
  • httr: A light wrapper around RCurl that makes many things easier while still giving access to the lower-level functionality of RCurl. It has convenient HTTP verbs: GET(), POST(), PUT(), DELETE(), PATCH(), HEAD(), BROWSE(). These wrapper functions are more convenient to use, though less configurable, than their counterparts in RCurl. The equivalent of httr's GET() in RCurl is getForm(); likewise, the equivalent of httr's POST() in RCurl is postForm(). HTTP status codes are helpful for debugging HTTP calls, and httr makes them easy to use: for example, stop_for_status() gets the HTTP status code from a response object and stops the function if the call was not successful. See also warn_for_status(). Note that you can pass additional curl options to the config parameter in HTTP calls.
  • The XMLRPC package provides an implementation of XML-RPC, a relatively simple remote procedure call mechanism that uses HTTP and XML. This can be used for communicating between processes on a single machine or for accessing Web services from within R.
  • The XMLSchema package provides facilities in R for reading XML schema documents and processing them to create definitions for R classes and functions for converting XML nodes to instances of those classes. It provides the framework for meta-computing with XML schema in R.
  • RTidyHTML interfaces to the libtidy library for correcting HTML documents that are not well-formed. This library corrects common errors in HTML documents.
  • SSOAP provides a client-side SOAP (Simple Object Access Protocol) mechanism. It aims to provide a high-level interface to invoke SOAP methods provided by a SOAP server.
  • Rcompression: Interface to zlib and bzip2 libraries for performing in-memory compression and decompression in R. This is useful when receiving or sending contents to remote servers, e.g. Web services, HTTP requests via RCurl.
  • The CGIwithR package allows one to use R scripts as CGI programs for generating dynamic Web content. HTML forms and other mechanisms to submit dynamic requests can be used to provide input to R scripts via the Web to create content that is determined within that R script.
  • httpRequest: HTTP Request protocols. Implements the GET, POST and multipart POST request.
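
The httr workflow described above might look roughly like the sketch below. The endpoint http://api.example.org/v1/records and its query parameters are hypothetical placeholders, so the function that performs the request is defined but not called; only the URL building and the CSV parsing step run offline.

```r
library(httr)

# Hypothetical endpoint; substitute the base URL of a real service.
base_url <- "http://api.example.org/v1/records"

# modify_url() adds query parameters without any network access.
request_url <- modify_url(base_url, query = list(q = "robin", limit = 10))

# Parse CSV text that has already been downloaded.
parse_csv_text <- function(txt) {
  read.csv(text = txt, stringsAsFactors = FALSE)
}

# Fetch a CSV file over http or https and return it as a data frame.
# stop_for_status() raises an R error if the server reports a failure.
read_csv_over_http <- function(url) {
  resp <- GET(url, config = timeout(10))  # extra curl options go in config
  stop_for_status(resp)
  parse_csv_text(content(resp, as = "text", encoding = "UTF-8"))
}

# Offline check of the parsing step with inline text.
example <- parse_csv_text("a,b\n1,2\n3,4")
```

Fetching the body as text and handing it to read.csv(text = ...) is one way around the https limitation of read.csv() noted earlier in this task view.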

Authentication

  • Using web resources can require authentication, either via API keys, OAuth, a username:password combination, or other means. Additionally, some web resources require that authentication details be sent in the header of an HTTP call, which takes a little extra work. API keys and username:password combinations can be included in the URL of a call to a web resource (API key: http://api.foo.org/?key=yourkey; user/pass: http://username:password@api.foo.org), or can be specified via commands in RCurl or httr. OAuth is the most complicated authentication process, and is most easily done using httr. See the six demos within httr: three for OAuth 1.0 (linkedin, twitter, vimeo) and three for OAuth 2.0 (facebook, github, google). ROAuth is a package that provides a separate R interface to OAuth, but OAuth is easier to do in httr, so start there.
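
A minimal sketch of the simpler authentication styles in httr follows, using the placeholder host api.foo.org from the text above. The key, username, password, and the "Bearer" header scheme are illustrative assumptions rather than the requirements of any particular service, and the request function is defined but not called.

```r
library(httr)

# Placeholder host taken from the example URLs above.
api_base <- "http://api.foo.org/"

# 1. API key passed as a query parameter in the URL.
key_url <- modify_url(api_base, query = list(key = "yourkey"))

# 2. Username and password sent as HTTP basic authentication.
basic_auth <- authenticate("username", "password")

# 3. Credentials sent in a request header (hypothetical token scheme).
token_header <- add_headers(Authorization = "Bearer yourtoken")

# Perform a GET request with one of the config objects above.
call_with_auth <- function(url, auth) {
  resp <- GET(url, auth)
  stop_for_status(resp)
  resp
}

# Usage, once real credentials and a real host are filled in:
# resp <- call_with_auth(api_base, basic_auth)
```

Passing credentials as config objects keeps them out of the URL, which avoids leaving them in logs and browser histories.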

Web Frameworks

  • The shiny package makes it easy to build interactive web applications with R.
  • The Rook web server interface contains the specification and convenience software for building and running Rook applications.
  • The opencpu framework for embedded statistical computation and reproducible research exposes a web API interfacing R, LaTeX and Pandoc. This API is used for example to integrate statistical functionality into systems, share and execute scripts or reports on centralized servers, and build R based apps.
  • A package by Yihui Xie called servr provides a simple HTTP server to serve files under a given directory based on the httpuv package.
  • The httpuv package, made by Joe Cheng at RStudio, provides low-level socket and protocol support for handling HTTP and WebSocket requests directly within R. A related package, which httpuv perhaps replaces, is websockets, also made by Joe Cheng.
  • websockets: A simple HTML5 websocket interface for R, made by Joe Cheng.
  • Plot.ly is a company that allows you to create visualizations on the web using R (and Python). They have an R package in development here, as well as access to their services via an API here.
  • The WADL package provides tools to process Web Application Description Language (WADL) documents and to programmatically generate R functions to interface to the REST methods described in those WADL documents.
  • The RDCOMServer provides a mechanism to export R objects as (D)COM objects in Windows. It can be used along with the RDCOMClient package which provides user-level access from R to other COM servers.
  • The RSelenium package (not on CRAN) provides a set of R bindings for the Selenium 2.0 webdriver using the JsonWireProtocol. Selenium automates browsers. Using RSelenium you can automate browsers locally or remotely. This can aid in automated application testing, load testing and web scraping. Examples are given interacting with popular projects such as shiny and sauceLabs.
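
To give a feel for the web frameworks listed above, here is a minimal sketch of a shiny application. The input id "n", the output id "hist", and the slider range are arbitrary choices for illustration; the app object is only constructed here, and runApp() would be called from an interactive session to serve it.

```r
library(shiny)

# User interface: one slider and one plot.
ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal sample")
  })
}

# Bundle UI and server into an app object; nothing is served yet.
app <- shinyApp(ui = ui, server = server)

# In an interactive session:
# runApp(app)
```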

JavaScript

  • ggvis (not on CRAN) makes it easy to describe interactive web graphics in R. It fuses the ideas of ggplot2 and shiny, rendering graphics on the web with Vega.
  • rCharts (not on CRAN) allows for interactive javascript charts from R.
  • rVega (not on CRAN) is an R wrapper for Vega.
  • clickme (not on CRAN) is an R package to create interactive plots.
  • animint (not on CRAN) allows an interactive animation to be defined using a list of ggplots with clickSelects and showSelected aesthetics, then exported to CSV/JSON/D3/JavaScript for viewing in a web browser.
  • The SpiderMonkey package provides a means of evaluating JavaScript code, creating JavaScript objects and calling JavaScript functions and methods from within R. This can work by embedding the JavaScript engine within an R session or by embedding R in a browser such as Firefox and being able to call R from JavaScript and call back to JavaScript from R.

Data Sources on the Web Accessible via R

Ecological and Evolutionary Biology

  • rvertnet: A wrapper to the VertNet collections database API.
  • rgbif: Interface to the Global Biodiversity Information Facility API methods.
  • rfishbase: A programmatic interface to fishbase.org.
  • treebase: An R package for discovery, access and manipulation of online phylogenies.
  • taxize: Taxonomic information from around the web.
  • dismo: Species distribution modeling, with wrappers to some APIs.
  • rnbn (not on CRAN): Access to the UK National Biodiversity Network data.
  • rWBclimate (not on CRAN): R interface for the World Bank climate data.
  • rbison: Wrapper to the USGS Bison API.
  • neotoma (not on CRAN): Programmatic R interface to the Neotoma Paleoecological Database.
  • rnpn (not on CRAN): Wrapper to the National Phenology Network database API.
  • rfisheries: Package for interacting with fisheries databases at openfisheries.org.
  • rebird: A programmatic interface to the eBird database.
  • flora: Retrieve taxonomical information of botanical names from the Flora do Brasil website.
  • Rcolombos: This package provides programmatic access to Colombos, a web based interface for exploring and analyzing comprehensive organism-specific cross-platform expression compendia of bacterial organisms.
  • Reol: An R interface to the Encyclopedia of Life (EOL) API. Includes functions for downloading and extracting information off the EOL pages.
  • rPlant: An R interface to the many computational resources iPlant offers through their RESTful application programming interface. Currently, rPlant functions interact with the iPlant foundational API, the Taxonomic Name Resolution Service API, and the Phylotastic Taxosaurus API. Before using rPlant, users will have to register with the iPlant Collaborative. http://www.iplantcollaborative.org/discover/discovery-environment
  • ecoengine: The ecoengine (http://ecoengine.berkeley.edu/) provides access to more than 2 million georeferenced specimen records from the Berkeley Natural History Museums (http://bnhm.berkeley.edu/).
  • spocc: A programmatic interface to many species occurrence data sources, including GBIF, USGS's BISON, iNaturalist, Berkeley Ecoinformatics Engine, eBird, AntWeb, and more as the sources become easily available.

Genes and Genomes

  • cgdsr: R-Based API for accessing the MSKCC Cancer Genomics Data Server (CGDS).
  • rsnps: This package is a programmatic interface to various SNP datasets on the web: openSNP, NCBI's dbSNP database, and Broad Institute SNP Annotation and Proxy Search. This package started as a library to interact with openSNP alone, so most functions deal with openSNP.
  • rentrez: Talk with NCBI entrez using R.
  • seqinr: Exploratory data analysis and data visualization for biological sequence (DNA and protein) data.
  • seq2R: Detect compositional changes in genomic sequences - with some interaction with GenBank. Archived on CRAN.
  • primerTree: Visually Assessing the Specificity and Informativeness of Primer Pairs.
  • hoardeR: Information retrieval from NCBI databases, with main focus on Blast.
  • RISmed: Download content from NCBI databases. Intended for analyses of NCBI database content, not reference management. See rpubmed for more literature oriented stuff from NCBI.

Earth Science

  • RNCEP: Obtain, organize, and visualize NCEP weather data.
  • crn: Provides the core functions required to download and format data from the Climate Reference Network. Both daily and hourly data are downloaded from the ftp site, a consolidated file of all stations is created, and station metadata is extracted. In addition, functions for selecting individual variables and creating R-friendly datasets for them are provided.
  • BerkeleyEarth: Data input for Berkeley Earth Surface Temperature. Archived on CRAN.
  • waterData: An R Package for retrieval, analysis, and anomaly calculation of daily hydrologic time series data.
  • CHCN: A compilation of historical through contemporary climate measurements scraped from the Environment Canada website, including tools for scraping data, creating metadata and formatting temperature files.
  • decctools: Provides functions for retrieving energy statistics from the United Kingdom Department of Energy and Climate Change and related data sources. The current version focuses on total final energy consumption statistics at the local authority, MSOA, and LSOA geographies. Methods for calculating the generation mix of grid electricity and its associated carbon intensity are also provided.
  • Metadata: Collates metadata for climate surface stations. Archived on CRAN.
  • sos4R: A client for Sensor Observation Services (SOS) as specified by the Open Geospatial Consortium (OGC). It allows users to retrieve metadata from SOS web services and to interactively create requests for near real-time observation data based on the available sensors, phenomena, observations etc. using thematic, temporal and spatial filtering.
  • raincpc: The Climate Prediction Center's (CPC) daily rainfall data for the entire world, from 1979 to the present, at a resolution of 50 km (0.5 degrees lat-lon). This package provides functionality to download and process the raw data from CPC. Development version on GitHub here.
  • weatherData: Functions that help in fetching weather data from websites. Given a location and a date range, these functions help fetch weather data (temperature, pressure etc.) for any weather related analysis.
  • soilDB: A collection of functions for reading data from USDA-NCSS soil databases.
  • rnoaa: R interface to NOAA Climate data API.

Economics and Business

  • WDI: Search, extract and format data from the World Bank's World Development Indicators.
  • The Zillow package provides an R interface to the Zillow Web Service API. It allows one to get the Zillow estimate for the price of a particular property specified by street address and ZIP code (or city and state), to find information (e.g. size of property and lot, number of bedrooms and bathrooms, year built.) about a given property, and to get comparable properties.
  • sweSCB: Interface for the REST API of Statistics Sweden. Fetch information on data hierarchy stored behind the API; extract metadata; fetch actual data; and clean up results.

Finance

  • RDatastream (not on CRAN): An R interface to the Thomson Dataworks Enterprise SOAP API (paid), with some convenience functions for retrieving Datastream data specifically.
  • Datastream2R (not on CRAN): Another package for accessing the Datastream service. This package downloads data from the Thomson Reuters DataStream DWE server, which provides XML access to the Datastream database of economic and financial information.
  • quantmod: Functions for financial quantitative modelling as well as data acquisition, plotting and other utilities.
  • TFX: Connects to TrueFX(tm) for free streaming real-time and historical tick-by-tick market data for dealable interbank foreign exchange rates with millisecond detail.
  • fImport: Environment for teaching "Financial Engineering and Computational Finance"
  • Rbitcoin: Interact with Bitcoin. Supports both public and private API calls, HTTP over SSL, debug messages from Rbitcoin and RCurl, and error handling.
  • Thinknum: Interacts with the Thinknum API.
  • pdfetch: A package for downloading economic and financial time series from public sources.

Chemistry

  • rpubchem: Interface to the PubChem Collection.

Agriculture

  • FAOSTAT: The package hosts a list of functions to download, manipulate, construct and aggregate agricultural statistics provided by the FAOSTAT (Food and Agricultural Organization of the United Nations) database.
  • cimis: R package for retrieving data from CIMIS, the California Irrigation Management Information System.

Literature, Metadata, Text, and Altmetrics

  • rplos: A programmatic interface to the Web Service methods provided by the Public Library of Science journals for search.
  • rbhl: R interface to the Biodiversity Heritage Library (BHL) API.
  • rmetadata (not on CRAN): Get scholarly metadata from around the web.
  • RMendeley: Implementation of the Mendeley API in R.
  • rentrez: Talk with NCBI entrez using R.
  • rorcid (not on CRAN): A programmatic interface to the Orcid.org API.
  • rpubmed (not on CRAN): Tools for extracting and processing Pubmed and Pubmed Central records.
  • rAltmetric: Query and visualize metrics from Altmetric.com.
  • alm: R wrapper to the article-level metrics (ALM) API platform developed by PLoS.
  • ngramr: Retrieve and plot word frequencies through time from the Google Ngram Viewer.
  • scholar provides functions to extract citation data from Google Scholar. Convenience functions are also provided for comparing multiple scholars and predicting future h-index values.
  • The Sxslt package is an R interface to Dan Veillard's libxslt translator. It allows R programmers to use XSLT directly from within R, and also allows XSL code to make use of R functions.
  • The Aspell package provides an interface to the aspell library for checking the spelling of words and documents.
  • OAIHarvester: Harvest metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
  • RefManageR: Import and manage BibTeX and BibLaTeX references with RefManageR.
  • pubmed.mineR: An R package for text mining of PubMed Abstracts. Supports fetching text and XML from PubMed.
  • tm.plugin.webmining: Retrieve structured text data from various web sources. Facilitates text retrieval from feed formats like XML (RSS, ATOM) and JSON. Also direct retrieval from HTML is supported. As most (news) feeds only incorporate small fractions of the original text tm.plugin.webmining even retrieves and extracts the text of the original text source. See the vignette for an intro.

Marketing

  • anametrix: Bidirectional connector to Anametrix API.

Data Depots

  • dvn: Provides access to The Dataverse Network API.
  • rfigshare: Programmatic interface for Figshare.
  • factualR: Thin wrapper for the Factual.com server API.
  • dataone: A package that provides read/write access to data and metadata from the DataONE network of Member Node data repositories.
  • yhatr: Lets you deploy, maintain, and invoke models via the Yhat REST API.
  • RSocrata: Provided with a Socrata dataset resource URL, or a Socrata SoDA web API query, returns an R data frame. Converts dates to POSIX format. Supports CSV and JSON. Manages throttling by Socrata.
  • Quandl: A package that interacts directly with the Quandl API to offer data in a number of formats usable in R, as well as the ability to upload and search.
  • rdatamarket: Fetches data from DataMarket.com, either as timeseries in zoo form (dmseries) or as long-form data frames (dmlist).
  • infochimps: An R wrapper for the infochimps.com API services, from Drew Conway. The CRAN version is archived. Development on Github.

Machine Learning as a Service

  • bigml: BigML, a machine learning web service.
  • MTurkR: Access to Amazon Mechanical Turk Requester API via R.

Web Analytics

  • rgauges: This package provides functions to interact with the Gaug.es API. Gaug.es is a web analytics service, like Google analytics. You have to have a Gaug.es account to use this package.
  • RSiteCatalyst: Functions for accessing the Adobe Analytics (Omniture SiteCatalyst) Reporting API.
  • r-google-analytics (not on CRAN): Provides access to Google Analytics.
  • RGoogleTrends provides programmatic access to Google Trends data. This is information about the popularity of a particular query.

News

  • GuardianR: Provides an interface to the Open Platform's Content API of the Guardian Media Group. It retrieves content from news outlets The Observer, The Guardian, and guardian.co.uk from 1999 to current day.
  • RNYTimes provides interfaces to several of the New York Times Web services for searching articles, meta-data, user-generated content and best seller lists.

Images, Graphics, Videos, Music

  • imguR: A package to share plots using the image hosting service imgur.com. (also see the function imgur_upload() in knitr, which uses the newer Imgur API version 3)
  • RLastFM: A package to interface to the last.fm API. Archived on CRAN.
  • The RUbigraph package provides an R interface to a Ubigraph server for drawing interactive, dynamic graphs. You can add and remove vertices/nodes and edges in a graph and change their attributes/characteristics such as shape, color, size.

Sports

  • nhlscrapr: Compiling the NHL Real Time Scoring System Database for easy use in R.
  • pitchRx: Tools for Collecting and Visualizing Major League Baseball PITCHfx Data
  • bbscrapeR (not on CRAN yet): Tools for Collecting Data from nba.com and wnba.com
  • fbRanks: Association Football (Soccer) Ranking via Poisson Regression - uses time dependent Poisson regression and a record of goals scored in matches to rank teams via estimated attack and defense strengths.

Maps

  • RgoogleMaps: This package serves two purposes: It provides a comfortable R interface to query the Google server for static maps, and use the map as a background image to overlay plots within R.
  • The R2GoogleMaps package, which is different from RgoogleMaps, provides a mechanism to generate JavaScript code from R that displays data using Google Maps.
  • osmar: This package provides infrastructure to access OpenStreetMap data from different sources to work with the data in common R manner and to convert data into available infrastructure provided by existing R packages (e.g., into sp and igraph objects).
  • ggmap: Allows for the easy visualization of spatial data and models on top of Google Maps, OpenStreetMaps, Stamen Maps, or CloudMade Maps using ggplot2.
  • The GeoIP package maps IP addresses and host names to geographic locations - latitude, longitude, region, city, zip code, etc.
  • The RKML package provides users with high-level facilities to generate KML, the Keyhole Markup Language, for display in, e.g., Google Earth.
  • RKMLDevice creates R graphics in KML format so that they can be displayed on Google Earth (or Google Maps).
  • LeafletR allows you to display your spatial data on interactive web-maps using the open-source JavaScript library Leaflet.

Social media

  • streamR: This package provides a series of functions that allow R users to access Twitter's filter, sample, and user streams, and to parse the output into data frames. OAuth authentication is supported.
  • twitteR: Provides an interface to the Twitter web API.
  • The Rflickr package provides an R interface to the Flickr photo management and sharing application Web service.
  • Rfacebook: Provides an interface to the Facebook API.
  • plusser has been designed to facilitate the retrieval of Google+ profiles, pages and posts. It also provides search facilities. Currently a Google+ API key is required for accessing Google+ data.

Government

  • wethepeople: An R client for interacting with the White House's "We The People" petition API.
  • govStatJPN: Functions to get public survey data in Japan.
  • acs: Download, manipulate, and present data from the US Census American Community Survey.

Google Web Services

  • RGoogleStorage provides programmatic access to the Google Storage API. This allows R users to access and store data on Google's storage. We can upload and download content, create, list and delete folders/buckets, and set access control permissions on objects and buckets.
  • The RGoogleDocs package is an example of using the RCurl and XML packages to quickly develop an interface to the Google Documents API.
  • translate: Bindings for the Google Translate API v2
  • googlePublicData: An R library to build Google's public data explorer DSPL metadata files.
  • googleVis: Interface between R and the Google chart tools.
  • gooJSON: A Google JSON data interpreter for R which contains a suite of helper functions for obtaining data from the Google Maps API JSON objects.
  • plotGoogleMaps: Plot SP or SPT (STIDF, STFDF) data as an HTML map mashup over Google Maps.
  • plotKML: Visualization of spatial and spatio-temporal objects in Google Earth.
  • bigrquery (not on CRAN): An interface to Google's bigquery from R.
  • GFusionTables (not on CRAN): An R interface to Google Fusion Tables. Google Fusion Tables is a data management system in the cloud. This package provides R functions to browse the Fusion Tables catalog, retrieve data from Fusion Tables data storage into R, and upload data from R to Fusion Tables.

Amazon Web Services

  • AWS.tools: An R package to interact with Amazon Web Services (EC2/S3).
  • The RAmazonS3 package provides the basic infrastructure within R for communicating with the S3 Amazon storage server. This is a commercial server that allows one to store content and retrieve it from any machine connected to the Internet.
  • RAmazonDBREST provides an interface to Amazon's Simple DB API.
  • MTurkR: Access to Amazon Mechanical Turk Requester API via R.

Other

  • sos4R: R client for the OGC Sensor Observation Service.
  • datamart: Provides an S4 infrastructure for unified handling of internal datasets and web based data sources. Examples include dbpedia, eurostat and sourceforge.
  • rDrop (not on CRAN): Dropbox interface.
  • zendeskR: This package provides an R wrapper for the Zendesk API.
  • AWS.tools: An R package to interact with Amazon Web Services (EC2/S3).
  • The qualtrics package provides functions to interact with the Qualtrics online survey tool.
