briatte / mdsr

More Data Science with R (2024)

Home Page: https://f.briatte.org/

> More Data Science with R

François Briatte
Spring 2024. Work VERY MUCH in progress.

A follow-up to an introduction to data science with R, RStudio, and the {tidyverse} packages, still aimed at social scientists. This course requires some prior training in introductory statistics and regression modelling.

N.B. -- the current repo does not include the full set of datasets used during the semester, which are all publicly available. Future versions will include the full data and slides.

Outline

  1. Software
  2. Revisions
  3. SQL databases
  4. Web scraping
  5. Linear models
  6. Panel data
  7. Survey data
  8. Feedback
  9. Multilevel data
  10. Machine learning in R
  11. Machine learning in Python
  12. Dashboards

Bonus sections:

1. Software

  • R and RStudio
  • R Markdown notebooks
  • Code execution

A session to get started again with R and RStudio, this time through R Markdown notebooks, which are dynamic documents that can combine text and images with code as well as plots and other kinds of results.

> Demo: LGBTI inclusivity in OECD countries

2. Revisions

  • The tidyverse package bundle
  • More R Markdown
  • Data pivots

A general-revisions session that covers data wrangling and visualization with various packages of the tidyverse bundle. Now is the right time to take a look at cheatsheets and similar material.
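As a reminder of what a data pivot looks like, here is a minimal sketch using {tidyr}. The data frame and its values are made up for illustration and are not taken from the course demo.

```r
library(tidyr)

# Made-up wide data: one row per country, one column per year
wide <- data.frame(
  country = c("A", "B"),
  `2000` = c(70.1, 75.3),
  `2010` = c(72.4, 77.0),
  check.names = FALSE
)

# Wide to long: one row per country-year
long <- pivot_longer(
  wide,
  cols = -country,
  names_to = "year",
  values_to = "life_exp",
  names_transform = list(year = as.integer)
)

# Long back to wide: one column per year again
back <- pivot_wider(long, names_from = year, values_from = life_exp)
```

The long format is usually the one that {ggplot2} and {dplyr} expect; the wide format is mostly useful for display.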

> Demo: U.S. life expectancy (code by Kieran Healy)

3. SQL databases

  • Row-wise operations and complex joins with dplyr
  • SQL databases with dbplyr
  • Regular expressions (regex) with stringr

A session focused on advanced data wrangling. SQL databases, in particular, are what you will need when you need speed and/or out-of-memory computation on very (possibly very, very) large data.
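The idea of out-of-memory computation can be sketched as follows, assuming {DBI}, {RSQLite}, {dplyr} and {dbplyr} are installed. The in-memory SQLite database and the built-in `mtcars` dataset are stand-ins chosen for illustration, not the course data.

```r
library(DBI)
library(dplyr)

# Illustrative in-memory SQLite database; a real one would live on disk or on a server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# tbl() creates a lazy reference: no rows are loaded into R at this point
cars_db <- tbl(con, "mtcars")

# dplyr verbs are translated to SQL by dbplyr and executed by the database
avg_mpg <- cars_db |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  arrange(cyl)

show_query(avg_mpg)          # prints the generated SQL
result <- collect(avg_mpg)   # runs the query and returns a tibble

dbDisconnect(con)
```

Only the aggregated result crosses into R memory, which is the point when the underlying table is too large to load.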

> Demo: Government cabinet composition (ParlGov data, code by Holger Döring)

4. Web scraping

  • HTTP with httr
  • XPath with rvest and xml2
  • API endpoints

Another session focused on advanced data wrangling. Web scraping is what you will need if your data are trapped online in Web pages.
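A minimal sketch of XPath-based extraction with {rvest} follows. To stay self-contained, it parses an inline HTML snippet instead of fetching a live page; the table, its `id`, and its contents are placeholders invented for illustration.

```r
library(rvest)

# Placeholder HTML; a real page would be read with read_html("<url>")
page <- minimal_html('
  <table id="reactors">
    <tr><th>Name</th><th>Country</th></tr>
    <tr><td>Reactor A</td><td>Country X</td></tr>
    <tr><td>Reactor B</td><td>Country Y</td></tr>
  </table>
')

# XPath: first cell of every row in the table with id "reactors"
reactor_names <- page |>
  html_elements(xpath = "//table[@id='reactors']//tr/td[1]") |>
  html_text2()

# Shortcut for well-formed tables: parse the whole table into a tibble
reactors <- page |>
  html_element("table") |>
  html_table()
```

Real pages are rarely this tidy, so the XPath expression is usually where most of the work goes.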

> Demo: Locating nuclear reactors worldwide (data from the IAEA)

5. Linear models

  • Model estimation and manipulation with broom
  • Linear diagnostics with performance

Mostly revisions of what was covered in the introductory course.
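As a quick refresher, the three main {broom} verbs can be sketched on a toy model. The built-in `mtcars` dataset and the model formula are arbitrary choices for illustration, not the course demo.

```r
library(broom)

# Toy linear model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

coefs <- tidy(fit, conf.int = TRUE)  # one row per coefficient
stats <- glance(fit)                 # one row of model-level statistics
obs   <- augment(fit)                # one row per observation, with fitted values and residuals
```

All three return data frames, so model results can be filtered, joined and plotted like any other data.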

> Demo: Worldwide fertility rates (QOG/World Bank data)

6. Panel data

  • Panel data structure
  • Fixed-effects estimation with fixest and plm
  • Cluster-robust standard errors (CRSEs)
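The topics above can be sketched together with {fixest}, here on simulated data. The variable names (`unit`, `year`, `x`, `y`) and the data-generating process are assumptions made up for this example.

```r
library(fixest)

set.seed(1)

# Simulated balanced panel: 20 units observed over 5 years
panel <- data.frame(
  unit = rep(1:20, each = 5),
  year = rep(2001:2005, times = 20)
)
panel$x <- rnorm(nrow(panel))

# Time-invariant unit effect, which the fixed effects absorb
unit_effect <- rnorm(20)[panel$unit]
panel$y <- 0.5 * panel$x + unit_effect + rnorm(nrow(panel))

# Two-way fixed effects (unit and year), standard errors clustered by unit
fit <- feols(y ~ x | unit + year, data = panel, cluster = ~unit)
summary(fit)
```

The part of the formula after `|` lists the fixed effects, and `cluster = ~unit` requests cluster-robust standard errors at the unit level.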

> Demo: Worldwide fertility rates (QOG/World Bank data)

7. Survey data

  • Survey weighting
  • Survey-weighted operations with survey and srvyr
  • Generalized linear models (GLMs)

> Demo: EU skepticism and migration (ESS data, code by Holger Döring)

8. Feedback

Feedback on your first drafts, and recommendations for the coming weeks.

9. Multilevel data

  • Multilevel (hierarchical) data
  • Multilevel (mixed) model estimation with lme4

> Demo: EU skepticism and migration, continued (ESS data, code by Holger Döring)

10. Machine learning in R

  • Machine learning essentials
  • Decision trees and random forests
  • The tidymodels package bundle

> Demo: White Trump voters (CCES data, code by Steven Miller)

11. Machine learning in Python

  • Jupyter notebooks and Google Colab
  • Text mining basics
  • Example algorithms from the scikit-learn library

> Demo: Trump tweets (Twitter data, code by Bernhard Rieder)

12. Dashboards

  • The flexdashboard package
  • Maps with sf and Leaflet
  • General wrap-up

> Demo: Worldwide air pollution (World Bank data, code by Paul Moraga)


Dependencies

pkg_data <- c("countrycode", "rsdmx", "RSQLite", "sf", "tidyverse")
# ... also installs {DBI} and {rvest}, inter alia
pkg_models <- c("easystats", "lme4", "plm", "fixest", "tidymodels")
# ... installs a lot of essentials, such as {performance}
pkg_tables <- c("broom", "broom.mixed", "DT", "modelsummary", "texreg")
pkg_varia <- c("flexdashboard", "leaflet")

# install.packages("remotes")
for (i in c(pkg_data, pkg_models, pkg_tables, pkg_varia)) {
  remotes::install_cran(i)
}

Credits

The DSR README has a list of relevant credits.

Elsewhere

More to come.
