`>` More Data Science with R

François Briatte
Spring 2024. Work VERY MUCH in progress.

A follow-up to an introduction to data science with R, RStudio, and the {tidyverse} packages, still aimed at social scientists. This course requires some prior training in introductory statistics and regression modelling.

N.B. -- the current repo does not include the full set of datasets used during the semester, which are all publicly available. Future versions will include the full data and slides.

Outline

1. Software
2. Revisions
3. SQL databases
4. Web scraping
5. Linear models
6. Panel data
7. Survey data
8. Feedback
9. Multilevel data
10. Machine learning in R
11. Machine learning in Python
12. Dashboards

Bonus sections:

Dependencies
Credits
Elsewhere

1. Software

R and RStudio
R Markdown notebooks
Code execution

A session to get started again with R and RStudio, this time through R Markdown notebooks, which are dynamic documents that can combine text and images with code as well as plots and other kinds of results.

> Demo: LGBTI inclusivity in OECD countries

2. Revisions

The tidyverse package bundle
More R Markdown
Data pivots

A general-revisions session that covers data wrangling and visualization with various packages of the tidyverse bundle. Now is the right time to take a look at cheatsheets and similar material.

> Demo: U.S. life expectancy (code by Kieran Healy)

3. SQL databases

Row-wise operations and complex joins with dplyr
SQL databases with dbplyr
Regular expressions (regex) with stringr

A session focused on advanced data wrangling. SQL databases, in particular, are what you will need when you need speed and/or out-of-memory computation on very (possibly very, very) large data.
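As a rough illustration of the out-of-memory idea, here is a minimal sketch that queries a database table through dplyr without loading it into R first. It assumes that the {DBI}, {RSQLite}, {dplyr} and {dbplyr} packages are installed. It uses a throwaway in-memory SQLite database filled with the built-in `mtcars` data, which is a stand-in chosen for this example and not one of the course datasets.

```r
library(DBI)
library(dplyr)

# Assumption: a temporary in-memory SQLite database, for illustration only
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

# Lazy reference to the table: no rows are pulled into R at this point
cars_db <- dplyr::tbl(con, "mtcars")

# The pipeline is translated to SQL and executed by the database
avg_mpg <- cars_db %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

# Inspect the SQL that dbplyr generated
show_query(avg_mpg)

# Only now are the (small) aggregated results brought into R
result <- collect(avg_mpg)

DBI::dbDisconnect(con)
```

The point of the design is that only the aggregated result crosses from the database into R, so the same code should carry over to tables that are too large to fit in memory.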

> Demo: Government cabinet composition (ParlGov data, code by Holger Döring)

4. Web scraping

HTTP with httr
XPath with rvest and xml2
API endpoints

Another session focused on advanced data wrangling. Web scraping is what you will need if your data are trapped online in Web pages.
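To give a sense of what extracting data from a Web page looks like, here is a minimal sketch that assumes {rvest} (version 1.0 or later) is installed. The HTML snippet, the table id and the reactor and country names are all made up for this example, and no real website is queried.

```r
library(rvest)

# Assumption: a hypothetical page, built inline rather than downloaded
page <- minimal_html('
  <table id="reactors">
    <tr><th>name</th><th>country</th></tr>
    <tr><td>Reactor A</td><td>Country X</td></tr>
    <tr><td>Reactor B</td><td>Country Y</td></tr>
  </table>
')

# XPath query: first cell of every row in the table
reactor_names <- html_text2(
  html_elements(page, xpath = "//table[@id='reactors']//tr/td[1]")
)

# Alternatively, parse the whole table into a data frame
reactors <- html_table(html_element(page, "#reactors"))
```

With a real page, `minimal_html()` would typically be replaced by `read_html()` pointed at the page URL, subject to the terms of use of the website.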

> Demo: Locating nuclear reactors worldwide (data from the IAEA)

5. Linear models

Model estimation and manipulation with broom
Linear diagnostics with performance

Mostly revisions of what was covered in the introductory course.

> Demo: Worldwide fertility rates (QOG/World Bank data)

6. Panel data

Panel data structure
Fixed-effects estimation with fixest and plm
Cluster-robust standard errors (CRSEs)

> Demo: Worldwide fertility rates (QOG/World Bank data)

7. Survey data

Survey weighting
Survey-weighted operations with survey and srvyr
Generalized linear models (GLMs)

> Demo: EU skepticism and migration (ESS data, code by Holger Döring)

8. Feedback

Feedback on your first drafts, and recommendations for the coming weeks.

9. Multilevel data

Multilevel (hierarchical) data
Multilevel (mixed) model estimation with lme4

> Demo: EU skepticism and migration, continued (ESS data, code by Holger Döring)

10. Machine learning in R

Machine learning essentials
Decision trees and random forests
The tidymodels package bundle

> Demo: White Trump voters (CCES data, code by Steven Miller)

11. Machine learning in Python

Jupyter notebooks and Google Colab
Text mining basics
Example algorithms from the scikit-learn library

> Demo: Trump tweets (Twitter data, code by Bernhard Rieder)

12. Dashboards

The flexdashboard package
Maps with sf and Leaflet
General wrap-up

> Demo: Worldwide air pollution (World Bank data, code by Paul Moraga)

Dependencies

pkg_data <- c("countrycode", "rsdmx", "RSQLite", "sf", "tidyverse")
# ... also installs {DBI} and {rvest}, inter alia
pkg_models <- c("easystats", "lme4", "plm", "fixest", "tidymodels")
# ... installs a lot of essentials, such as {performance}
pkg_tables <- c("broom", "broom.mixed", "DT", "modelsummary", "texreg")
pkg_varia <- c("flexdashboard", "leaflet")

# install.packages("remotes")
for (i in c(pkg_data, pkg_models, pkg_tables, pkg_varia)) {
  remotes::install_cran(i)
}

Credits

The DSR README has a list of relevant credits.

Elsewhere

More to come.
