Coding for Data Analysis with R

Introduction to Data Analysis with R - lecture materials by Ágoston Reguly (CEU) with Gábor Békés (CEU, KRTK, CEPR)

This course material is a supplement to Data Analysis for Business, Economics, and Policy by Gábor Békés (CEU) and Gábor Kézdi (U. Michigan), Cambridge University Press, 2021.

Textbook information: see the textbook's website gabors-data-analysis.com or visit Cambridge University Press

To get a copy: Inspection copy for instructors or buy from Amazon or order online around the globe

Acknowledgments

We thank the CEU Department of Economics and Business for financial support.

Status

This is version 0.2. (2022-07-11)

Comments are very welcome by email or as a GitHub issue.

Overview

The course serves as an introduction to the R programming language and software environment for data exploration, data munging, data visualization, reporting, and modeling.

Lectures 1 to 11 complement Part I: Data Exploration (Chapters 1-6). They focus on basic programming principles, data structures, data cleaning, data exploration with descriptive statistics and graphs, and simple hypothesis testing. This is an introductory package for learning R and using it for exploration and some basic analysis.

Lectures 12 to 20 complement Part II: Regression Analysis (Chapters 7-12). They focus on statistical methods such as nonparametric regression, simple and multiple linear regression on cross-sectional data, binary outcome models, and simple time-series analysis, while adding a more advanced toolkit for visualization and reporting. This is a regression-focused package with advanced features for analysis, including RMarkdown.

Lectures 21 to 27 complement Part III: Prediction (Chapters 13-18). These lectures are not intended to be part of an introductory R course, but rather a more advanced seminar to support Data Analysis with machine learning tools for prediction. In this seminar-style course, students cover topics such as model selection with cross-validation, LASSO, Ridge, or Elastic Net regularization, regression trees with CART, random forest, and boosting. These methods are applied to cross-sectional data, mainly with continuous outcomes, but also with binary outcomes to model probabilities and handle classification problems. Long-run and short-run time series modeling, including ARIMA and VAR models, is also covered. To understand this material properly, students should first complete coding lectures 1 to 19.

Teaching philosophy

We believe students learn R by writing scripts and solving problems on their own. We provide and demonstrate good practices for carrying out such tasks, but extensive practice is needed.

This is not a hardcore coding course, but a course to supplement data analysis. The material focuses on specific issues in this topic and balances between higher levels of coding such as tidyverse -- which is more intuitive and easier to learn, but less flexible -- and lower levels in the form of basic coding principles -- which allow greater complexity and deeper understanding, but require much more practice and have a steeper learning curve.
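
As a rough illustration of this trade-off (not taken from the lecture scripts), the sketch below computes the same group means twice on R's built-in `mtcars` data: once with a `dplyr` pipeline and once with a base-R `for` loop. The dataset and the variable choices are assumptions made only for this example.

```r
library(dplyr)

# Higher-level approach: a dplyr pipeline (dplyr is part of the tidyverse)
means_tidy <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(cyl)

# Lower-level approach: the same calculation with a base-R loop
cyl_values <- sort(unique(mtcars$cyl))
mean_mpg <- numeric(length(cyl_values))
for (i in seq_along(cyl_values)) {
  mean_mpg[i] <- mean(mtcars$mpg[mtcars$cyl == cyl_values[i]])
}
means_base <- data.frame(cyl = cyl_values, mean_mpg = mean_mpg)

# Compare the two sets of group means
all.equal(means_tidy$mean_mpg, means_base$mean_mpg)
```

The pipeline is shorter and reads top to bottom, while the loop makes every step explicit, which is the kind of understanding the live-coding lectures aim for.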

The material structure reflects these principles. The majority of the lectures have pre-written code, which includes in-class tasks for practicing and facing problems, along with regular homework. This enables the instructor to show a greater variety of code, good coding examples, and far more commands and functions than live coding would, while leaving room for practice. For this type of lecture, homework is essential, as it helps students deepen their coding skills. There are also a few live-coding lectures, which require flexibility and more preparation from the teacher (the material provides detailed instructions). These lectures focus on basic coding principles such as the introduction to coding, functions, loops, and conditionals, and show students possible paths to hardcore coding, while showing alternative methods as well. Lectures 21-27 are exceptions: they are intended as seminar material to support theory and assume a good level of coding. They include no homework or in-class tasks.

Whether solutions to the in-class tasks or homework should be made available to students is always a good question. We believe that showing students the in-class solutions is beneficial and does not distort motivation, as slower learners may want to revise and compare the suggested solution to their own. Hence, for each lecture, we provide the solutions for these tasks. However, this is not the case for the homework. We found that showing homework solutions to the students rather depresses their motivation and creativity, therefore there are no solutions for the homework. (It is important to note that there are (infinitely) many good solutions for a homework, so we usually encourage students to try out different paths as well.)

How to use

This course material may be used as a basis for a course on learning to code in R for the purpose of analyzing data. It was developed to be taught simultaneously with the textbook, but it is rather comprehensive and thus may also be used independently, without any textbook.

We have not reinvented the coding wheel. Instead, we tried to adopt best practices and combine them with real-life case studies from the textbook.

There are no slides, but the code is heavily commented, so it should be easy to follow. In some cases, it is beneficial (but not necessary) to read the related case study and/or chapter to fully appreciate the code and comments.

Each lecture comes with an estimate of the time it needs, along with suggestions on how to shorten it if it is too long. The lectures contain -- on purpose -- more material than a classical 100-minute class per week for 12 weeks would cover. It is always easier to cut material than to add to it, and the taste of each instructor and/or class may differ. We highly encourage you to use each lecture as a starting point and modify it accordingly. Later, we propose an example schedule for one 100-minute class per week for a semester (12 weeks).

Sources

The material is based on multiple years of teaching coding courses at Central European University as well as advice from many great resources such as

Hadley Wickham and Garrett Grolemund: R for Data Science
Jae Yeon Kim: R Fundamentals for Public Policy, Course material
Winston Chang: R Graphics Cookbook
Andrew Heiss: Data Visualization with R
Grant McDermott: Data Science for Economists

and many others, listed in the lectures' READMEs.

Lectures, learning outcomes, and case-studies

The following table gives a brief summary of the lectures: the type of each lecture, the expected learning outcomes, and how it relates to the textbook's case studies and datasets.

Lecture	Lecture Type	Learning outcomes	Case-study	Dataset
PART I.
lecture00-intro	live coding or pre-written	Setting up R and RStudio. Introduction to the RStudio interface. Packages, trying out `tidyverse`, and knitting a pre-written RMarkdown file.	-	-
lecture01-coding-basics	live coding	Introduction to coding with R: R-objects, basic operations, functions, vectors, lists	-	-
lecture02-data-imp-n-exp	pre-written	How to import and export data with `readr` and APIs	-	hotels-vienna, football**
lecture03-tibbles	pre-written	Introduces `tibble`-s as data variable. Selecting, adding or removing rows (observations) and columns (variables). Convert to wide and long format. Merge two tibbles in multiple ways.	Ch 02C: Football Managers	football
lecture04-data-munging	pre-written	Intro to data munging with `dplyr`: add, remove, separate, convert variables, filter observations, etc.	Ch 02A: Hotels prep*	hotels-europe
lecture05-data-exploration	pre-written	Intro to data exploration: `modelsummary` for descriptive stats in various ways, `ggplot2` to plot one variable distributions (histogram, density) and two variable associations (scatter, bin-scatter), `t.test` for simple hypothesis testing.	Core: Ch06A: Online vs offline prices. Related: Ch03A: Hotels: exploration, Ch04A: Management & firm size	billion-prices, wms-management-survey**
lecture06-rmarkdown101	pre-written	Intro to RMarkdown: knitting pdf and Html. Structure of RMarkdown, formatting text, plots and tables.	Ch06A: Online vs offline prices*	billion-prices, hotels-europe**
lecture07-ggplot-indepth	pre-written	Tools to customize `ggplot2` graphs. Write your own theme. Bar charts, box and violin plots. `theme_bg()` and `source()` from file and URL.	Ch03B: Hotels: Vienna vs London	hotels-europe
lecture08-conditionals	live coding	Conditional programming: if-else statements, logical operations with vectors, creating new variables with conditionals.	-	wms-management
lecture09-loops	live coding	Imperative programming with `for` and `while` loops. Exercise to calculate yearly sp500 returns.	Ch05A: Loss on stock portfolio	sp500
lecture10-random-numbers	live coding	Introduction to random number generators and random sampling.	Ch03D: Height and income, Ch05A: Loss on a stock portfolio*	height-income-distributions, sp500
lecture11-functions	live coding	Writing functions: control for input(s) and output(s), error handling. User written confidence-intervals, sampling distribution for t-statistics, bootstrapping.	Ch05A: Loss on a stock portfolio?*, Good-to-know: Ch06A: Online vs offline prices and Ch06B: Testing loss on a stock portfolio	wms-management, sp500
PART II.
lecture12-intro-to-regression	pre-written	Intro to regressions: binary means, binscatters, non-parametric regression via lowess, simple linear regression. Predicted values and residuals.	Ch07A: Hotels with simple regression	hotels-vienna
lecture13-feature-engineering	pre-written	Intro to feature engineering. Covering variable transformations/manipulations which are used in the book/case-studies/this R course. Can be skipped, but good overview.	Ch01C: Data collection, Ch04A: Management & firm size* , Ch08C: Measurement error as HW, Ch17A: Predicting firm exit*	wms-management-survey, bisnode-firms, hotels-vienna**
lecture14-simple-regression	live coding	Level-level, log-level, level-log, log-log, polynomial and linear spline transformations for simple regressions. Weighted OLS. Graphical representation of these models. Model comparison, theory and statistical based decision for model choice.	Ch08B: Life expectancy, Ch08A: Hotels with non-linear as HW	worldbank-lifeexpectancy, hotels-vienna**
lecture15-advanced-linear-regression	pre-written	Introduction to multiple variable regression. Model evaluation: R2, prediction and error analysis with graphs. Confidence and prediction intervals. Robustness tests: checking parameter stability across time/location/type of obs.	Ch09B: Hotel stability, Ch10B: Hotels with multiple regression	hotels-europe
lecture16-binary-models	pre-written	Introduction to binary outcome models: saturated models, linear probability models, logit and probit models. Estimating average marginal effects for non-linear models, via `marginaleffects` and summarize by `modelsummary`. Evaluating models by R2, Pseudo-R2, Brier score and Log-loss. Comparison of predicted probabilities for certain groups and the distribution for different models. Bias of the model and calibration curve.	Ch11A: Smoking health risk	share-health
lecture17-dates-n-times	pre-written	Introduction to basic date and time variable manipulations. `lubridate` and rounding, differencing. Dataset aggregation, differenced and lagged variables, unit root tests. Visualize time series.	Ch12A: Returns: company vs market**	stocks-sp500
lecture18-timeseries-regression	pre-written	Introduction to time series analysis. Time-series data manipulations, simple visualizations and (partial) autocorrelation graph. Differencing, lags of outcome and explanatory variables and deterministic seasonality. Using Newey-West standard errors. Model comparison and estimating cumulative effects with valid SEs.	Ch12B: Electricity and temperature	arizona-electricity, case-shiller-la**
lecture19-advaced-rmarkdown	pre-written	RMarkdown formatting for a data analysis report. Chunks, general and local set-options, formatting figures, descriptive tables and model comparison tables. Equations, Greek letters and hypothesis testing. Organizing appendix.	Ch10A: Gender wage gap	cps-earnings
lecture20-basic-spatial-vizz	pre-written	Introduction to spatial visualization via `maps` (package based maps) and `rgdal` (user supplied maps). How to create a world map and show life expectancy or color the average hotel prices for London boroughs or Vienna districts. Handling maps via `geom_polygon` and setting the scaling, colors, etc.	Ch08B: Life expectancy* , Ch03B: Compare hotel prices Vienna vs London*	worldbank-lifeexpectancy, hotels-europe
PART III.
lecture21-cross-validation	seminar	Model comparison introduced by BIC and RMSE. Limitations of these comparisons. Cross-validation: using different samples to tackle overfitting. The `caret` package.	Ch13A Predicting used car value with linear regressions and Ch14A Predicting used car value: log prices	used-cars
lecture22-lasso	seminar	Feature engineering for LASSO: interactions and polynomials. Cross-validation in detail. LASSO (and RIDGE, Elastic Net) via `glmnet`. Training-test samples and the holdout sample to evaluate predictions. LASSO diagnostics.	Ch14B Predicting AirBnB apartment prices: selecting a regression model	airbnb
lecture23-regression-tree	seminar	Estimating regression tree via `rpart`. Understanding regression trees and comparing them to linear regressions. Tuning and setup of CART. Tree and variable importance plots.	CH15A Predicting used car value with regression trees	used-cars
lecture24-random-forest	seminar	Data cleaning and feature engineering specifics for random forest (RF). Estimate RFs via `ranger`. Examine the results of RFs with variable importance plots, and partial dependence plots, and check the quality of predictions in (important) subgroups. Gradient Boosting Method (GBM) via `gbm` package. Prediction comparisons (prediction horse-race) for OLS, LASSO, CART, RF, and GBM.	Ch16A Predicting apartment prices with random forest	airbnb
lecture25-classification-wML	seminar	Predicting probabilities and classification with machine learning tools. Cross validated logit models. LASSO with logit, CART, and Random Forest (bonus: why not use Classification Forest). Classification of probabilities, ROC curve, and AUC. Confusion Matrix. Model comparison via RMSE or AUC. User-defined loss function to weight false-positive and false-negative rate. Optimizing threshold value for classification to get best loss function value.	CH17A Predicting firm exit: probability and classification	bisnode-firms
lecture26-long-term-time-series-wML	seminar	Forecasting time series data on the long run. Feature engineering with time series, deciding transformations for stationarity. Cross-validation options with time series. Modeling with deterministic trend, seasonality and other dummy variables for long term horizon. Evaluation of model and forecast precision. `prophet` as machine learning tool for time series data.	Ch18A Forecasting daily ticket sales for a swimming pool	swim-transactions
lecture27-short-term-time-series-ARIMA-VAR	seminar	Forecasting time series data on the short run. Feature engineering with time series, deciding transformations for stationarity. Cross-validation options with time series. ARIMA and VAR models for short term forecasting. Evaluation of forecasts on short run: performance on hold out set, fan-chart to assess risks and stability of forecasting performance on an extended time period.	CH18B Forecasting a house price index	case-shiller-la

*case study was the base for the material, but coding material is modified

**only used in homework

Folder structure within lectures

Within each lecture there is the following folder structure:

raw_codes: includes codes, which are ready to use during the course but require some live coding in class.
complete_codes: includes codes with suggested solutions to codes in raw_codes
data: in some cases, there is a data folder, which includes data files (typically in '.csv'). I have found it crucial during live-coding classes to make sure everybody has the same data.
If there are no folders, then either:
- lecture has a notebook format, which implies a complete live-coding class (mostly introduction or technical "hard-core coding" lectures)
- lecture has a complete R-script. In this case, the lecturer should pay attention to the interpretation of the material itself rather than to coding. Typically this is for more advanced case studies (chapters 13-18), where there is no new coding technique, but interpreting the results might be challenging.
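
To see quickly which of these folders a given lecture contains, a small helper along the following lines may be useful. It is only a sketch: the function name `check_lecture_folders` is hypothetical, and the example path assumes the repository has been cloned into the current working directory.

```r
# Hypothetical helper: report which of the documented folders a lecture contains
check_lecture_folders <- function(lecture_dir) {
  expected <- c("raw_codes", "complete_codes", "data")
  present <- dir.exists(file.path(lecture_dir, expected))
  setNames(present, expected)
}

# Example call; the path is an assumption about where the repository was cloned
check_lecture_folders("lecture04-data-munging")
```

The result is a named logical vector, so a lecture with only a notebook or a single R script shows `FALSE` for all three folders.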

Learning outcomes and relation to the book

Probably the largest difference compared to the book is that data handling is the most challenging and most time-consuming part of coding, while it is a relatively small (but just as important!) part of the book. It is always a challenge to keep up with the material if the two courses (Data Analysis and Coding) run in parallel. Experience shows that lecture05-data-exploration in this course is the first truly common point with the book, and lecture06-rmarkdown101 enables students to submit data analysis material as PDF or HTML. This coding material was developed to catch up with the book as quickly as possible, showing the truly essential tools for handling data in an easy way. The result is that after 6 lectures from both courses (teaching Part I of the book) there is room for a common assignment in the form of a descriptive analysis: e.g. carry out a data-collection exercise, clean the data, and do exploratory analysis. The 'cost' is that, apart from some references or homework, there is no true connection between the two courses before lecture05-data-exploration, and students' data handling skills could still be improved further. Therefore, do not expect students to be able to solve (all of) the data exercises from the book (although there were some positive surprises over the years).

In contrast, Part II in the book deals with regressions of various forms. This is fairly simple from the coding perspective, which allows the lecturer to

deepen students' knowledge of basic coding principles;
add further data handling practices to students' toolkit, and
provide more skills in RMarkdown, while following the material of the book.

If the material is properly taught, there is no need for an extra coding course for Part III of the book, only a simple seminar-type supplement, which puts emphasis on the interpretation and practice of machine learning methods. This material is provided in the folder part-III-case-studies. In principle, after these materials students should be able to code by themselves and to understand and work with the case study materials related to Part IV.

Case studies and coding lectures

Alternatively, one can relate each case study from the book to specific lectures.

Chapter	Case-study	Lecture
Chapter 1	ch01-hotels-data-collect	lecture03-tibbles**
Chapter 2	ch02-football-manager-success	lecture03-tibbles*
	ch02-hotels-data-prep	lecture04-data-munging
	ch02-immunization-crosscountry	lecture04-data-munging**
Chapter 3	ch03-city-size-japan	lecture05-data-exploration**
	ch03-distributions-height-income	lecture05-data-exploration**
	ch03-football-home-advantage	lecture05-data-exploration**
	ch03-hotels-europe-compare	lecture05-data-exploration**, lecture07-ggplot-indepth
	ch03-hotels-vienna-explore	lecture05-data-exploration**
	ch03-simulations	lecture10-random-numbers
Chapter 4	ch04-management-firm-size	lecture05-data-exploration**, lecture07-ggplot-indepth
Chapter 5	ch05-stock-market-loss-generalize	lecture09-loops, lecture10-random-numbers, lecture11-functions
Chapter 6	ch06-online-offline-price-test	lecture05-data-exploration, lecture11-functions*
	ch06-stock-market-loss-test	lecture04-data-munging*, lecture11-functions
Chapter 7	ch07-hotels-simple-reg	lecture12-intro-to-regression
	ch07-ols-simulation	lecture12-intro-to-regression with lecture10-random-numbers
Chapter 8	ch08-hotels-measurement-error	lecture13-feature-engineering
	ch08-hotels-nonlinear	lecture14-simple-regression**
	ch08-life-expectancy-income	lecture14-simple-regression
Chapter 9	ch09-gender-age-earnings	lecture15-advanced-linear-regression**
	ch09-hotels-europe-stability	lecture15-advanced-linear-regression
Chapter 10	ch10-gender-earnings-understand	lecture15-advanced-linear-regression**, lecture19-advaced-rmarkdown
	ch10-hotels-multiple-reg	lecture15-advanced-linear-regression
Chapter 11	ch11-australia-rainfall-predict	lecture16-binary-models**
	ch11-smoking-health-risk	lecture16-binary-models
Chapter 12	ch12-electricity-temperature	lecture18-timeseries-regression
	ch12-stock-returns-risk	lecture17-dates-n-times**
	ch12-time-series-simulations	All of the following**: lecture17-dates-n-times, lecture09-loops and lecture10-random-numbers
Chapter 13	ch13-used-cars-reg	lecture21-cross-validation - first part
Chapter 14	ch14-used-cars-log	lecture21-cross-validation - second part
	ch14-airbnb-reg	lecture22-lasso
Chapter 15	ch15-used-cars-cart	lecture23-regression-tree
Chapter 16	ch16-airbnb-random-forest	lecture24-random-forest
Chapter 17	ch17-predicting-firm-exit	lecture25-classification-wML
Chapter 18	ch18-swimmingpool	lecture26-long-term-time-series-wML
	ch18-case-shiller-la	lecture27-short-term-time-series-ARIMA-VAR

*partial match: the case study is only used as a starting point for the lecture.

**students can understand and replicate material based on that lecture

Example course

As an example of a coding course with one 100-minute class per week for a semester (12 weeks), we have taught the following:

Class	Lecture(s)	Comments
Class 01	lecture00-intro, lecture01-coding-basics	Students are asked to install R, RStudio, and the `tidyverse` package and to knit an RMarkdown file before the class. From coding basics, some material (e.g. numeric vs integer vs double, indexing, or lists) is left out if time runs out.
Class 02	lecture02-data-imp-n-exp, lecture03-tibbles	Sometimes lecture03-tibbles is finished in the next class.
Class 03	lecture04-data-munging, start: lecture05-data-exploration	Ask about RMarkdown knitting.
Class 04	Finish: lecture05-data-exploration, lecture06-rmarkdown101	At this point, assess whether students understand the basics of coding and make sure nobody is struggling. From this class on, they should be able to prepare a project for the 6th week's assessment, which is 2 weeks from this point.
Class 05	lecture07-ggplot-indepth, lecture08-conditionals	This class provides some room for repetition or clarifying concepts.
Class 06	lecture09-loops, lecture10-random-numbers and lecture11-functions	Should be a more relaxed class, as students have many (other) assessments during these days; concentrate more on the joy of programming. Many students may already know this material, so try to come up with some entertaining tasks for them as well.
Class 07	lecture12-intro-to-regression, lecture13-feature-engineering	Feature engineering is new material, but fits here quite well. Class 07 should be after first class from Part II, which discusses Chapter 7.
Class 08	lecture14-simple-regression	Great opportunity for in-class (team) work for students with live coding.
Class 09	lecture15-advanced-linear-regression	Make sure students covered Chapter 10 from the book. If not, spatial data visualization is a great substitute here.
Class 10	lecture16-binary-models	In some cases this material is covered as a seminar from the course that discusses Part II. This provides an opportunity to fill any gaps or make class 12 not so dense, by jumping to the next class's material.
Class 11	lecture17-dates-n-times, lecture18-timeseries-regression	If short on time, skip lecture17-dates-n-times.
Class 12	lecture19-advaced-rmarkdown, lecture20-basic-spatial-vizz	Two paths: discuss lecture19-advaced-rmarkdown in detail with the whys as well, but then there is no time for lecture20-basic-spatial-vizz. Or stick with the technical details in both lectures, which allows higher probability to finish.
Class *	lecture20-basic-spatial-vizz	This lecture seldom fits into the timeframe of the class, especially if this coding class runs along with theory classes for Part I and II and serves as a supplement both in coding and understanding the material. However, if there is a mismatch, this class can be flexibly used as a substitute (e.g. when the theory class is lagging behind).

Our decisions -- you may alter

Tidyverse and not data.table. Some friends love data.table, but it seems tidyverse has become the more popular choice, especially at a starter level.
Starting with rm(list = ls()). Yes, we know. There is a strong view suggesting a project-based workflow: "If the first line of your R script is rm(list = ls()) I will come into your office and SET YOUR COMPUTER ON FIRE". We have been warned directly, too. At the same time, for beginners, this seems a good start. So we kept it for lectures 01-20, not beyond. Feel free to use a version without it.
Descriptive tables are made with `datasummary` -- it takes a bit of time to get used to, but it is nice.
All regressions (except when we start) are run with fixest. We think it is the future regression command for all uses.
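
To make these choices concrete, here is a minimal sketch of what the start of a script following them might look like. It is not copied from any lecture: the built-in `mtcars` data and the chosen variables are assumptions for illustration, and it assumes the `modelsummary` and `fixest` packages are installed.

```r
# Clear the workspace, as the scripts for lectures 01-20 do
rm(list = ls())

library(modelsummary)
library(fixest)

# Descriptive table: variables on the left, statistics on the right
datasummary(mpg + wt ~ Mean + SD + Min + Max, data = mtcars)

# Simple linear regression with heteroskedasticity-robust standard errors
reg <- feols(mpg ~ wt, data = mtcars, vcov = "hetero")
summary(reg)
```

If you prefer a project-based workflow, dropping the first line does not change anything else in the sketch.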

Our thanks

Thanks to all folks who contributed to the codebase for the course, especially Gábor Kézdi, co-author of the book. But also thanks to Zsuzsa Holler, Kinga Ritter, Ádám Víg, Jenő Pál, János Divényi, Marc Kaufmann, Gábors' and Ágoston's many students. Big thanks to Laurent Bergé, Grant McDermott and Vincent Arel-Bundock for awesome packages and all the help on coding over several years.

Found an error or have a suggestion?

Awesome. We know there are errors and bugs, or simply much better ways to do a procedure.

To make a suggestion, please open a GitHub issue here with a title containing the case study name. You may also contact us directly.
