STA 380: Predictive Modeling

Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

Office hours

On Friday, July 29th, I will hold office hours from 10am to 12pm (normal class time). I will start in my office (CBA 6.478), but if a lot of folks show up at once, we'll move to the regular classroom.

On Tuesday (8/2) and Thursday (8/4), I will hold office hours from 9am to 10am in CBA 6.478.

Scribe notes and exercises

To submit your scribe report, please e-mail me (james.scott at mccombs.utexas.edu) a link to a .pdf or .md file on your own GitHub page. Do not send an attachment.

You can find the up-to-date collection of scribe notes here.

The first set of exercises is available here.

Topics

(0) The data scientist's toolbox

Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

Readings:

(1) Foundations of probability

Basic probability, and some fun examples. Random variables, probability distributions, expected value. Joint, marginal, and conditional probability. Independence. Law of total probability. Bayes' rule.

Readings:

excerpts from an in-progress book on probability.

Some optional stuff:

Bayes and the search for Air France 447.
YouTube video on Bayes and the USS Scorpion.
Pretty-but-wrong visualization by the New York Times on the long-term failure rates of various contraceptive methods, together with James Trussell's explanation of why the 10-year numbers are wrong. His quote is about halfway down the page. A great example where assuming independence can lead to trouble!
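
To see where the independence assumption enters a calculation like this, here is a small worked example. The annual failure probability used below is a made-up number for illustration, not a figure from the New York Times piece or from Trussell. If a method fails with probability $p$ in any single year, and the years are treated as independent with the same $p$ for everyone, then the probability of at least one failure in $n$ years is

$$
P(\text{at least one failure in } n \text{ years}) = 1 - (1 - p)^n .
$$

With the hypothetical value $p = 0.05$ and $n = 10$, this gives $1 - 0.95^{10} \approx 0.40$. The formula is only as good as its assumptions: if failure risk differs from person to person, or changes with experience, then the years are not independent draws with a common $p$, and compounding a single annual rate in this way can be misleading.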

(2) Exploratory analysis

Contingency tables; basic plots (scatterplot, boxplot, histogram); lattice plots; basic measures of association (relative risk, odds ratio, correlation, rank correlation).

Scripts and data:

gdpgrowth.R and gdpgrowth.csv
titanic.R and TitanicSurvival

Readings:

excerpts from my course notes on statistical modeling
NIST Handbook, Chapter 1.
R walkthroughs on basic EDA: contingency tables, histograms, and scatterplots/lattice plots.
Bad graphics
Good graphics: scan through some of the New York Times' best data visualizations

(3) Resampling methods

The bootstrap and the permutation test; joint distributions; using the bootstrap to approximate value at risk (VaR).

Scripts:

Readings:

ISL Section 5.2 for a basic overview.
These notes on bootstrapping and the permutation test.
Section 2 of these notes, on bootstrap resampling. You can ignore the stuff about utility if you want.
This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
Another R walkthrough on the permutation test in a simple 2x2 table.
Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.

Optionally, Shalizi (Chapter 6) has a much lengthier treatment of the bootstrap, should you wish to consult it.
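
As a rough illustration of the bootstrap idea in this section, here is a minimal sketch of resampling to estimate the variability of a sample mean. It is written in Python with only the standard library, whereas the course scripts and walkthroughs are in R, so treat it as a sketch of the idea rather than the course's own code. The `returns` data, the function names, and the choice of 2000 resamples are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_means(sample, n_boot=2000, seed=0):
    """Return n_boot bootstrap replicates of the sample mean.

    Each replicate resamples len(sample) observations with replacement.
    """
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]


def percentile_interval(replicates, level=0.90):
    """Simple percentile interval read off the sorted bootstrap replicates."""
    ordered = sorted(replicates)
    lo_idx = int((1 - level) / 2 * len(ordered))
    hi_idx = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


# Hypothetical daily returns, invented purely for illustration.
returns = [0.012, -0.004, 0.007, -0.015, 0.003,
           0.021, -0.009, 0.005, 0.001, -0.002]

reps = bootstrap_means(returns)
boot_se = statistics.stdev(reps)        # bootstrap standard error of the mean
lo, hi = percentile_interval(reps)      # 90% percentile interval

print(f"bootstrap SE of the mean: {boot_se:.4f}")
print(f"90% percentile interval: ({lo:.4f}, {hi:.4f})")
```

The same loop works for statistics other than the mean: replace `statistics.fmean` with, say, a lower quantile of simulated portfolio returns to get a bootstrap approximation in the spirit of value at risk.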

(4) Clustering

Basics of clustering; K-means clustering; hierarchical clustering.

Scripts and data:

Readings:

ISL Sections 10.1 and 10.3
Elements Chapter 14.3 (more advanced)
K-means examples: a few stylized examples to build your intuition for how k-means behaves.
Hierarchical clustering examples: ditto for hierarchical clustering.
K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
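
For a sense of what the k-means++ initialization does, here is a minimal sketch of the seeding step only: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. It is written in standard-library Python, whereas the course scripts are in R. The function names and the toy `points` are assumptions for the example, and the usual k-means iterations would still need to run after this step.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Degenerate case: every point coincides with a chosen center.
            centers.append(rng.choice(points))
        else:
            # Far-away points are proportionally more likely to be picked.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical toy data: three loose groups in the plane.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0),
          (10.0, 0.0), (10.1, 0.2), (9.9, -0.1)]

centers = kmeans_pp_init(points, k=3)
print(centers)
```

Because a point that is already a center has squared distance zero, it gets zero weight and cannot be picked twice, which is part of why this seeding tends to spread the initial centers out.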

(5) Latent features and structure

Principal component analysis (PCA). If time: canonical correlation analysis; multi-dimensional scaling.

Scripts and data:

pca_2D.R
pca_intro.R
congress109.R, congress109.csv, and congress109members.csv
gasoline.R and gasoline.csv
FXmonthly.R, FXmonthly.csv, and currency_codes.txt
cca_intro.R, mmreg.csv, and mouse_nutrition.csv

Readings:

ISL Section 10.2 for the basics
Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor analysis, beyond what we covered in class.
Elements Chapter 14.5 (more advanced)

(6) Text data

Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).

Scripts and data:

Readings:

Stanford NLP notes on vector-space models of text, TF-IDF weighting, and so forth.
[Using the tm package](http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) for text mining in R.
Dave Blei's survey of topic models.
A pretty long blog post on naive-Bayes classification.

(7) Miscellaneous

Coverage of these topics will depend on the time available. Possibilities include: anomaly detection; label propagation; learning association rules; graph partitioning; partial least squares.

Scripts and data:

playlists.R and playlists.csv

Readings:

Pradeep Ravikumar's notes on association rule mining
