Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.
On Friday, July 29th, I will hold office hours from 10am to 12pm (normal class time). I will start in my office (CBA 6.478), but if a lot of folks show up at once, we'll move to the regular classroom.
On Tuesday (8/2) and Thursday (8/4), I will hold office hours from 9-10 AM in CBA 6.478.
To submit your scribe report, please e-mail me (james.scott at mccombs.utexas.edu) a link to a .pdf or .md file on your own GitHub page. Do not send an attachment.
You can find the up-to-date collection of scribe notes here.
The first set of exercises is available here.
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.
Readings:
- a few introductory slides
- Jeff Leek's guide to sharing data
- Introduction to RMarkdown
- Introduction to GitHub
Basic probability, and some fun examples. Random variables, probability distributions, expected value. Joint, marginal, and conditional probability. Independence. Law of total probability. Bayes' rule.
Readings:
- excerpts from an in-progress book on probability.
Some optional stuff:
- Bayes and the search for Air France 447.
- YouTube video on Bayes and the USS Scorpion.
- Pretty-but-wrong visualization by the New York Times on the long-term failure rates of various contraceptive methods, together with James Trussell's explanation of why the 10-year numbers are wrong. His quote is about halfway down the page. A great example where assuming independence can lead to trouble!
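
As a small illustration of the law of total probability and Bayes' rule from this unit, here is a minimal R sketch for a diagnostic test. The prevalence, sensitivity, and false-positive rate are hypothetical numbers chosen only for illustration; they do not come from the readings.

```r
# Hypothetical inputs (assumptions for illustration only)
prior <- 0.01   # P(D): prevalence of the condition
sens  <- 0.95   # P(+ | D): sensitivity of the test
fpr   <- 0.05   # P(+ | not D): false-positive rate

# Law of total probability: P(+) = P(+ | D) P(D) + P(+ | not D) P(not D)
p_pos <- sens * prior + fpr * (1 - prior)

# Bayes' rule: P(D | +) = P(+ | D) P(D) / P(+)
posterior <- sens * prior / p_pos
print(round(posterior, 3))
```

With these made-up inputs the posterior is about 0.161: even after a positive result from a fairly accurate test, the condition remains unlikely because the prior is so low.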
Contingency tables; basic plots (scatterplot, boxplot, histogram); lattice plots; basic measures of association (relative risk, odds ratio, correlation, rank correlation)
Scripts and data:
Readings:
- excerpts from my course notes on statistical modeling
- NIST Handbook, Chapter 1.
- R walkthroughs on basic EDA: contingency tables, histograms, and scatterplots/lattice plots.
- Bad graphics
- Good graphics: scan through some of the New York Times' best data visualizations
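
To make two of the measures of association above concrete, here is a minimal sketch that computes a relative risk and an odds ratio from a 2x2 contingency table. The counts are invented for illustration and are not taken from any course data set.

```r
# Hypothetical 2x2 table: rows = exposure group, columns = outcome
tab <- matrix(c(30, 70,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("yes", "no")))

# Risk of the outcome within each exposure group
risk <- tab[, "yes"] / rowSums(tab)

# Relative risk: ratio of the two risks
rel_risk <- risk[["yes"]] / risk[["no"]]

# Odds of the outcome within each group, and their ratio
odds <- tab[, "yes"] / tab[, "no"]
odds_ratio <- odds[["yes"]] / odds[["no"]]

print(c(relative_risk = rel_risk, odds_ratio = odds_ratio))
```

The two measures agree closely only when the outcome is rare in both groups; here the outcome is common enough that the odds ratio is noticeably larger than the relative risk.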
The bootstrap and the permutation test; joint distributions; using the bootstrap to approximate value at risk (VaR).
Scripts:
Readings:
- ISL Section 5.2 for a basic overview.
- These notes on bootstrapping and the permutation test.
- Section 2 of these notes, on bootstrap resampling. You can ignore the stuff about utility if you want.
- This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
- Another R walkthrough on the permutation test in a simple 2x2 table.
- Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.
Optionally, Shalizi (Chapter 6) has a much lengthier treatment of the bootstrap, should you wish to consult it.
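
For orientation, a bare-bones version of the bootstrap for a sample mean might look like the sketch below. It uses simulated data as a stand-in for a real sample; the sample size, the number of resamples `B`, and the percentile interval are illustrative choices, not necessarily those used in the class walkthroughs.

```r
set.seed(1)

# Simulated stand-in for an observed sample (assumption for illustration)
x <- rexp(100, rate = 1 / 10)

# Resample the data with replacement B times, recomputing the mean each time
B <- 2000
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

# Bootstrap standard error and a simple 95% percentile interval
se_boot <- sd(boot_means)
ci <- quantile(boot_means, c(0.025, 0.975))

print(se_boot)
print(ci)
```

The same pattern carries over to other statistics: replace `mean` with the quantity of interest (for example, a low quantile of simulated portfolio returns when approximating value at risk).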
Basics of clustering; K-means clustering; hierarchical clustering.
Scripts and data:
Readings:
- ISL Sections 10.1 and 10.3
- Elements Chapter 14.3 (more advanced)
- K means examples: a few stylized examples to build your intuition for how k-means behaves.
- Hierarchical clustering examples: ditto for hierarchical clustering.
- K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
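
As a quick point of reference, the sketch below runs both k-means and hierarchical clustering using only base R and the built-in `iris` data (a stand-in, not one of the course data sets). Note that base R's `kmeans` uses random initialization rather than k-means++; the `nstart = 25` argument simply reruns it from several random starts and keeps the best fit.

```r
set.seed(1)

# Center and scale the four numeric columns so no variable dominates the distances
X <- scale(iris[, 1:4])

# K-means with 3 clusters, restarted from 25 random initializations
fit <- kmeans(X, centers = 3, nstart = 25)

# Compare the cluster labels to the known species (for illustration only)
print(table(fit$cluster, iris$Species))

# Hierarchical clustering with average linkage, cut into 3 groups
hc <- hclust(dist(X), method = "average")
groups <- cutree(hc, k = 3)
print(table(groups))
```

The choices of 3 clusters and average linkage are assumptions made for the example; in practice both deserve some experimentation.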
Principal component analysis (PCA). If time: canonical correlation analysis; multi-dimensional scaling.
Scripts and data:
- pca_2D.R
- pca_intro.R
- congress109.R, congress109.csv, and congress109members.csv
- gasoline.R and gasoline.csv
- FXmonthly.R, FXmonthly.csv, and currency_codes.txt
- cca_intro.R, mmreg.csv, and mouse_nutrition.csv
Readings:
- ISL Section 10.2 for the basics
- Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor analysis, beyond what we covered in class.
- Elements Chapter 14.5 (more advanced)
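
A minimal PCA sketch, separate from the course scripts listed above, using base R's `prcomp` on the built-in `USArrests` data. Scaling the variables first is an assumption that makes sense here because the columns are measured in different units.

```r
# PCA on standardized variables
pc <- prcomp(USArrests, scale. = TRUE)

# Loadings: each column is a principal component direction
print(round(pc$rotation, 2))

# Proportion of variance explained by each component
pve <- pc$sdev^2 / sum(pc$sdev^2)
print(round(pve, 3))

# Scores: the data projected onto the first two components
scores <- pc$x[, 1:2]
print(head(scores))
```

Plotting `scores` gives the usual two-dimensional summary of the data, and `pve` indicates how much information that summary retains.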
Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
Scripts and data:
- textutils.R
- nyt_stories.R and selections from the New York Times.
- tm_examples.R and selections from the Reuters newswire.
- naive_bayes.R
- simple_mixture.R
- congress109_topics.R
Readings:
- Stanford NLP notes on vector-space models of text, TF-IDF weighting, and so forth.
- [Using the tm package](http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) for text mining in R.
- Dave Blei's survey of topic models.
- A pretty long blog post on naive-Bayes classification.
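
To show what TF-IDF weighting does mechanically, here is a small base-R sketch on three toy documents. The documents are made up, and the weighting uses one common convention (raw term counts times `log(N / df)`); the tm package and the Stanford notes describe several variants, so treat this as one option rather than the definition used in class.

```r
# Toy corpus (invented for illustration)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat on the log",
          d3 = "cats and dogs")

# Tokenize on spaces and build the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Term-frequency matrix: rows are documents, columns are terms
tf <- t(vapply(tokens,
               function(tk) as.numeric(table(factor(tk, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

# Document frequency and inverse document frequency
df  <- colSums(tf > 0)
idf <- log(nrow(tf) / df)

# TF-IDF: scale each column of term counts by that term's IDF
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 2))
```

Terms that appear in many documents get a small IDF and are down-weighted, while terms concentrated in a single document are emphasized.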
Coverage of these topics will depend on the time available. Possibilities include: anomaly detection; label propagation; learning association rules; graph partitioning; partial least squares.
Scripts and data:
Readings: