nonsignificantp / stats-roadmap

An opinionated and personal collection of books, courses and materials for learning epidemiology, statistics and machine learning.

Stats road-map

Hi, I'm a junior data scientist coming from clinical research. I gained experience with experimental design, causal inference and basic statistical analysis while becoming a physician at Buenos Aires University. I've always been inclined towards epidemiology and statistics, and I developed a profound love of stats and math as an adult. It was only a matter of time before I found out about data science and machine learning. Since then, I haven't stopped reading, watching lectures, programming and practicing with datasets. Coming from a field outside computer science or statistics gave me a different approach to learning the needed skills, but also a different set of problems to solve along the way. Because of this, I decided to keep a record of all the materials that helped me reach those 'eureka' moments. I hope this list helps you too. Feel free to push changes or ideas!

Books

The Hundred-Page Machine Learning Book: A beautifully crafted book with Python examples and a friendly introduction to the mathematics behind many algorithms. It also walks you through the workflow of problem solving in ML.

An introduction to statistical learning with applications in R: A good second step before reading one of the bibles of machine learning.

The elements of statistical learning: A masterpiece that sooner or later one has to read. Above the level of The Hundred-Page Machine Learning Book but below Machine Learning by Murphy.

Machine Learning: A probabilistic perspective: Amazing book that starts with stats and probability concepts. Requires a strong background in math and related concepts. A good book to start, stop and come back to from time to time, only to realize that you understand more than the previous time.

Practical Regression and Anova using R: If you have ever wondered how to fit a regression step by step by hand, this is the place to go. It also explains every part of the output summary and how each one is calculated.

Forecasting: Principles and Practice: A comprehensive introduction to forecasting methods using R.

Regression modeling strategies: Commonly used clinical prediction models. The author is a clinical researcher, so this is stats through the eyes of a doctor.

R for Data Science: Hadley Wickham introduces R and the tidyverse packages in an easy-to-read and very comprehensive book. I love everything that Wickham does; check out his classes on YouTube too!

Interpretable Machine Learning

Experimental design & epidemiology

Causal inference: What is causal inference and how can it be achieved? Hernan uses logic, models and directed acyclic graphs to answer this question.

Chapters

Essentials of Clinical Research: Most of the chapters are fully available on ResearchGate.

  • Chapter 17 - Bias, Confounding, and Effect Modification: A good place to start when trying to learn these concepts.
  • Chapter 18 - It's all about uncertainty: This chapter is aimed at providing the foundation for common sense issues that underlie why and what statistics is.

Data Analysis Using Regression and Multilevel/Hierarchical Models

  • Chapter 25 - Missing data imputation: Types of missing values and different imputing models to deal with them.

Articles

Count data

Visualizing count data using rootograms: Rootograms are an awesome tool for visualizing the under/over-dispersion phenomena seen in Poisson and Negative Binomial models for count data.

Regression models for count data in R: A review of the most commonly used models for count data using R.

Hurdle models

Getting started with hurdle models: Step-by-step introduction to fitting and assessing hurdle models using R.

Interpreting hurdle models: Reading the output of hurdle models by using STATA output as an example.

Gradient boosting

How to explain gradient boosting by Parr and Howard: A very concise guide in 3 parts that works as an introduction to GB fundamentals. Math notation is used when explaining the models, but the guide doesn't go too deep into it.

Generalized Estimating Equations

Dependent samples from STAT 504

Hypothesis testing

Tutorial on Fisher's exact test

Causal inference

Using causal diagrams to understand problems of confounding and selection bias: A heads-up on DAGs, confounders and colliders.

Missing data & imputation

Reducing bias in treatment effect estimation in observational studies suffering from missing data

GAM

Overview GAMM analysis of time series data

Doing magic and analyzing seasonal time series with GAM (Generalized Additive Model) in R

Geo-spatial analysis with generalized additive models

Videos & Courses

Policy Analysis Using Interrupted Time Series: edX course on how to perform interrupted time series and regression discontinuity analyses.

Causal Diagrams: Draw your assumptions before your conclusions: Learn to use directed acyclic graphs for causal inference. Helpful for identifying and illustrating the presence of confounders, mediators and colliders.

Amazon machine learning course for data scientists: A must-do course that comprehensively introduces the fundamental math behind regression.

YouTube

StatQuest with Josh Starmer: A video series explaining multiple stats concepts in a simple yet thorough way.

JBStatistics: Multiple concepts explained by problem solving.

CRISP-DM: The dominant process for data mining: A step-by-step guide to the cross-industry standard process for data mining. CRISP-DM is a guideline that helps data scientists structure problems in a business framework and make solutions easier to communicate to the target audience.

Stata learner: A HarvardX course that was previously featured on edX. It deals mostly with theoretical aspects of experimental design and how to perform data analysis using STATA. If you are looking for an introduction to experimental design, give this video series a shot.

Understanding the Chapman–Kolmogorov equation: Visualizing Chapman-Kolmogorov by using Markov chains on time series.

Mathematics for Machine Learning Full Course - Linear Algebra and Multivariate Calculus

Graph theory playlist

Liz Sander | Evolutionary Algorithms Perfecting the Art of "Good Enough": An introduction to genetic and evolutionary algorithms.

Multi-Objective Problems

Creating correct and capable classifiers - Ian Ozsvald

Selection bias: The elephant in the room - Lucas Bernardi

Bayes

Vincent D Warmerdam - The Duct Tape of Heroes Bayesian statistics

Stack Overflow & Reddit

On Poisson regression models to estimate relative risk for binary outcomes.

Null hypothesis of Chi-square test for independence.

Explaining what a singular matrix is

Relative risk standard errors and confidence interval

Guidelines for writing a data analysis report

Programming & Notebooks

Titanic Database - Technical Analysis

Geo-spatial analysis

Introduction to spatial analysis in R: Using sf and raster packages.

Spatial analysis in R with the sf package

Tidy spatial data in R: using dplyr, tidyr, and ggplot2 with sf
