gastonstat / stat159

Reproducible and Collaborative Statistical Data Science

STAT 159 - Reproducible and Collaborative Statistical Data Science

  • Description: This course teaches "the why and how" of reproducible and collaborative research by combining questions of good computational practice in science, open science and statistical data analysis, in the context of today's research environment. We will interleave practical topics in software engineering and statistical computing with broader discussions on elements of reproducible data analysis.

  • Details: We will rely on the R and RStudio ecosystems, but the core ideas presented here can equally be implemented with tools in Python, Julia, or any other programming language.

  • Instructor: Gaston Sanchez

  • Lecture: 3 hours of lecture per week

  • Assignments: around 6 HW assignments

  • Exams: typically one midterm exam and a final project

  • Prerequisites: Statistics 133, 134, 135

  • Policies:


Software

We will use several tools throughout this course, which means that you will have to install the following programs (if you run into any installation problems, search online or check YouTube videos first; if that doesn't work, ask the GSI or the instructor):


1. Introduction

📇 ABOUT:

We begin with the usual review of the course policies, logistics, overall expectations, topics in a nutshell, etc.

Every data analysis project goes through a cycle. At the conceptual level, we'll identify the main stages of the data analysis cycle using sports data: long jump world records, which are among the oldest standing records in athletics.


📖 READING:

  • Slides

✏️ TOPICS:

  • Introduction

    • The Data Analysis Cycle (DAC)
    • First contact with R and RStudio
  • How not to do a Data Analysis

    • Understand limitations of WYSIWYG tools
    • Advantages of using WYSIWYM tools

2. Reproducibility Crisis

📇 ABOUT:

In this module we review an infamous case of irreproducibility: the Reinhart-Rogoff debacle.


📖 READING:

  • Reinhart and Rogoff Reading Materials


✏️ TOPICS:

  • RR Case Study
    • Who are Reinhart and Rogoff (R&R)?
    • What is their affiliation?
    • About their working paper "Growth in a Time of Debt" (GTD)
      • What is the main thesis of the paper?
      • What are their main findings?
      • What are their conclusions?
    • Story behind the R&R fiasco:
      • After the publication of GTD, who tries to reproduce their work?
      • What is the story of the reproduction attempt?
      • What is the cause of the irreproducibility?

3. Introduction to Markdown

📇 ABOUT:

Markdown is a lightweight markup language, originally created by John Gruber and Aaron Swartz, that allows people "to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML)".

  • About markup languages
  • Why do we need to use a markup language (and a text editor)?
  • What is the issue with word processors?

Dynamic Documents with R

  • What is a dynamic document?
  • Dynamic documents and markup languages
  • Dynamic documents require a parser and renderer
  • In R, we have the packages "knitr", "rmarkdown", and "shiny"
  • Before knitr we had "Sweave" (with LaTeX)
  • LaTeX is still the de facto standard for scientific typesetting
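
The dynamic-document ideas listed above can be sketched in a few lines of R. This is a minimal illustration rather than course material: it assumes the "knitr" package is installed, the document text and the object names `rmd_source` and `md_output` are made up for the example, and the data comes from R's built-in `cars` dataset.

```r
library(knitr)

# Hypothetical R Markdown source, kept in a character vector so the
# example is self-contained (normally this would live in an .Rmd file).
rmd_source <- c(
  "# A tiny dynamic document",
  "",
  "The chunk below is executed when the document is knitted:",
  "",
  "```{r}",
  "mean(cars$speed)",
  "```",
  "",
  "Inline code works too: the data has `r nrow(cars)` rows."
)

# knit() parses the source, runs the R code, and weaves the results
# into plain Markdown, returned here as a single character string.
md_output <- knit(text = rmd_source, quiet = TRUE)
cat(md_output)
```

Changing the data or the code and knitting again regenerates the narrative with updated results, which is the point of keeping narrative and code in one source file.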

📖 READING:

  • Slides

✏️ TOPICS:

  • Markdown

  • R Markdown

    • Working with so-called "Dynamic Documents"
    • Weaving and Knitting
    • Combining narrative and code
