vsquared10 / DSCI6002-student

GalvanizeU-University of New Haven
Master of Science in Data Science
DSCI6002: Data Exploration, Feature Engineering, and Statistics for Data Scientists


Logistics

Instructors:

Alessandro Gagliardi	: alessandro@galvanize.com
Amy Yuan        		: amy.yuan@galvanize.com

Data Science Resident (DSR):

Stephanie Tong      	: stong1108@gmail.com

Class Location: 44 Tehama St, 3rd Floor, gU Classroom
Class Time: 1:00 - 2:20 PM, M-T-Th-F
Lab Time: 2:30 - 4:00 PM, M-T-Th-F
Communication: #gu4_sf_stats

Description of the Course

Introduction to the discipline of statistics, with a focus on querying, exploring, understanding and transforming data features for statistical and machine learning applications.

Prerequisites

Basic knowledge of Python 3, probability and statistics is expected prior to taking this course.

By the end of this course, you will be able to:

  • Write Basic SQL Queries
  • Demonstrate Python Programming Skills for Statistical Data Analysis
  • Collect Good Data According to the Purpose of the Analysis
  • Explore Data with Exploratory Data Analysis (EDA) techniques
  • Calculate Probabilities
  • Identify and Manipulate Random Variables / Distributions
  • Make Statistical Inference from Sample to Population
  • Model Quantitative Response Variable with Linear Regression
  • Model Binary Response Variable with Logistic Regression
  • Model Time Series Data

These learning objectives correspond with standards in the Galvanize Mastery Tracker and will be the basis for your evaluation in this course.

Supplementary Materials:

Textbook: OpenIntro Statistics by Diez, Barr, and Çetinkaya-Rundel. This book is available for free from OpenIntro.org.

Optional readings:

  • Think Stats, by Allen B. Downey
  • Introductory Econometrics, by Jeffrey M. Wooldridge
  • Learning SQL, by Alan Beaulieu
  • Python for Data Analysis, by Wes McKinney
  • Cartoon Guide to Statistics, by Larry Gonick & Woollcott Smith

Online Resources Include:

Mastery Tracking:

At Galvanize, Mastery Tracking is used to evaluate student performance across standards in real time and to issue final grades at the end of the course.

Standards are the core competencies of Galvanize graduates: the knowledge, skills, and habits every student should possess by the time they graduate. Standards are measurable, student-focused outcomes that state what students are expected to be able to do by the end of the course. Instructors continually provide formative assessments to monitor student performance and inform their teaching practices. Students who are below ‘mastery’ on a standard are expected to continue practicing that standard (with the instructor’s guidance) until they reach mastery. What matters is that students eventually learn the material, not how many attempts it takes to get there.

Mastery Tracking uses a 4-point scale. Every student is expected to achieve a 3 (mastery) across all standards by the time they complete the course. 1s and 2s indicate areas where students need further practice and/or interventions to reach mastery. Unlike grades for individual assignments, mastery tracking scores can be adjusted according to performance at any point up to and including the final exam.

4 pt Scale:

  1. Falling far below mastery - Meeting none of the success criteria or has egregious errors
  2. Approaching mastery - Meeting some of the success criteria
  3. Mastery - Meeting all of the success criteria
  4. Exceeding mastery - Truly exceeding expectations and demonstrating proficiency at a higher level of rigor

Course Requirements

Class Attendance:

Attendance is mandatory. It is the responsibility of the student to attend all classes. If you have to miss class due to sickness or other circumstances, please notify your instructor by Slack in advance. Supporting documents (doctor’s notes) should accompany absences due to sickness. Each excused absence beyond two, and any unexcused absence, will lower your overall course grade by ⅓ of a letter grade (A -> A-, A- -> B+). It is at the instructor’s discretion to deny any absence or to allow students to make up assignments, exams, etc. missed because of an absence.
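To make the grade penalty concrete, here is a minimal sketch of the arithmetic. It is not an official calculator: the letter scale (no A+, nothing between D- and F), the assumption that each unexcused absence costs exactly one step, and the function names are all illustrative choices that the policy above does not spell out.

```python
# Letter grades from highest to lowest; each step down is 1/3 of a letter.
# ASSUMPTION: this particular scale is illustrative, not taken from the syllabus.
GRADES = ["A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]

FREE_EXCUSED_ABSENCES = 2  # from the policy: only excused absences beyond 2 count


def penalty_steps(excused: int, unexcused: int) -> int:
    """Number of 1/3-letter steps lost to absences.

    ASSUMPTION: each unexcused absence costs one step; the policy wording
    does not state this explicitly.
    """
    extra_excused = max(0, excused - FREE_EXCUSED_ABSENCES)
    return extra_excused + unexcused


def apply_attendance_penalty(grade: str, excused: int, unexcused: int) -> str:
    """Lower `grade` by the absence penalty, stopping at the lowest grade."""
    start = GRADES.index(grade)
    end = min(start + penalty_steps(excused, unexcused), len(GRADES) - 1)
    return GRADES[end]


if __name__ == "__main__":
    # Three excused absences: one more than the two allowed, so one step down.
    print(apply_attendance_penalty("A", excused=3, unexcused=0))   # A-
    # Two excused (no penalty) plus one unexcused: one step down.
    print(apply_attendance_penalty("A-", excused=2, unexcused=1))  # B+
```

The instructor's discretion described above always takes precedence over a mechanical calculation like this one.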

Participation

You must also show up prepared. Each person is important to the dynamic of the class, and therefore students are required to participate in class activities. Expect to be "cold called". I call on students at random not to put you on the spot but to keep you engaged in the material at all times. All preparation materials (book readings, videos and websites) should be covered prior to each class session. They are always required unless explicitly labeled as optional.

Electronic Devices

Cell phones can be highly disruptive to the class environment. Please silence these devices. In the event of an emergency, please step out of the classroom.

Laptops, while necessary for labs, can be a distraction during the lecture portion of the lesson. At the discretion of individual instructors, laptops may be allowed for portions of the class. Otherwise, laptops must remain closed during the quiz and lecture portions of the class.

RAT

The Readiness Assessment Tool (RAT) is intended to ensure that students have understood the material assigned between classes. Students unsure of their comprehension should bring questions to be addressed before the individual RAT. After each student has answered all the questions on the RAT individually, the class will split into teams, which will review their answers and attempt to reach consensus. Misunderstandings are often better addressed by peers. It is important that all members of each team understand the solution provided by their team. Finally, the class will go over the answers together, hopefully resolving any remaining misunderstandings before proceeding with the projects.

Lab Exercises

Participation in and completion of lab exercises is a requirement for this course. Each unit includes exercises to provide practice applying techniques discussed in class and to reveal deficiencies in understanding in preparation for skills tests. These exercises may be done in pairs or small groups. Collaboration must be indicated when turning in your assignment. Unless otherwise stated, answers to the lab exercises are due before the following class (e.g. labs assigned on Tuesday are due at 1pm on Thursday; labs assigned on Friday are due at 1pm the following Monday).
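The due-date rule can be sketched as a small date calculation. This is only an illustration under stated assumptions: the function name `lab_due` is hypothetical, holidays and one-off schedule changes (such as Labor Day) are ignored, and the year 2016 in the usage lines is inferred from the weekdays given for the project dates rather than stated anywhere in the syllabus.

```python
from datetime import datetime, time, timedelta

# Class meets Monday, Tuesday, Thursday, Friday (Monday == 0 in datetime).
CLASS_WEEKDAYS = {0, 1, 3, 4}
CLASS_START = time(13, 0)  # 1:00 PM


def lab_due(assigned: datetime) -> datetime:
    """Return the start of the first class session after the assignment day.

    ASSUMPTION: holidays and rescheduled sessions are not accounted for.
    """
    day = assigned.date() + timedelta(days=1)
    while day.weekday() not in CLASS_WEEKDAYS:
        day += timedelta(days=1)
    return datetime.combine(day, CLASS_START)


if __name__ == "__main__":
    # ASSUMPTION: 2016, the year in which September 26 falls on a Monday.
    tuesday_lab = datetime(2016, 9, 27, 14, 30)
    friday_lab = datetime(2016, 9, 30, 14, 30)
    print(lab_due(tuesday_lab))  # 2016-09-29 13:00:00 (Thursday)
    print(lab_due(friday_lab))   # 2016-10-03 13:00:00 (Monday)
```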

Midterm Project

There will be one midterm project that will assess your comprehension of the preliminary material of the course, including data manipulation, generation, probability, and basic statistics. This project will be assigned on Monday, September 26, and will be due the following Monday, October 3, before class, at 1pm. Unlike the lab exercises, the midterm project must be completed individually.

Final Project

There will be one final summative project that will assess your comprehension of the material throughout the term. This will be assigned during the seventh week of class and will be due at noon on Thursday, October 20, of finals week. Unlike the lab exercises, the final project must be completed individually. Collusion, plagiarism, and cheating will not be tolerated.

Here is a tentative rubric for the final project. The four columns refer to the same four levels of mastery described above. Keep this rubric in mind as we go through the course and make sure that you can do these things so that there will be no surprises when you receive the assignment.

| Final Project | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| Write Basic SQL Queries | Could not read data from database | Read data but queries awkward and inefficient | Read and write data to/from database | Used SQL to perform analysis prior to loading it into Python |
| Demonstrate Python Programming Skills for Statistical Data Analysis | Python code does not run | Python code needlessly inefficient and/or complex | Python code does what it is meant to do efficiently and is easy to understand | Python code elegant and clear adhering to multiple style standards |
| Collect Good Data According to the Purpose of the Analysis | Did not discuss data collection | Discussed data collection but made mistakes | Accurately stated assumptions about data collection and identified possible problems | Proposed alternative approach to account for possible data collection issues |
| Explore Data with Exploratory Data Analysis (EDA) techniques | No plots or visualization techniques employed | Plots are not interpreted or misinterpreted | Multiple plots provided with interpretation | Used a plotting library not covered in class and explain the results clearly |
| Calculate Probabilities | Did not calculate probabilities | Provided some summary statistics, but insufficient | Provided summary statistics and discussed their relevance | Employed ECDF to explore the probability space |
| Identify and Manipulate Random Variables / Distributions | Did not discuss distributions | Incorrectly identified the distribution family | Correctly identified the distribution family | Verified that the distribution was as perceived |
| Make Statistical Inference from Sample to Population | Did not discuss hypothesis testing | Incorrectly defined null and/or alternative hypothesis | Correctly established null and alternative hypotheses | Explored more than one possible alternative hypothesis |
| Model Quantitative Response Variable with Linear Regression | Did not attempt multiple regression | Attempted multiple regression but somehow got it wrong or misinterpreted results | Applied multiple regression and interpreted results correctly | Applied multiple approaches to multiple regression (i.e. feature selection) and discussed results |
| Model Binary Response Variable with Logistic Regression | No attempt to use the GLM | Attempted to use GLM but got it wrong | Used GLM and got meaningful results | Detailed analysis comparing multiple GLM families |
| Model Time Series Data | No attempt to model time series data | Attempted to model time series data but got it wrong | Modeled time series data using AR and/or MA and got meaningful results | Applied and discussed ARIMA, ARMAX, or some other more advanced technique |

Grading

The breakdown of the grade will be as follows:

  • Mastery Tracker: 20%
  • Midterm Project: 15%
  • Final Project: 40%
  • Labs: 20%
  • Participation: 5%
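As a rough illustration of how the weights above combine, the following sketch computes a weighted course score. It assumes every component has already been converted to a 0-100 scale; how the 4-point Mastery Tracker maps onto that scale is not specified in this syllabus, and the names `WEIGHTS` and `course_score` are hypothetical.

```python
# Percentage weights from the grading breakdown above (they sum to 100).
WEIGHTS = {
    "mastery_tracker": 20,
    "midterm_project": 15,
    "final_project": 40,
    "labs": 20,
    "participation": 5,
}


def course_score(scores: dict) -> float:
    """Weighted average of component scores.

    ASSUMPTION: each score is already on a 0-100 scale.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # Made-up example scores, for illustration only.
    example = {
        "mastery_tracker": 90,
        "midterm_project": 80,
        "final_project": 85,
        "labs": 95,
        "participation": 100,
    }
    print(course_score(example))  # 88.0
```

Because the final project carries 40% of the weight, a 10-point change there moves the course score by 4 points, twice the effect of the same change in labs.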

Academic Integrity

GalvanizeU is an academic community based on the principles of honesty, trust, fairness, respect, and responsibility. Academic integrity is a core University value which ensures respect for the academic reputation of the University, its students, faculty and staff, and the degrees it confers.

The University expects that all students will learn in an environment where they work independently in the pursuit of knowledge, conduct themselves in an honest and ethical manner and respect the intellectual work of others. Each member of the University community has a responsibility to be familiar with the definitions contained in, and adhere to, the Academic Integrity Policy. Violations of the Academic Integrity Policy include, but are not limited to:

  • Cheating -- e.g. reading off of your neighbor's exam.
  • Collusion -- Group work is encouraged except on evaluative exams. When working together (on exercises, etc.), acknowledgment of collaboration is required.
  • Plagiarism -- Reusing code presented in labs and lectures is expected, but copying someone else's solution to a problem is a form of plagiarism (even if you change the formatting or variable names).
  • Facilitating academic dishonesty

Students who are dishonest in any class assignment or exam will receive an "F" in this course. More information regarding UNH’s official academic integrity policies is outlined here. If you are ever in doubt, ask the instructor for clarification.

Tentative Schedule

  1. Getting Started
    1. Introduction & Precourse Review
    2. Working with Data
    3. Exploratory Data Analysis
    4. Introduction to Linear Regression
  2. Probability
    1. LABOR DAY: No Class
    2. Introduction to Probability
    3. Wednesday @ 3pm: Bayes' Theorem (no class Thursday)
    4. Introduction to Random Variables
  3. Distributions
    1. Review
    2. Discrete Probability Distributions
    3. Continuous Probability Distributions
    4. Jointly Distributed Random Variables
  4. Foundations for Inference
    1. Likelihood
    2. Point Estimation
    3. Central Limit Theorem
    4. Confidence Intervals
  5. Hypothesis Testing
    1. Review
    2. Hypothesis Testing I
    3. Hypothesis Testing II
    4. Chi-Square Tests
  6. Multiple Regression
    1. Regression Redux
    2. Multiple Regression I
    3. Multiple Regression II
    4. Regression Diagnostics
  7. Advanced Topics
    1. Introduction to Time Series
    2. Moving Average Models and ARMA
    3. Introduction to Logistic Regression
    4. The Generalized Linear Model (GLM)
  8. Finals Week
    1. Review
