trungmnguyen / t81_577_data_science

T81 577 Applied Data Science for Practitioners

Washington University in St. Louis

Instructor: Asim Banskota

Spring 2020, Wednesday, 6:00 PM - 9:00 PM, Cortex COLLAB Medium Classroom, 303-2

Course Description

Organizations are rapidly transforming the way they ingest, integrate, store, and serve data and perform analytics. In this course, students will learn the steps involved in designing and implementing data science projects. Topics include ingesting and parsing data from various sources, dealing with messy and missing data, transforming and engineering features, building and evaluating machine learning models, and visualizing results. Using Python-based tools such as NumPy, Pandas, and Scikit-learn, students will complete a practical data science project that covers the entire design and implementation process. Students will also become familiar with best practices and current trends in data science, including code documentation, version control, reproducible research, pipeline automation, and cloud computing. Upon completing the course, students will have data science knowledge and skills that they can apply from day one on the job.

Syllabus

Week Content
Week 1
1/15/2020
Introductions
Assignment 1.1: Install Anaconda and test Jupyter Notebook
Assignment 1.2: Set up an AWS account, install the AWS client, start an EC2 instance, and create an S3 bucket
Week 2
1/22/2020
Python Fundamentals
Assignment 2: Programming practice
Week 3
1/29/2020
Coding Best Practices in Data Science
  • 3.1. Version control
  • 3.2. Code documentation
  • 3.3. Packaging code
  • 3.4. Tools for Python code quality
Assignment 3.1: Version control exercise with Git
Assignment 3.2: Exercise on code documentation and enforcing standards
Week 4
2/5/2020
Modeling Overview
  • 4.1. Types of models
    • 4.1.1. Descriptive/Prescriptive/Predictive
    • 4.1.2. Statistical vs Machine learning
    • 4.1.3. Blackbox vs Explainable
  • 4.2. Model development steps
    • 4.2.1. Framing questions
    • 4.2.2. Data ingestion and wrangling
    • 4.2.3. Data Preprocessing
    • 4.2.4. Model fitting and evaluation
    • 4.2.5. Model deployment
    • 4.2.6. Performance monitoring and redevelopment
Quiz: Modeling overview
Week 5
2/12/2020
Accessing Data
  • 5.1. Introduction to RESTful APIs
  • 5.2. Accessing data from an API using the requests module and Postman
  • 5.3. Overview of JSON-formatted data
  • 5.4. Parsing JSON data
  • 5.5. Importing data from commonly used file formats
  • 5.6. Reading data from PostgreSQL database
Assignment 4: Finalization of final project topic and data set (Not graded)
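
As an illustration of topics 5.3 and 5.4, here is a minimal sketch of parsing JSON-formatted data with Python's standard `json` module. The payload and its field names (`results`, `id`, `name`, `tags`) are invented for this example and are not taken from the course materials; a real API response would have its own structure.

```python
import json

# Hypothetical API response body; every field name here is an assumption.
raw = (
    '{"results": ['
    '{"id": 1, "name": "Ada", "tags": ["a", "b"]}, '
    '{"id": 2, "name": "Grace", "tags": []}'
    ']}'
)


def parse_records(text):
    """Parse a JSON string and flatten each record into a plain dict."""
    payload = json.loads(text)
    rows = []
    for item in payload.get("results", []):
        rows.append(
            {
                "id": item["id"],
                "name": item["name"],
                # Nested lists are summarized rather than kept as-is.
                "n_tags": len(item.get("tags", [])),
            }
        )
    return rows


records = parse_records(raw)
```

Flat dictionaries like these are convenient because a list of them can be handed directly to a tabular tool for later wrangling.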
Week 6
2/19/2020
NumPy/Pandas for Data Munging/Wrangling
  • 6.1. Pandas and NumPy data structures
  • 6.2. Querying and reading data
  • 6.3. Reshaping, Indexing, slicing, and filtering data
  • 6.4. Join, Merge, and Aggregation
  • 6.5. Vectorization
  • 6.6. Basic statistics and plotting
Assignment 5: Data wrangling with Numpy and Pandas
Week 7
2/26/2020
Exploratory Data Analysis (EDA)
  • 7.1. Categorical vs numeric features
  • 7.2. Datatype conversion
  • 7.3. Sampling
  • 7.4. Data summary and distribution
  • 7.5. Patterns in data
  • 7.6. Data visualization using matplotlib, seaborn, and Bokeh
  • 7.7. Anomaly/outlier detection
Assignment 6: Patterns in data: Visualization and data summary
Week 8
3/4/2020
Data Preprocessing
  • 8.1. Basics (select, filter, removal of duplicates)
  • 8.2. Data Transformation
  • 8.3. Standardization, Binning, Missing value treatments
  • 8.4. Balancing datasets
Assignment 6: Data preprocessing
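
To make topic 8.3 concrete, the following is a rough sketch of mean imputation followed by z-score standardization. It uses only the standard library so it runs without extra packages; in the course itself these steps would more likely be done with Pandas or Scikit-learn. The `ages` column and the helper names are made up for illustration.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [20.0, None, 30.0, 40.0]  # hypothetical column with one missing value
filled = impute_mean(ages)
scaled = standardize(filled)
```

Imputing before standardizing matters here: the mean and standard deviation are computed on the filled column, so the imputed entry lands at zero after scaling.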
Week 9
3/18/2020
Feature Transformation and Engineering
  • 9.1. Categorical encodings
  • 9.2. Feature creation/engineering
  • 9.3. Feature extraction
Assignment: Transformation of categorical and continuous features
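
One common categorical encoding (topic 9.1) is one-hot encoding. The sketch below implements it in plain Python to show the idea; it is not the course's implementation, and the `colors` data and the `one_hot` helper are hypothetical.

```python
def one_hot(values):
    """Return the sorted category list and one indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "red", "blue"]  # made-up categorical feature
categories, encoded = one_hot(colors)
```

Each row contains exactly one `1`, in the column of its category. Sorting the categories keeps the column order stable across runs.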
Week 10
3/25/2020
Building and Evaluating Models
  • 10.1. Tour of machine learning algorithms using Scikit-learn
  • 10.2. Introduction to Scikit-learn model development API
  • 10.3. Amazon SageMaker
  • 10.4. Training and fitting classification models
  • 10.5. Training and fitting regression models
  • 10.6. Performance evaluation metrics and curves
Assignment: Model building and evaluation using Scikit-learn
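
As a small example of the evaluation metrics in topic 10.6, this sketch computes precision and recall for a binary classifier from scratch. The labels are invented, and the function is written by hand for clarity rather than taken from any library.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [1, 0, 1, 1, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 1, 1, 0, 0, 0]  # hypothetical model predictions
precision, recall = precision_recall(y_true, y_pred)
```

Precision is the share of predicted positives that are correct, and recall is the share of actual positives that were found.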
Week 11
4/1/2020
Best Practices in Machine Learning
  • 11.1. Bias vs variance tradeoff
  • 11.2. Train/dev/test datasets
  • 11.3. Regularization
  • 11.4. Learning vs validation curves
  • 11.5. Hyperparameter tuning
  • 11.6. Ensemble learning
  • 11.7. Streamlining workflows with pipelines
Assignment: Regularization, cross-validation, and hyperparameter tuning
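
The cross-validation named in this assignment relies on splitting the data into folds. A minimal sketch of k-fold index splitting, assuming contiguous, unshuffled folds, might look like this. The function name `k_fold_indices` is hypothetical; in practice a library splitter, usually with shuffling, would be used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
```

Every sample appears in exactly one validation fold, so each candidate hyperparameter setting can be scored on all of the data without training and validating on the same rows.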
Week 12
4/8/2020
1. Guest Lecture: Data Science at Wells Fargo
2. Discussion on final project status
Quiz 2: Best practices in machine learning
Week 13
4/15/2020
Productionize a Machine Learning model
  • 13.1. Dev/Stage/Prod environment
  • 13.2. Docker, Dockerfiles, Docker containers
  • 13.3. Deploy a machine learning model as a Flask app
  • 13.4. Introduction to Airflow
Assignment: Build and deploy a model using Docker and a Heroku app
Week 14
4/22/2020
Final Project Demo
Short (5-minute) individual project demos
