SF DAT15 Course Repository

Course materials for General Assembly's Data Science course in San Francisco, CA (6/15/15 - 8/26/15).

Instructor: Sinan Ozdemir (who is awesome)

Teaching Assistants: Liam Foley, Patrick Foley, and Ramesh Sampath (who are all way more awesome)

Office hours: All will be held in the student center at GA, 225 Bush Street

  • Monday 5:15-6:15pm
  • Tuesday 6:30-8:30pm
  • Wednesday 5:15-6:15pm
  • Friday 12:30-2:30pm
  • Saturday 10:00am-12:00pm

Course Project information

| Monday | Wednesday |
|---|---|
| 6/15: Introduction / Expectations / Git Intro | 6/17: Python |
| 6/22: Data Science Workflow / Pandas | 6/24: More Pandas |
| 6/29: Intro to ML / Numpy / KNN | 7/1: Scikit-learn / Model Evaluation. Project Milestone: Question and Data Set. Homework 1 Due |
| 7/6: Linear Regression | 7/8: Logistic Regression |
| 7/13: Working on a Data Problem | 7/15: Clustering |
| 7/20: Natural Language Processing | 7/22: Naive Bayes. Milestone: First Draft Due |
| 7/27: Decision Trees | 7/29: Ensembling Techniques |
| 8/3: Recommendation Engines. Milestone: Peer Review Due | 8/5: Databases / MapReduce |
| 8/10: Dimension Reduction | 8/12: Ensemble Techniques |
| 8/17: Web Development with Flask | 8/19: Neural Networks |
| 8/24: Projects | 8/26: Projects |

Installation and Setup

  • Install the Anaconda distribution of Python 2.7.x.
  • Install Git and create a GitHub account.
  • Once you receive an email invitation from Slack, join our "SF_DAT_15 team" and add your photo!


Class 1: Introduction / Expectations / Git Intro

  • Introduction to General Assembly
  • Course overview: our philosophy and expectations (slides)
  • Git overview: (slides)
  • Tools: check for proper setup of Git, Anaconda, overview of Slack


  • Resolve any installation issues before next class.
  • Make sure you have a GitHub profile and have created a repo called "SF_DAT_15"
  • Clone the class repo (this one!)
  • Review this code for a recap of some Python basics.


Class 2: Python

  • Brief overview of Python environments: Python interpreter, IPython interpreter, Spyder, Rodeo
  • Python quiz (code)
  • Check out some IPython Notebooks!
  • Working with data in Python in Spyder
  • Lab on files and API usage




Class 3: Data Science Workflow / Pandas


  • Slides on the Data Science workflow here
    • Data Science Workflow
  • Intro to Pandas walkthrough here
    • I will give you semi-cleaned data allowing us to work on step 3 of the data science workflow
    • Pandas is an excellent tool for exploratory data analysis
    • It allows us to easily manipulate, graph, and visualize basic statistics and elements of our data
    • Pandas Lab!


  • Begin thinking about potential projects that you'd want to work on. Consider the problems discussed in class today (we will see more next time and next Monday as well)
    • Do you want a predictive model?
    • Do you want to cluster similar objects (like words or other)?


Class 4: More Pandas


  • Class code on Pandas here
  • We will work with 3 different data sets today:
  • Pandas Lab! here


  • Please review the readme for the first homework. It is due NEXT Wednesday (7/1/2015)
  • The one-pager for your project is also due. Please see project guidelines

Class 5: Intro to ML / Numpy / KNN


  • Intro to numpy code
    • Numerical Python, code adapted from tutorial here
    • Special attention to the idea of the np.array
  • Intro to Machine Learning and KNN slides
    • Supervised vs Unsupervised Learning
    • Regression vs. Classification
  • Iris pre-work code and code solutions
    • Using numpy to investigate the iris dataset further
    • Understanding how humans learn so that we can teach the machine!
  • Lab to create our own KNN model
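
As a rough idea of what building a KNN model by hand involves, here is a minimal sketch. It is not the lab solution: the function name `knn_predict` and the toy points are made up for illustration, and it uses only the standard library rather than numpy so it runs anywhere.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, paired with that point's label
    distances = [
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small groups of 2-D points
train_points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(train_points, train_labels, (1.0, 1.0), k=3)
prediction_near_b = knn_predict(train_points, train_labels, (3.0, 3.0), k=3)
```

The whole "model" is just the stored training data; all of the work happens at prediction time, which is worth keeping in mind when thinking about how the choice of K changes the result.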


  • The one page project milestone as well as the pandas homework!
  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?


  • For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)

Class 6: scikit-learn, Model Evaluation Procedures

  • Introduction to scikit-learn with iris data (code)

  • Exploring the scikit-learn documentation: user guide, module reference, class documentation

  • Discuss the article on the bias-variance tradeoff

  • Look at some code on the bias-variance tradeoff

    • To run this, I use a module called "seaborn"
    • To install it, open your terminal (Git Bash) in any directory and type sudo pip install seaborn
  • Model evaluation procedures (slides, code)



  • Practice what we learned in class today!
    • If you have gathered your project data already: Try using KNN for classification, and then evaluate your model. Don't worry about using all of your features, just focus on getting the end-to-end process working in scikit-learn. (Even if your project is regression instead of classification, you can easily convert a regression problem into a classification problem by converting numerical ranges into categories.)
    • If you don't yet have your project data: Pick a suitable dataset from the UCI Machine Learning Repository, try using KNN for classification, and evaluate your model. The Glass Identification Data Set is a good one to start with.
    • Either way, you can submit your commented code to your SF_DAT_15_WORK, and we'll give you feedback.
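
Converting a regression problem into a classification problem by binning the response can be sketched as follows. This is only an illustration: the helper `to_category`, the cut points, and the example prices are all hypothetical, and you would pick ranges that make sense for your own data.

```python
from bisect import bisect_right


def to_category(value, edges, labels):
    """Map a number to a label; labels[i] covers edges[i-1] <= value < edges[i]."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the bin the value falls into
    return labels[bisect_right(edges, value)]


# Hypothetical numeric response (e.g. a price) and hypothetical cut points
edges = [100000, 300000]
labels = ["low", "medium", "high"]
prices = [85000, 100000, 250000, 410000]

categories = [to_category(p, edges, labels) for p in prices]
# categories is ["low", "medium", "medium", "high"]
```

The resulting category list can then be used as the response when fitting a classifier such as KNN.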


Class 7: Linear Regression



Class 8: Logistic Regression



Class 9: Working on a Data Problem

  • Today we will work on a real-world data problem! Our data covers 7 months of stock data for a fictional company, ZYX, including Twitter sentiment, volume, and stock price. Our goal is to build a model that predicts forward returns.

  • Project overview (slides)

    • Be sure to read documentation thoroughly and ask questions! We may not have included all of the information you need...
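
For reference, a forward return is commonly defined as the percentage change from the price at time t to the price some number of periods later. The sketch below assumes that definition and a one-period horizon by default; the project documentation may define it differently, and the function name and prices are made up for illustration.

```python
def forward_returns(prices, horizon=1):
    """For each t, return (prices[t + horizon] - prices[t]) / prices[t].

    Positions whose future price is not available yet get None.
    """
    returns = []
    for t, price in enumerate(prices):
        if t + horizon < len(prices):
            future = prices[t + horizon]
            returns.append((future - price) / float(price))
        else:
            # No price `horizon` steps ahead, so the forward return is unknown
            returns.append(None)
    return returns


# Hypothetical daily closing prices
prices = [100.0, 110.0, 99.0, 99.0]
one_day = forward_returns(prices)
two_day = forward_returns(prices, horizon=2)
```

Because the response at time t uses a future price, make sure the features for row t only use information available at time t.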

Class 10: Clustering and Visualization

  • The slides today will focus on our first look at unsupervised learning, K-Means Clustering!
  • The code for today focuses on two main examples:
    • We will investigate simple clustering using the iris data set.
    • We will take a look at a harder example, using Pandora songs as data. See the data and the code here
    • Checking out some of the limitations of K-Means Clustering here
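
The core K-Means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in a few lines. This is a simplified illustration, not the class code: the function name `kmeans` and the toy points are hypothetical, and the starting centroids are passed in explicitly so the example is deterministic, whereas library implementations usually choose them for you.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        for p in points
    ]


def kmeans(points, initial_centroids, n_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the centroid to the mean of the points assigned to it
                mean = tuple(sum(c) / float(len(members)) for c in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its old centroid
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Toy data (hypothetical): two well-separated groups
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# Deliberately poor start: both centroids begin inside the first group
centroids, labels = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])
# labels is [0, 0, 0, 1, 1, 1]
```

Even from a poor start the centroids drift apart here because the groups are far from each other; with overlapping or oddly shaped groups the result depends much more on the starting centroids.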


  • HW2 and Project Milestone 2 are due in one week!
  • Download all of the NLTK collections.
    • In Python, use the following commands to bring up the download menu:
    • import nltk
    • nltk.download()
    • Choose "all".
    • Alternatively, just type nltk.download('all')
  • Install two new packages: textblob and lda.
    • Open a terminal or command prompt.
    • Type pip install textblob and pip install lda.


Class 11: Natural Language Processing


  • Natural Language Processing is the science of turning words and sentences into data and numbers. Today we will be exploring techniques in this field
  • code showing topics in NLP
  • lab analyzing tweets about the stock market


  • Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Wednesday. Here are some questions to think about while you read:
    • Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    • Before he tried the "statistical approach" to spam filtering, what was his approach?
    • How exactly does his statistical filtering system work?
    • What did Paul say were some of the benefits of the statistical approach?
    • How good was his prediction of the "spam of the future"?
  • Below are the foundational topics upon which Wednesday's class will depend. Please review these materials before class:
    • Confusion matrix: a good guide roughly mirrors the lecture from class 10.
    • Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
    • Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
  • You should definitely be working on your project! First draft and HW2 are both due Wednesday!!
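
As a quick refresher on the confusion matrix terms above, the sketch below counts the four cells by hand and derives sensitivity and specificity from them. The function names and the labels are made up for illustration; here 1 stands for "spam" and 0 for "not spam".

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives and false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / float(tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were predicted negative."""
    return tn / float(tn + fp)


# Hypothetical labels: 4 spam messages followed by 6 legitimate ones
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)  # 3 of the 4 spam messages were caught
spec = specificity(tn, fp)  # 5 of the 6 legitimate messages were let through
```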

Class 12: Naive Bayes Classifier

Today we are going over advanced metrics for classification models and learning a brand new classification model called Naive Bayes!


  • Learn about ROC/AUC curves
  • Learn the Naive Bayes Classifier
    • Slides here
    • Code here
    • In the code file above we will create our own spam classifier!
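
To give a feel for what a Naive Bayes spam classifier does under the hood, here is a small sketch. It is not the class code: it assumes a multinomial model with add-one (Laplace) smoothing over whitespace-separated words, and the function names and the four training messages are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior score for `text`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: how common this class is in the training data
        score = math.log(class_counts[label] / float(total_docs))
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (class, word) pairs from zeroing the score
            smoothed = word_counts[label][word] + 1
            score += math.log(smoothed / float(total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (hypothetical)
documents = ["win money now", "free money offer", "meeting at noon", "lunch at noon tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(documents, labels)
first = predict(model, "free money")      # "spam"
second = predict(model, "lunch at noon")  # "ham"
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log probabilities are simply added together.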


Class 13: Decision Trees

We will look into a slightly more complex model today, the Decision Tree.



  • Project reviews due August 3rd!


  • Chapter 8.1 of An Introduction to Statistical Learning also covers the basics of Classification and Regression Trees
  • The scikit-learn documentation has a nice summary of the strengths and weaknesses of Trees.
  • For those of you with a background in JavaScript, d3.js has a nice tree layout that would make more presentable tree diagrams:
    • Here is a link to a static version, as well as a link to a dynamic version with collapsible nodes.
    • If this is something you are interested in, Gary Sieling wrote a nice function in Python to take the output of a scikit-learn tree and convert it into JSON format.
    • If you are interested in learning d3.js, this is a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
  • Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes an R code walkthrough

Class 15: Recommenders

  • Recommendation Engines slides
  • Recommendation Engine Example code


Class 16: Databases and MapReduce

Class 17: Dimension Reduction


  • Some hardcore math in Python here
  • PCA using the iris data set here and with 2 components here
  • PCA step by step here
  • Check out Pyxley for our guest speaker's (Nick Kridler) talk on Wednesday

Class 18: Ensembling


Class 19: Web Development

  • slides here
  • We will be working with the flask app found here


  • MVC Architecture blog post
  • More on using Flask and Heroku here (Note you can ignore the virtual environment stuff, unless you want a challenge!)


  • Try to deploy your own ML model to Heroku!
  • Read an intro to Neural Networks here
  • And this intro to SVM

Class 20: Neural Networks and SVM



Project Info

  • Everyone will have a maximum of 15 minutes to present including Q&A
  • Please sign up for a slot if you haven't done so here
    • If you don't want to be crunched for time, try going on Monday :)
    • If you don't sign up by end of class today (Wednesday 8/19) we will assign you a slot
  • Final projects are mandatory if you want a certificate of completion from General Assembly
  • Remember, you must submit both a presentation and a write-up. ("What, a write-up? You never mentioned that!" I did, and it is also in the project requirements :)


