Course materials for General Assembly's Data Science course in Washington, DC (12/15/14 - 3/16/15). View student work in the student repository.
Instructors: Sinan Ozdemir and Kevin Markham. Teaching Assistant: Brandon Burroughs.
Office hours: 1-3pm on Saturday and Sunday (Starbucks at 15th & K), 5:15-6:30pm on Monday (GA)
Monday | Wednesday |
---|---|
12/15: Introduction | 12/17: Python |
12/22: Getting Data | 12/24: No Class |
12/29: No Class | 12/31: No Class |
1/5: Git and GitHub | 1/7: Pandas (Milestone: Question and Data Set) |
1/12: Numpy, Machine Learning, KNN | 1/14: scikit-learn, Model Evaluation Procedures |
1/19: No Class | 1/21: Linear Regression |
1/26: Logistic Regression, Preview of Other Models | 1/28: Model Evaluation Metrics (Milestone: Data Exploration and Analysis Plan) |
2/2: Working a Data Problem | 2/4: Clustering and Visualization (Milestone: Deadline for Topic Changes) |
2/9: Naive Bayes | 2/11: Natural Language Processing |
2/16: No Class | 2/18: Decision Trees and Ensembles (Milestone: First Draft) |
2/23: Advanced scikit-learn | 2/25: Databases and MapReduce |
3/2: Recommenders | 3/4: Course Review, Companion Tools (Milestone: Second Draft, Optional) |
3/9: TBD | 3/11: Project Presentations |
3/16: Project Presentations | |
- Install the Anaconda distribution of Python 2.7.x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "DAT4 team" and add your photo!
- Introduction to General Assembly
- Course overview: our philosophy and expectations (slides)
- Data science overview (slides)
- Tools: check for proper setup of Anaconda, overview of Slack
Homework:
- Resolve any installation issues before next class.
Optional:
- Review the code from Saturday's Python refresher for a recap of some Python basics.
- Read Analyzing the Analyzers for a useful look at the different types of data scientists.
- Subscribe to the Data Community DC newsletter or check out their event calendar to become acquainted with the local data community.
- Brief overview of Python environments: Python interpreter, IPython interpreter, Spyder
- Python quiz (solution)
- Working with data in Python
- Obtain data from a public data source
- FiveThirtyEight alcohol data, and revised data (continent column added)
- Reading and writing files in Python (code)
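As a rough illustration of the file reading and writing topic above, here is a minimal sketch using the standard library's `csv` module. It is not the course's own code. It assumes Python 3 (the course installs Python 2.7, where `open` does not accept a `newline` argument), and the rows, column names, and file name are hypothetical stand-ins rather than the FiveThirtyEight data.

```python
import csv
import os
import tempfile

# Hypothetical rows for illustration only (not the FiveThirtyEight data).
rows = [
    {"country": "A", "beer_servings": "10"},
    {"country": "B", "beer_servings": "25"},
]

# Write to a temporary directory so the example leaves your files alone.
path = os.path.join(tempfile.mkdtemp(), "drinks_example.csv")

# Write: newline="" lets the csv module control line endings (Python 3).
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "beer_servings"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back; every value comes back as a string.
with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))

# Convert to int before doing arithmetic on a column.
total = sum(int(row["beer_servings"]) for row in loaded)
print(total)
```

The `with` blocks close each file automatically, even if an error occurs partway through.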
Homework:
- Python exercise
- Read through the project page in detail.
- Review a few projects from past Data Science courses to get a sense of the variety and scope of student projects.
- Check for proper setup of Git by running `git clone https://github.com/justmarkham/DAT-project-examples.git`.
Optional:
- If you need more practice with Python, review the "Python Overview" section of A Crash Course in Python, work through some of Codecademy's Python course, or work through Google's Python Class and its exercises.
- For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford.
Resources:
- Online Python Tutor is useful for visualizing (and debugging) your code.
- Checking your homework
- Regular expressions, web scraping, APIs (slides, regex code, web scraping and API code)
- Any questions about the course project?
Homework:
- Think about your project question, and start looking for data that will help you to answer your question.
- Prepare for our next class on Git and GitHub:
- You'll need to know some command line basics, so please work through GA's excellent command line tutorial and then take this brief quiz.
- Check for proper setup of Git by running `git clone https://github.com/justmarkham/DAT-project-examples.git`. If that doesn't work, you probably need to install Git.
- Create a GitHub account. (You don't need to download anything from GitHub.)
Optional:
- If you aren't feeling comfortable with the Python we've done so far, keep practicing using the resources above!
Resources:
- regex101 is an excellent tool for testing your regular expressions. To learn more about regular expressions, Google's Python Class includes an excellent regex lesson (which includes a video).
- Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
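For a sense of the kind of pattern you might paste into a tester such as regex101, here is a small sketch using Python's built-in `re` module. The sample text and addresses are made up, and the email pattern is a deliberate simplification (an assumption for illustration, not a complete validator and not taken from the class code).

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
text = "Contact alice@example.com or bob@example.org for office hours."

# Simplified email pattern (an assumption, not a full address validator):
# word chars, dots, or hyphens, then "@", a domain name, a dot, and a suffix.
pattern = re.compile(r"[\w.-]+@[\w-]+\.[a-z]+")

# findall returns every non-overlapping match as a list of strings.
emails = pattern.findall(text)
print(emails)

# A capture group (the parentheses) pulls out just part of each match.
domains = [m.group(1) for m in re.finditer(r"@([\w-]+\.[a-z]+)", text)]
print(domains)
```

Raw strings (the `r"..."` prefix) keep backslashes such as `\w` from being interpreted by Python before the regex engine sees them.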
- Special guest: Nick DePrey presenting his class project from DAT2
- Git and GitHub (slides)
Homework:
- Project milestone: Submit your question and data set to your folder in DAT4-students before class on Wednesday! (This is a great opportunity to practice writing Markdown and creating a pull request.)
Optional:
- Clone this repo (DAT4) for easy access to the course files.
Resources:
- Read the first two chapters of Pro Git to gain a much deeper understanding of version control and basic Git commands.
- GitRef is an excellent reference guide for Git commands.
- Git quick reference for beginners is a shorter reference guide with commands grouped by workflow.
- The Markdown Cheatsheet covers standard Markdown and a bit of "GitHub Flavored Markdown."
- Pandas for data exploration, analysis, and visualization (code)
- Split-Apply-Combine pattern
- Simple examples of joins in Pandas
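The split-apply-combine pattern and a simple join from the list above can be sketched in plain Python, without Pandas, to show what happens at each step. This is an illustrative sketch and not the class code: the records, column names, and lookup table are hypothetical, and the comments name the rough Pandas counterparts.

```python
from collections import defaultdict

# Hypothetical records for illustration only.
drinks = [
    {"country": "A", "continent": "EU", "beer": 100},
    {"country": "B", "continent": "EU", "beer": 200},
    {"country": "C", "continent": "AS", "beer": 30},
]

# Split: group the beer values by continent.
groups = defaultdict(list)
for row in drinks:
    groups[row["continent"]].append(row["beer"])

# Apply and combine: compute the mean of each group and collect the results.
# The Pandas counterpart is roughly df.groupby("continent")["beer"].mean().
mean_beer = {key: sum(values) / len(values) for key, values in groups.items()}
print(mean_beer)

# Inner join on the continent code: rows without a match are dropped.
# In Pandas, with both tables as DataFrames, this is roughly
# pd.merge(df, names, on="continent").
names = {"EU": "Europe"}  # hypothetical lookup table; "AS" is deliberately absent
joined = [
    dict(row, continent_name=names[row["continent"]])
    for row in drinks
    if row["continent"] in names
]
print(len(joined))
```

Because the join is an inner join, the row for country "C" disappears from `joined`; a left join would keep it with a missing continent name.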
Homework:
- Read through this excellent example of data wrangling and exploration in Pandas.
Optional:
- To learn more Pandas, review this three-part tutorial, or review these three excellent (but extremely long) notebooks on Pandas: introduction, data wrangling, and plotting.
Resources:
- For more on Pandas plotting, read the visualization page from the official Pandas documentation.
- To learn how to customize your plots further, browse through this notebook on matplotlib.
- To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and Columbia's Data Mining class has an excellent slide deck.