DistrictDataLabs / dod-ds-overview

Data Science and Big Data Overview Training

This repository contains Jupyter notebooks and associated data for the District Data Labs introductory training on data science and big data.

Requirements

Before running any code in this repository, make sure you have installed the requirements (preferably in a virtualenv or conda environment) with:

pip install -r requirements.txt
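To confirm that the install landed in an isolated environment rather than the system Python, a quick check can be run from the interpreter. This is a minimal sketch and a heuristic only: the function name in_isolated_environment is hypothetical, and the CONDA_PREFIX test also passes when only conda's base environment is active.

```python
import os
import sys


def in_isolated_environment() -> bool:
    """Heuristically report whether Python runs inside a venv, virtualenv, or conda env."""
    # venv and recent virtualenv releases point sys.prefix at the environment
    # while sys.base_prefix keeps pointing at the base installation.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases set sys.real_prefix instead.
    in_legacy_virtualenv = hasattr(sys, "real_prefix")
    # conda exports CONDA_PREFIX on activation (assumption: this includes "base").
    in_conda = bool(os.environ.get("CONDA_PREFIX"))
    return in_venv or in_legacy_virtualenv or in_conda


if __name__ == "__main__":
    print("Isolated environment detected:", in_isolated_environment())
```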

Notebooks

This repository contains the following notebooks:

  • exploratory_data_analysis.ipynb: a notebook demonstrating basic exploratory data analysis (EDA) techniques using Yelp and U.S. Census data
  • supervised_learning.ipynb: a notebook demonstrating supervised learning techniques on baseball player statistics
  • data_collection.ipynb: a notebook demonstrating data acquisition by web scraping public speeches by the U.S. Secretary of Defense
  • unsupervised_learning.ipynb: a notebook demonstrating unsupervised learning (clustering) on the public speeches referenced above
  • string_matching.ipynb: a notebook demonstrating techniques for entity resolution using string matching
  • elasticsearch_overview.ipynb: a notebook demonstrating how to interact with an Elasticsearch cluster (requires access to an Elasticsearch instance, either remote or local, for example one running in Docker)
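
Because the Elasticsearch notebook depends on a reachable cluster, it can help to check connectivity before opening it. The sketch below is an illustration, not part of the repository: the function name elasticsearch_reachable is hypothetical, and the URL assumes a local instance on Elasticsearch's default HTTP port 9200. It only confirms that some HTTP server answers at that address; it does not verify that the server is Elasticsearch.

```python
import urllib.error
import urllib.request


def elasticsearch_reachable(url: str = "http://localhost:9200", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url` (assumed Elasticsearch endpoint)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded with an error status (for example 401 on a
        # secured cluster), so it is reachable even though the request failed.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or a malformed URL.
        return False


if __name__ == "__main__":
    print("Elasticsearch reachable:", elasticsearch_reachable())
```

For a remote cluster, pass its address instead of the default, for example elasticsearch_reachable("http://my-es-host:9200"), where my-es-host is a placeholder.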

Data

This repository is self-contained: all data used by the notebooks is available as .csv files in the /data directory. The provenance of each file is explained in the notebook that uses it.
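
To see which data files are available before opening a notebook, the directory can be listed and a file previewed with the standard library alone. This is a minimal sketch under stated assumptions: the helper names list_csv_files and preview_csv are hypothetical, the code is run from the repository root so that the relative path "data" resolves, and the files are assumed to be UTF-8 encoded with a header row.

```python
import csv
from itertools import islice
from pathlib import Path


def list_csv_files(data_dir="data"):
    """Return the .csv files directly inside data_dir, sorted by path."""
    return sorted(Path(data_dir).glob("*.csv"))


def preview_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) of a CSV file as lists of strings."""
    # Assumption: UTF-8 encoding; adjust if a file uses another encoding.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows


if __name__ == "__main__":
    for csv_path in list_csv_files():
        header, _ = preview_csv(csv_path, n_rows=0)
        print(csv_path.name, "columns:", header)
```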

License: MIT License


Languages

Language: Jupyter Notebook 100.0%