sandtable / docker-cookiecutter-data-science

A fork of the cookiecutter-data-science leveraging Docker for local development.

Home Page: http://drivendata.github.io/cookiecutter-data-science/

Docker Cookiecutter Data Science

Helping Data Science teams easily move to a Docker-first development workflow to iterate and deliver projects faster and more reliably.

New to Docker? Check out this writeup on containers vs virtual machines and how Docker fits in:

https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b

Cookiecutter is a command-line utility that automatically scaffolds new projects for you based on templates (referred to as cookiecutters):

http://cookiecutter.readthedocs.io/en/latest/readme.html

This cookiecutter is used in conjunction with a base development image available on Docker Hub to provide a ready-to-use environment for many Data Science and Machine Learning use cases. After running this cookiecutter and the provided start script, a developer will have a local development setup that looks like this:

[Diagram: Docker local development setup]

By scaffolding your data science projects using this cookiecutter, you will get:

  • Project Docker image built with your own Dockerfile for project-specific requirements
  • Docker Compose configuration that dynamically binds to a free host port and forwards to the Jupyter server listening port inside the container
  • Shared volume configuration for accessing and executing all your project code inside of the controlled container environment
  • Ability to edit code using your favorite IDE on your host machine and see real-time changes to the runtime environment
  • Jupyter notebook fully configured with nb-extensions ready for development and feature engineering
  • Common data science and plotting libraries pre-installed in the container environment to start working immediately
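
As a rough illustration of the dynamic port binding and shared volume described above, the sketch below writes a throwaway Compose file into a scratch directory. It is not the file this template generates: the service name `dev`, the container port 8888 (Jupyter's usual default), and the `/mnt` mount path are all assumptions chosen for the example. The relevant detail is that listing only the container port under `ports` lets Docker choose a free host port.

```shell
# Write an illustrative docker-compose.yml into a scratch directory.
# Service name, container port, and mount path are assumptions for this sketch.
example_dir=$(mktemp -d)
cat > "$example_dir/docker-compose.yml" <<'EOF'
version: "3"
services:
  dev:                      # hypothetical service name
    build: .                # build the project image from the local Dockerfile
    ports:
      - "8888"              # container port only: Docker picks a free host port
    volumes:
      - .:/mnt              # share the project directory with the container
EOF
echo "Wrote $example_dir/docker-compose.yml"
```

Because the host port is not fixed, several projects scaffolded this way can run side by side without port conflicts.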

There are several downstream benefits to moving to a container-first workflow in terms of model and inference engine deployment and delivery. By using containers early in the development cycle, you can remove many of the configuration management issues that waste developer time and ultimately lower the quality of deliverables.

Getting Started

  1. Install Docker:
  2. Install the Python Cookiecutter package, version 1.4.0 or later: http://cookiecutter.readthedocs.org/en/latest/installation.html
    $ pip install cookiecutter
    It is recommended to set up a central virtualenv or conda environment for cookiecutter and any other system-wide Python packages you may need.
  3. Run the cookiecutter docker data science template to scaffold your new project:
    $ cookiecutter https://github.com/manifoldai/docker-cookiecutter-data-science.git
  4. Answer all of the cookiecutter prompts for project name, description, license, etc.
  5. Run the start script from the root of your new project directory:
    $ ./start.sh
  6. After the project image builds, check which host port is being forwarded to the Jupyter notebook server inside the running container:
    $ docker ps
  7. Using any browser, access your notebook at localhost:{port}
  8. Start working!
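
Steps 6 and 7 can be scripted if you prefer not to read the port off by eye. The helper below is a hypothetical sketch, not part of the template: it assumes a single port mapping in the form shown in the PORTS column of `docker ps`, such as `0.0.0.0:32768->8888/tcp`, and the port numbers in the example are made up.

```shell
# Hypothetical helper: extract the host port from one Docker port mapping,
# e.g. "0.0.0.0:32768->8888/tcp" as shown in the PORTS column of `docker ps`.
host_port_from_mapping() {
    mapping=$1
    host_side=${mapping%%->*}          # keep the part before "->", e.g. "0.0.0.0:32768"
    printf '%s\n' "${host_side##*:}"   # keep the part after the last ":", e.g. "32768"
}

# Example with a made-up mapping; paste the value from your own `docker ps` output.
port=$(host_port_from_mapping "0.0.0.0:32768->8888/tcp")
echo "http://localhost:${port}"
```

The function uses only POSIX parameter expansion, so it works in `sh` as well as `bash`.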

For more details on which packages come pre-installed in the base image, see the manifoldai/docker-ml-dev repository page on Docker Hub.

Project Structure

The directory structure of your new project looks like this:

├── LICENSE
├── Dockerfile            <- New project Dockerfile that sources from base ML dev image
├── docker-compose.yml    <- Docker Compose configuration file
├── docker_clean_all.sh   <- Helper script to remove all containers and images from your system
├── start.sh              <- Script to run docker compose and any other project-specific initialization steps
├── Makefile              <- Makefile with commands like `make data` or `make train`
├── README.md             <- The top-level README for developers using this project.
├── data
│   ├── external          <- Data from third party sources.
│   ├── interim           <- Intermediate data that has been transformed.
│   ├── processed         <- The final, canonical data sets for modeling.
│   └── raw               <- The original, immutable data dump.
│
├── docs                  <- A default Sphinx project; see sphinx-doc.org for details
│
├── models                <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks             <- Jupyter notebooks. Naming convention is a number (for ordering),
│                            the creator's initials, and a short `-` delimited description, e.g.
│                            `1.0-jqp-initial-data-exploration`.
│
├── references            <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports               <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures           <- Generated graphics and figures to be used in reporting
│
├── requirements.txt      <- The requirements file for reproducing the analysis environment, e.g.
│                            generated with `pip freeze > requirements.txt`
│
├── src                   <- Source code for use in this project.
│   ├── __init__.py       <- Makes src a Python module
│   │
│   ├── data              <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features          <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models            <- Scripts to train models and then use trained models to make
│   │   │                    predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization     <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini               <- tox file with settings for running tox; see tox.testrun.org
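
The notebook naming convention described in the tree above (a number, the creator's initials, and a short `-` delimited description) is easy to apply consistently with a small helper. The function below is a hypothetical sketch and is not shipped with the template; its name and argument order are assumptions.

```shell
# Hypothetical helper: build a notebook name from a number, the creator's
# initials, and a free-text description, following the convention above.
notebook_name() {
    number=$1
    initials=$2
    shift 2
    # Lower-case the description and join its words with "-"
    slug=$(printf '%s' "$*" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-')
    printf '%s-%s-%s\n' "$number" "$initials" "$slug"
}

notebook_name 1.0 jqp "Initial Data Exploration"
# prints: 1.0-jqp-initial-data-exploration
```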

Video Demo

[Video: Torus demo on YouTube]

Helpful Resources

Why Did We Build This?

We are trying to bridge the gap that exists between data science and dev/operations teams today. We wrote about it here: https://medium.com/manifold-ai/torus-a-toolkit-for-docker-first-data-science-bddcb4c97b52

Contributing

PRs and feature requests are very welcome!

License: MIT License

