AmiiThinks / cookiecutter-data-science

A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.

Home Page:http://drivendata.github.io/cookiecutter-data-science/


Cookiecutter Data Science

Standard Naming Convention

Name files after the person who created them, the notebook and version numbers, and the file's purpose. For example:

  • notebooks/ak-1.2-DataPreprocessing.py
  • data/interim/ak-1.2-DataPreprocessing.csv
  • logs/ak-1.2-DataPreprocessing/GENERATED_FILES

Each name consists of your initials (preferably the first two), the notebook number (in chronological order of creation), the version number (incremented any time you make changes that would affect the output), and the type of file.
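As a rough illustration of the convention, the sketch below parses a file stem such as `ak-1.2-DataPreprocessing` and increments its version number. It assumes that `1.2` means notebook 1, version 2, and that initials are lowercase letters; `STEM_PATTERN`, `StemParts`, `parse_stem`, `format_stem`, and `bump_version` are hypothetical names and are not part of the template.

```python
import re
from typing import NamedTuple

# Assumed layout: <initials>-<notebook>.<version>-<FileType>
STEM_PATTERN = re.compile(
    r"^(?P<initials>[a-z]+)-(?P<notebook>\d+)\.(?P<version>\d+)-(?P<kind>[A-Za-z0-9]+)$"
)


class StemParts(NamedTuple):
    initials: str
    notebook: int
    version: int
    kind: str


def parse_stem(stem: str) -> StemParts:
    """Split a file stem into its parts, or raise ValueError if it does not match."""
    match = STEM_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"{stem!r} does not follow the naming convention")
    return StemParts(
        match["initials"],
        int(match["notebook"]),
        int(match["version"]),
        match["kind"],
    )


def format_stem(parts: StemParts) -> str:
    """Rebuild the file stem from its parts."""
    return f"{parts.initials}-{parts.notebook}.{parts.version}-{parts.kind}"


def bump_version(stem: str) -> str:
    """Return the stem with its version number incremented by one."""
    parts = parse_stem(stem)
    return format_stem(parts._replace(version=parts.version + 1))
```

For example, `bump_version("ak-1.2-DataPreprocessing")` gives the stem to use after a change that affects the output.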

Common file names

  • DataPreprocessing - manipulating the raw data files into standard pandas DataFrame form
  • DataExploration - looking at the contents of the data, usually the raw data but possibly other data sets
  • ModelX - running a particular model
  • ParameterTuning - sweeping across hyperparameters
  • Scratch - personal playground, not expected to be production ready at any point

Requirements to use the cookiecutter template:


  • Python 2.7 or 3
  • Cookiecutter Python package >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:
$ pip install cookiecutter

or

$ conda config --add channels conda-forge
$ conda install cookiecutter

To start a new project, run:


cookiecutter https://github.com/AmiiThinks/cookiecutter-data-science
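
If you would rather start the project from Python than from the shell, the cookiecutter package also provides a `cookiecutter()` function in `cookiecutter.main`. The following is a minimal sketch, not part of this template: `TEMPLATE_URL` and `new_project` are names introduced here for illustration. Calling `new_project()` needs network access and prompts for the template's options, just as the command above does.

```python
TEMPLATE_URL = "https://github.com/AmiiThinks/cookiecutter-data-science"


def new_project(output_dir="."):
    """Generate a new project from the template inside output_dir."""
    # Imported here so this module can be loaded even when the
    # cookiecutter package is not installed.
    from cookiecutter.main import cookiecutter

    cookiecutter(TEMPLATE_URL, output_dir=output_dir)


# Usage (downloads the template and asks the usual questions):
# new_project("/path/to/workspace")
```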


The resulting directory structure


The directory structure of your new project looks like this:

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third-party sources.
│   │   └── README.md  <- Link or instructions on acquiring the data
│   ├── interim        <- Intermediate data that has been transformed; use the standard naming convention.
│   ├── processed      <- The final, canonical data sets for modeling; names follow the standard naming convention and refer to the relevant notebooks.
│   └── raw            <- The original, immutable data dump.
│       └── README.md  <- Link or instructions on acquiring the data
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── logs               <- Stores details of experiments, learning curves from custom tools
│   └── ID-#-nb        <- Directory referencing the notebook or commit used to create logs.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is the creator's initials,
│                         a number (for ordering), and a CamelCase description, e.g.
│                         `ak-1.0-DataPreprocessing.ipynb`.
│
├── references         <- Relevant papers, Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project, the ultimate deliverable
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
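
To show how the naming convention maps onto this layout, here is a small sketch that derives the notebook, interim data, and log locations for one file stem. `artifact_paths` is a hypothetical helper, not something the template generates, and the `.ipynb` and `.csv` suffixes are assumptions taken from the examples in this README.

```python
from pathlib import Path


def artifact_paths(project_root, stem):
    """Return the conventional locations for a stem such as 'ak-1.2-DataPreprocessing'."""
    root = Path(project_root)
    return {
        # The notebook that produced the outputs.
        "notebook": root / "notebooks" / f"{stem}.ipynb",
        # Intermediate data written by that notebook.
        "interim_data": root / "data" / "interim" / f"{stem}.csv",
        # Directory holding logs generated by that notebook.
        "log_dir": root / "logs" / stem,
    }


paths = artifact_paths("my-project", "ak-1.2-DataPreprocessing")
for name, path in paths.items():
    print(f"{name}: {path}")
```

Keeping the same stem across `notebooks`, `data/interim`, and `logs` makes it possible to trace any generated file back to the notebook version that created it.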

Contributing

We welcome contributions! See the docs for guidelines.

Installing development requirements


pip install -r requirements.txt

Running the tests


py.test tests


License:MIT License

