
Cookiecutter Py(thon)Sci(ence)-Project Template


👉 DISCLAIMER: If you're tired of setting up the same directory and file structure for your new scientific projects again and again, then this might be for you ;-)

This repository holds a "template" of a directory structure for small to medium-sized scientific projects, making use of CookieCutter, a templating engine for project structures. Check out the links at the bottom of the page to create your own CookieCutter or use this one to start your project. Also, feel free to fork the repository and adjust it to your own needs.




What is it good for? or How this can boost your productivity

CookieCutter is a templating engine that creates directory structures, including pre-defined files, based on a catalog of questions asked during setup.
By running cookiecutter with this repository, a new directory with a pre-defined structure and some basic files is created, leaving you all set to start a new scientific Python project without having to create the same files & structure manually over and over again. This includes

  • code that is importable from every place in your environment
  • automatically resolved paths to the project's root and the directories for data, plots, logs, etc.
  • make commands to run automated unit tests, create documentation of your code, etc.
  • a nice HTML representation of your Jupyter notebooks and your docstrings
  • and so on... 🚀

It is indeed so easy:

(animated demo: a cookiecutter run)

Note:

CookieCutter development seems to be on hiatus at the moment, i.e. the maintainers are not actively working on it anymore. While it should still work, you can also switch to the fork CookieNinja, which works with cookiecutters (project templates) in the same way as the original CookieCutter.

About this template

There exist tons of different CookieCutter templates for all kinds of projects. However, in my experience, many of them are very complex in their structure and therefore often a bit of an overkill, especially for newcomers or projects of a rather modest size.
This template provides a boilerplate for small to medium-sized (scientific) data projects, e.g. a thesis, a group project, or similar. For an overview of the directory & file structure, have a look at the section "Project Structure" further below. The few redundant parts (mainly for demonstration purposes) are listed in the section "Dummy files" right after it.

👉 Once set up, a Git repository is automatically initialized. If you want to connect it with a remote repository on GitHub (or any other hosted Git service), you need to add the respective remote repository to your local repository.
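For example (remote URL and branch name are placeholders; adjust them to your setup):

$ git remote add origin git@github.com:<your-username>/<your-project>.git
$ git push -u origin main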

Requirements

You need to have Python installed as well as the Python package cookiecutter (the commands below also install cruft, which can optionally keep your project in sync with later versions of the template). You can install them either via pip or conda.

$ pip install -U cookiecutter cruft
$ conda install -c conda-forge cookiecutter cruft

As mentioned above, instead of cookiecutter you can also use cookieninja. Besides that, there is no need to clone or download anything from this repository. Just follow the next step :-)

👉 Important hint: I recommend installing Mamba as a package manager. It is a drop-in replacement for conda with much better performance.

Usage

If you plan to use Git as a version control system, ensure that it is installed on your machine and that you have specified the global Git configuration settings (this needs to be done only once):

$ git config --global user.name "John Doe"
$ git config --global user.email johndoe@example.com

Setting up a new project

Then, with cookiecutter installed, create a new project from this template by executing one of the following (equivalent) commands:

$ cookiecutter gh:markusritschel/cookiecutter-pysci-project
$ cookiecutter https://github.com/markusritschel/cookiecutter-pysci-project.git
$ cookiecutter git+https://github.com/markusritschel/cookiecutter-pysci-project
$ cookiecutter git+ssh://git@github.com/markusritschel/cookiecutter-pysci-project.git

The script will ask you some questions based on the entries in cookiecutter.json and will then create a new project from this template, filled with the information you provided. Finalize this step by changing into the new directory.

👉 For the following steps, there are also Make commands available that ease the work for you.

Then, for development, I strongly recommend creating a dedicated virtual environment. Using conda, you can simply execute conda create -n <your-environment-name>, or create an environment based on the environment.yml file by executing conda env create -f environment.yml. The latter creates a virtual conda environment with the same name as your project directory and installs all the packages required to make your new project work, including those needed for generating the documentation. You can override the default name of the environment with the option -n <your-custom-name>.
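In short (standard conda commands; the custom name is a placeholder):

$ conda env create -f environment.yml                        # environment name taken from the file
$ conda env create -f environment.yml -n <your-custom-name>  # override the environment name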

You finalize the setup by activating the environment via conda activate <your-environment-name>.

👉 You can use the pre-defined commands make setup-conda-env and make src-available instead of performing the above steps manually. See also the README.md of your new project.

For further information, you may want to have a look at the README.md file in the root directory of your new project. It gives more information about setting up the project, making your code installable, etc. You may also want to check out the Makefile commands (simply type make to get an overview of the available commands).

Using the Makefile

The Makefile in the project directory provides some default routines like cleanup, testing, installing requirements, etc.
Even though using make may seem a bit old-fashioned to many people, I recommend having a look at Make's great capability of dealing with dependencies. This is particularly useful if, for example, the first step in your data-processing pipeline takes a long time to process the raw data and generate the interim product.
I usually structure my data-processing workflow such that I can run each single step via the command line (for example python scripts/process-raw-data.py -o ./output_dir). (The Python packages click, fire, and docopt provide neat functionality to turn your scripts into command-line interfaces; see the sketch after the Make example below.) I can then set these commands as targets in the Makefile, for example:

## Process raw data and write the newly generated data into ./data/interim/
process_raw_data:
    python scripts/process-raw-data.py

I can now simply run make process_raw_data in the project's root directory.
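To give an idea of how such a command-line interface might look with click, here is a minimal sketch (the script name and the -o option are taken from the example above; the actual processing logic is omitted):

# scripts/process-raw-data.py
import click

@click.command()
@click.option("-o", "--output-dir", default="data/interim", show_default=True,
              help="Directory to write the processed data to.")
def main(output_dir):
    """Process the raw data and write the result to the output directory."""
    click.echo(f"Processing raw data into {output_dir} ...")
    # ... your actual processing code goes here ...

if __name__ == "__main__":
    main()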

Setting dependencies

Let's assume that the previous step (processing the raw data) generates new data inside ./data/interim/. If I now have a second processing step that depends on the data generated by the previous step, I can declare these data as prerequisites of the new rule:

## Process interim data
process_interim_data: $(wildcard data/interim/**/*)
    python scripts/process-interim-data.py

This way, the second step is re-run whenever the data it depends on has changed since the last execution. (Strictly speaking, since process_interim_data is not itself a file, make considers it always out of date; to get true change-based behavior, let the rule produce a real file, e.g. a stamp file, as its target.)
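A minimal sketch of such a stamp-file variant (the file and stamp names are made up; note also that GNU make's $(wildcard) does not recurse into subdirectories with ** by default):

## Process interim data (the stamp file records the last successful run)
data/processed/.interim.stamp: $(wildcard data/interim/*)
    python scripts/process-interim-data.py
    touch $@

process_interim_data: data/processed/.interim.stamp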

For further information, have a look at Make's documentation: https://www.gnu.org/software/make/manual/html_node/Rules.html

Snakemake

Going one step further, in addition or as an alternative to make, Snakemake provides even more extensive functionality. Its syntax is pure Python, which makes it very convenient to work with and gives you all the expressiveness of Python in your workflow definitions. In Snakemake, you do not declare an "artificial" target; instead, you indicate the output file you want to create, and Snakemake takes care of producing the required dependencies. Another strength of Snakemake is its scalability: porting a workflow from your local machine to a High-Performance Computing system is as straightforward as adding a few extra parameters to the executed command. Snakemake then automatically generates batch scripts and submits them as jobs on the HPC, distributing the tasks of the workflow.
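For illustration, a minimal Snakemake rule might look like this (the file names are assumptions; the rule is only re-run if the input is newer than the output):

rule process_interim_data:
    input:
        "data/interim/dataset.csv"
    output:
        "data/processed/dataset.csv"
    shell:
        "python scripts/process-interim-data.py {input} {output}"

Running snakemake data/processed/dataset.csv then builds exactly that file, including any upstream dependencies.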

Write your documentation

In my opinion, it is helpful to differentiate between two kinds of documentation:

  1. The first kind documents your code (similar to what you would expect when opening the online documentation of a Python package) and shipping it with your code should be considered best practice. This includes docstrings as part of your functions as well as some example cases explaining the usage.
  2. The second kind documents the project itself: what it is about, showcasing results, etc. This might be the basis for a scientific paper, a thesis, or a project report. It can either be part of the "official" project documentation, or you create a separate documentation structure explicitly for this purpose (e.g. in the directory ./reports/book/). (Hint: consider writing this kind of documentation in a separate branch.)

For the first kind, I recommend using Sphinx, which is particularly suited for documenting Python code and is already set up as the default doc engine in this project template. Its autodoc extension can parse the docstrings of your code and render them as nice HTML output. For more details, see the section below.

For the second purpose, choose whichever tool you like the most (Sphinx, MkDocs, Jekyll, Jupyter-Book, MyST-MD, etc.). I personally like Jupyter-Book very much, as it is feature-rich and supports a bunch of source formats: Jupyter Markdown, MyST Markdown for more publishing features, reStructuredText, and even your Jupyter notebooks. Jupyter-Book is very suitable for technical documentation that includes Jupyter notebooks.

MyST-MD, on the other hand, has its focus more on scientific publications.

Using Sphinx

In short: describe as much of your code as possible in the docstrings of your functions, classes, and modules. Sphinx can then parse these docstrings and format them nicely in your documentation output. To compile an HTML version of your Sphinx documentation locally, enter the docsrc directory and execute make html (type make for more formats). Alternatively, you can run make docs from the root of your project. For a detailed description of how to use Sphinx and how to write your documentation, check out their website.
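For instance, a short NumPy-style docstring (parseable by Sphinx's napoleon extension) might look like this; the function itself is just an illustrative example:

import numpy as np

def detrend(series, order=1):
    """Remove a polynomial trend from a time series.

    Parameters
    ----------
    series : array_like
        The input time series.
    order : int, optional
        Order of the polynomial to remove (default: 1, i.e. a linear trend).

    Returns
    -------
    numpy.ndarray
        The detrended time series.
    """
    x = np.arange(len(series))
    coefficients = np.polyfit(x, series, order)
    return np.asarray(series) - np.polyval(coefficients, x)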

Publish your documentation with GitHub Pages 🚀

GitHub allows you to host static websites on its platform. In this project template, I have integrated a workflow for automatic deployment. The only thing you need to prepare: go into your repository's settings (on GitHub), open Pages, select "Deploy from a branch" as Source, and under Branch select "gh-pages" and "root". Save your changes.
Now, whenever you push something to the main branch, your documentation in docsrc will be automatically compiled and deployed. The result will be available at https://<your-github-username>.github.io/<your-project-name>. Magic… 🪄😉

Note
Keep in mind that the deployment may take a while. You can check the status of the workflow by clicking on "Actions" in the menu bar of your repository.

Using Jupyter-Book

To compile your Jupyter book, simply execute jb build reports/book or make book from the root of your project. Instead of your source code documentation, you can also place the content of your compiled Jupyter book in docs/ to publish it via GitHub Pages.

Using a (Jupyter-Book) report alongside your code documentation on GitHub Pages

GitHub Pages allows only one website per repository, which is usually accessible via the domain https://<your-github-username>.github.io/<your-project>. To serve both your project HTML report (Jupyter book) and your technical code documentation, you can merge the two compiled HTML outputs. For example, to have your project report as the main site on your repository's domain, put the content of your compiled Jupyter book (found in reports/book/_build/html) into ./docs (inside the repository's root) and put the Sphinx-compiled code documentation (found in docsrc/_build/html) into a subfolder of ./docs/ (e.g. ./docs/code-documentation). Then, your project report will be found on the repository's GitHub page (https://<your-github-username>.github.io/<your-project>) and your code documentation on https://<your-github-username>.github.io/<your-project>/code-documentation, respectively. You can then link the code documentation from your Jupyter-Book page or make the link available somewhere else.
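A sketch of the corresponding copy steps (assuming the build output paths mentioned above):

$ cp -r reports/book/_build/html/* docs/
$ mkdir -p docs/code-documentation
$ cp -r docsrc/_build/html/* docs/code-documentation/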

📔 Side note
Jupyter Book can also be a nice way to share your collection of Jupyter notebooks online. There are plugins that allow people to comment on the rendered HTML representation of your notebooks.

Some tips and thoughts regarding the code layout

All scripts and Jupyter notebooks that deal with either processing of the data or the creation of any kind of reports (plots, documents, etc) should reside in the scripts/ and notebooks/ directories, respectively.
Code under src/ is exclusively source code (i.e. low-level code) and is not directly run.
Name scripts and notebooks in a way that indicates their order of execution (examples can be found in the respective directories). It is also recommended to have one script per task, e.g. the creation of one figure or one table.

⚠️ A note on version-controlling Jupyter notebooks
Jupyter notebooks are ugly to keep under version control, as a notebook is in principle one very large JSON file containing lots of metadata, the output of your cells, etc. This also makes it quite hard to collaborate on them. However, a while ago I stumbled upon Jupytext, which syncs your Jupyter notebooks with a paired file in a format of your choice (e.g. Markdown, R Markdown, plain Python, etc.). These "paired" files, which can reside either alongside your Jupyter notebooks or in a separate directory, can then easily be version-controlled. Jupytext can be used either from the command line (jupytext --sync notebooks/*.ipynb) or as a Jupyter plugin. For more information, have a look at their documentation site.
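For instance, a minimal jupytext.toml in the project root that pairs every notebook with a percent-format Python file might look like this (a sketch; see the Jupytext documentation for the full pairing syntax, e.g. for pairing into a separate directory like notebooks/_paired):

# jupytext.toml
formats = "ipynb,py:percent"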

Project Structure

├── assets             <- A place for assets like shapefiles or config files
│
├── data               <- Contains all data used for the analyses in this project.
│   │                     The sub-directories can be links to the actual location of your data.
│   │                     However, they should never be under version control! (-> .gitignore)
│   ├── interim        <- Intermediate data that have been transformed from the raw data
│   ├── processed      <- The final, processed data used for the actual analyses
│   └── raw            <- The original, immutable(!) data
│
├── docsrc             <- The technical documentation (default engine: Sphinx; but feel free to
│                         use MkDocs, Jupyter-Book, or anything similar).
│                         This should contain only documentation of the code and the assets.
│                         A report of the actual project should be placed in `reports/book`.
│
├── logs               <- Storage location for the log files generated by scripts
│
├── notebooks          <- Jupyter notebooks. Follow a naming convention, such as a number (for ordering)
│   │                     and a short `-` or `_` delimited description, e.g. `01-initial-analyses`
│   ├── _paired        <- Optional location for your paired Jupyter notebook files
│   ├── exploratory    <- Notebooks for exploratory tasks
│   └── reports        <- Notebooks generating reports and figures
│
├── references         <- Data descriptions, manuals, and all other explanatory materials
│
├── reports            <- Generated reports (e.g. HTML, PDF, LaTeX, etc.)
│   ├── book           <- A Jupyter-Book describing the project
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── scripts            <- High-level scripts that use (low-level) source code from `src/`
├── src                <- Source code (and only source code!) for use in this project
│   ├── tests          <- Contains tests for the code in `src/`
│   └── __init__.py    <- Makes src a Python module and provides some standard variables
│
├── .env               <- In this file, specify all your custom environment variables.
│                         Keep this out of version control! (i.e. have it in your .gitignore)
├── .gitignore         <- Here, list all the files and folders (patterns allowed) that you want to
│                         keep out of git version control.
├── CHANGELOG.md       <- All major changes should go in there
├── jupytext.toml      <- Configuration file for Jupytext
├── LICENSE            <- The license used for this project
├── Makefile           <- A self-documenting Makefile for standard CLI tasks
├── README.md          <- The top-level README of this project
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
└── setup.py           <- Makes the project pip-installable (`pip install -e .`) so that `src`
                          can be imported in your (virtual) Python environment

Dummy files

The following files are for demonstration purposes only and, if not needed, can be deleted safely:

├── notebooks/01-minimal-example.ipynb
├── docsrc/*
├── reports/book/*
├── scripts/01-test.py
└── src
    ├── tests/*
    └── submodule.py

Sources of inspiration

Some great sources of inspiration and orientation when I created this template:

Contributing

Issues & pull requests accepted.


© Markus Ritschel, 2023
