Sustainable (Small) Data Science

Exploring practices for sustainable (small) data science.

This document was developed as part of an IT Workshop for the Stanford Graduate School of Education.

Useful tips, tools, and processes

The following list provides an opinionated overview of useful tools and processes that can help to set up a sustainable small data science project. It assumes that individual researchers or small teams will be responsible for the various tasks involved in these kinds of undertakings.

These tasks are grouped into three distinct project phases, which usually involve different considerations and different kinds of engagement with collaborators, readers, and users.

Initiation phase: Setting up project/data structures
Ongoing phase: Development and analysis
Completion phase: Publishing and dissemination

Initiation phase

Project structure

Cookiecutter template
My own project structure (provide examples)
1. Data, notebooks, scripts, outputs
Separate data work from scholarly articles
1. Usual published project structure:
  1. Versioned code repository with DOI (GitHub + Zenodo)
  2. Versioned data repository with DOI (Dataverse)
  3. Versioned repository with code/data to reproduce article (GitHub + Zenodo)
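
The "data, notebooks, scripts, outputs" layout mentioned above can be sketched in a few lines of Python. This is only a minimal illustration, not the author's actual template: the folder names are taken from the list, while the function name `create_project_skeleton` and the use of `.gitkeep` placeholder files are assumptions made for the example.

```python
from pathlib import Path

# Folder names follow the list above; adjust them to your own conventions.
PROJECT_DIRS = ("data", "notebooks", "scripts", "outputs")


def create_project_skeleton(root):
    """Create the project folders under `root` and return their paths.

    Hypothetical helper for illustration only.
    """
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        folder = root / name
        # exist_ok=True makes the function safe to re-run on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty folders, so add an empty placeholder file.
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created
```

Calling `create_project_skeleton("my-project")` would create the four folders inside `my-project`. A Cookiecutter template serves the same purpose with more flexibility; the sketch only shows the idea.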

Ongoing phase

Utility

Progress bars: tqdm
APIs: Postman
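
As a small illustration of the progress bar tip, the sketch below wraps a loop in `tqdm`. The function `total_length` and its input are made up for the example, and the fallback branch is an assumption added so the snippet still runs (without a progress bar) when tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch runs without tqdm: iterate with no progress bar.
    def tqdm(iterable, **kwargs):
        return iterable


def total_length(records):
    """Sum the lengths of all records, showing progress while iterating.

    Hypothetical stand-in for any slow, per-item processing step.
    """
    total = 0
    # Wrapping the iterable is all that is needed; `desc` labels the bar.
    for record in tqdm(records, desc="Processing"):
        total += len(record)
    return total
```

For long-running loops over files or API responses, this gives an estimate of the remaining time at almost no cost in code.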

Development

Dependency management: Poetry, pyenv, pipx
Leverage notebooks and interactive development environments (I can provide a deep dive into my local dev setup, though it may not be relevant for non-Pythonistas)
Serious development in notebooks: nbdev

Research process

GitHub Wiki as a research log

Completion phase

Collaboration

Turn notebooks into slides: RISE
GitHub Pages

License: Creative Commons Zero v1.0 Universal