This is a project template powered by Cookiecutter for use with datakit-project.
```
.
├── .gitignore
├── README.md
├── analysis
│   └── archive
├── data
│   ├── documentation
│   ├── handmade
│   ├── html_reports
│   ├── processed
│   ├── public
│   └── source
├── etl
├── publish
├── scratch
└── viz
```
.gitignore
- Ignores a few typical temporary/unnecessary files common to most data projects.
README.md
- Project-specific readme with boilerplate for data projects.
- Includes sourcing details and places to explain how to replicate/remake the project.
analysis
- Code that runs analysis on already-cleaned data goes here. Code for cleaning data should go in `etl`.
- Multiple analysis files are numbered sequentially.
- If we are sharing the data, the last analysis script is called `make_dw_files.[R,py]` and writes CSVs to the `data/public` folder.
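As an illustration only, a minimal Python `make_dw_files.py` might look like the sketch below (pandas is assumed, and the input pickle and output CSV names are hypothetical):

```python
# make_dw_files.py -- sketch of a final analysis step that shares data.
# The file names below are hypothetical examples.
import pandas as pd

# Load cleaned data produced by the ETL scripts in data/processed.
clean = pd.read_pickle("data/processed/etl_inspections.pkl")

# Write the shareable, 'live' dataset to the public folder.
clean.to_csv("data/public/inspections.csv", index=False)
```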
analysis/archive
- Any analyses for story threads that are no longer being investigated are placed here for reference.
data
- This is the directory used with our datakit-data plugin.
data/documentation
- Documentation on data files should go here - data dictionaries, manuals, interview notes.
data/handmade
- Data sets created manually by reporters go here.
data/html_reports
- Any HTML reports or pages generated by code should go here.
data/processed
- Data that has been processed by scripts in this project and is clean and ready for analysis goes here.
data/public
- Public-facing data files go here, i.e., final datasets we share with reporters or otherwise make accessible. These are the data files which are 'live'.
data/source
- Original data from sources goes here.
etl
- ETL (extract, transform, load) scripts for reading in source data and cleaning and standardizing it to prepare for analysis go here.
- Multiple ETL files are numbered sequentially.
- Joins are included in the ETL process.
- The last step of the ETL process is to output an [RDS,Pickle] file to `data/processed`.
- Naming convention: `etl_WHATEVERNAME.[rds,pkl]` (see the sketch below).
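As a sketch only, a Python ETL script following these conventions might look like this (pandas is assumed, and the source file and column names are hypothetical):

```python
# etl_inspections.py -- hypothetical ETL sketch: read source data, clean it,
# and write the result to data/processed following the etl_NAME.pkl convention.
import pandas as pd

# Read the original data from data/source (file name is an assumption).
raw = pd.read_csv("data/source/inspections.csv")

# Example cleaning steps: standardize column names and parse dates.
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")
raw["inspection_date"] = pd.to_datetime(raw["inspection_date"], errors="coerce")

# Last ETL step: output a pickle to data/processed.
raw.to_pickle("data/processed/etl_inspections.pkl")
```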
publish
- This directory holds all documents in the project that will be public-facing (e.g. data.world documents).
scratch
- This directory contains scratch materials that will not be part of the final project.
- Common cases are filtered tables or quick visualizations for reporters.
- This directory is not tracked in git.
viz
- Graphics and visualization development work, such as web interactive code, should go here.
You will need to clone this repository to `~/.cookiecutters/` (make the directory if it doesn't exist):

```
cd path/to/.cookiecutters
git clone git@github.com:associatedpress/cookiecutter-generic-project
```
Then, use `datakit project`:

```
datakit project create --template cookiecutter-generic-project
```
If you'd like to avoid specifying the template each time, you can edit `~/.datakit/plugins/datakit-project/config.json` to use this template by default:

```json
{"default_template": "/Users/lfenn/.cookiecutters/cookiecutter-generic-project"}
```
Dependencies:
- UV:

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
The UV project template aims to address a few pain points:
- JupyterLab/Notebook installations are often unwieldy. It's not uncommon to think you're using a kernel you installed yourself, only to find that a package you've recently installed is using a kernel somewhere else. We try to keep Jupyter installs contained in each project within the virtualenv's `.venv` folder.
- We don't want to choose between restricting our analysis notebooks to one location or memorizing relative paths to data so we can put the notebooks wherever we want. We fix this by telling our ipython kernels to act like they're at the root of the project, no matter where the notebook files are in the project (similar behavior to RStudio); see the first sketch after this list.
- We want a package manager, not just a dump to requirements.txt. This template uses UV.
- Jupyter notebooks are complicated and not appropriate for git tracking. Simply opening a notebook alters the file, triggering changes that git picks up on. We use Jupytext to link notebooks with simpler markdown files that only change when actual code changes. Only the markdown files are git-tracked; see the second sketch after this list.
- Templates for notebooks.
- Quarto exports to HTML.
- What a readme entails, more boilerplate, base minimum viable documentation.
- Interoperability between Python and R.
- Dockerizing certain complex environments like gdal/mapping, similar to the Selenium container.
- Pinning Python interpreter versions.
- YAML config and Dockerfile for CI tasks. Can we make it part of datakit's interactive setup?
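The root-anchored kernel behavior can be pictured with a small sketch; this illustrates the idea only, not necessarily the template's exact mechanism. An IPython startup file can change the working directory to the project root before any notebook code runs, so relative paths like `data/source/...` resolve the same way from any notebook:

```python
# Hypothetical IPython startup file (its location and the exact mechanism
# are assumptions). It walks up from the current directory to the first
# folder containing pyproject.toml and chdirs there, so relative paths
# resolve from the project root no matter where the notebook lives.
import os
from pathlib import Path

cwd = Path.cwd()
for candidate in (cwd, *cwd.parents):
    if (candidate / "pyproject.toml").exists():
        os.chdir(candidate)
        break
```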
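And as a minimal illustration of the Jupytext pairing (using Jupytext's Python interface; the file names are hypothetical), the git-tracked markdown file carries the notebook's code and can round-trip to `.ipynb`:

```python
# Jupytext round-trip sketch (file names are hypothetical). In practice
# the template's pairing is handled by Jupytext automatically; this just
# shows that the markdown representation is a full notebook.
import jupytext

# Read the git-tracked markdown file as a notebook...
nb = jupytext.read("analysis/01_clean_counts.md")

# ...and write it back out as an .ipynb for JupyterLab to open.
jupytext.write(nb, "analysis/01_clean_counts.ipynb")
```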
After running `datakit project create` with the Python UV template, Jupyter should be installed in the `.venv` directory of your project, complete with a customized Jupyter ipython kernel named after the project. From there, if a package beyond `jupyterlab`, `ipython`, `jupytext`, and `jupyterlab_templates` is necessary for the project, use `uv add [package]` to install it. If there is a package you like that is not necessary to the project (think dev tools, like vim keybindings), install it with `uv pip install [package]`; such packages will not be installed when a teammate clones your project and runs the initial setup. To get rid of a package, use `uv remove [package]`. To upgrade a package, use `uv lock --upgrade-package [package]`.
If you're cloning an existing UV project, follow these steps:

```
git clone ap-project
cd ap-project
uv venv   # make the virtual environment
uv sync   # install necessary packages
```
You can set the default name, email, etc. for a project in the `cookiecutter.json` file.
When cloning a Datakit project that someone else created, you will need to create a virtual environment and install the dependencies. You can do this by running the following from the terminal:

```
uv venv
uv sync
```