ZihengSun / kerchunk-cookbook

Project Pythia cookbook for Kerchunk

Home Page: https://projectpythia.org/kerchunk-cookbook/

Kerchunk Cookbook

This Project Pythia Cookbook covers using the Kerchunk library to access archival data formats as if they were ARCO (Analysis-Ready-Cloud-Optimized) data.

Motivation

The Kerchunk library allows you to access chunked and compressed data formats (such as NetCDF3, HDF5, GRIB2, TIFF and FITS), many of which are the primary formats of data archives, as if they were in an ARCO format such as Zarr, which allows for parallel, chunk-specific access. Instead of creating a new copy of the dataset in the Zarr format, Kerchunk reads through the archival file and extracts the byte range and compression information of each chunk, then writes that information to a .json file (or to alternate backends in future releases). For more details on how this process works, please see the Kerchunk docs. These summary files can then be combined to generate a Kerchunk reference for the whole dataset, which can be read via Zarr and Xarray.
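To make the byte-range idea concrete, here is a toy sketch that uses only the Python standard library and does not call Kerchunk itself. The file layout, the `var/0`-style keys, and the `read_chunk` helper are assumptions made for illustration; the `[file, offset, length]` entries under a `refs` key mirror the general shape of a Kerchunk JSON reference. In a real reference set the compression information is stored alongside the byte ranges; here it is hard-coded as zlib.

```python
import json
import os
import tempfile
import zlib

# Build a stand-in "archive file": a header followed by two compressed chunks.
chunks = [b"temperature values " * 4, b"pressure values " * 4]
header = b"TOYFORMAT"
refs = {}

path = os.path.join(tempfile.mkdtemp(), "archive.bin")
with open(path, "wb") as f:
    f.write(header)
    for i, raw in enumerate(chunks):
        payload = zlib.compress(raw)
        # Record where the chunk lives: [file, byte offset, byte length].
        refs[f"var/{i}"] = [path, f.tell(), len(payload)]
        f.write(payload)

# The summary is plain JSON; the original file is never copied or rewritten.
reference_json = json.dumps({"version": 1, "refs": refs})


def read_chunk(reference_json, key):
    """Fetch one chunk using only the byte range stored in the references."""
    url, offset, length = json.loads(reference_json)["refs"][key]
    with open(url, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(length))


second = read_chunk(reference_json, "var/1")
print(second == chunks[1])
```

Because each chunk is addressed independently, a reader can fetch only the chunks it needs, and several readers can fetch different chunks in parallel.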

Authors

Raphael Hagen

Much of the content of this cookbook was inspired by Martin Durant, the creator of Kerchunk and the Kerchunk documentation.

Structure

This cookbook is broken up into two sections, Foundations and Case Studies.

Section 1 Foundations

In the Foundations section we will demonstrate how to use Kerchunk to create reference sets from single file sources, as well as to create multi-file virtual datasets from collections of files.
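As a rough preview of that workflow, the sketch below wraps the two steps in helper functions. It assumes HDF5/NetCDF4 input files and that `kerchunk` and `fsspec` are installed; the helper names `single_file_refs` and `combine_refs`, and the default dimension names `time`, `lat` and `lon`, are hypothetical choices for illustration rather than anything fixed by the cookbook. Other formats use different Kerchunk classes, and call signatures may differ between Kerchunk releases.

```python
def single_file_refs(url, storage_options=None):
    """Scan one HDF5/NetCDF4 file and return its reference set as a dict."""
    # Imported lazily so this sketch can be loaded without Kerchunk installed.
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    with fsspec.open(url, "rb", **(storage_options or {})) as f:
        # translate() walks the file and records each chunk's byte range.
        return SingleHdf5ToZarr(f, url).translate()


def combine_refs(ref_list, concat_dim="time", identical_dims=("lat", "lon")):
    """Merge per-file reference dicts into one multi-file virtual dataset."""
    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        ref_list,
        concat_dims=[concat_dim],             # dimension the files are stacked along
        identical_dims=list(identical_dims),  # dimensions shared by every file
    )
    return mzz.translate()
```

A typical use would be to call `single_file_refs` once per file, collect the results in a list, and pass that list to `combine_refs`. Remote files may need extra options (for example credentials or a remote protocol) that are omitted here.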

Section 2 Case Studies

The notebooks in the Case Studies section demonstrate how to use Kerchunk to create datasets for all the supported file formats. Kerchunk currently supports NetCDF3, NetCDF4/HDF5, GRIB2, TIFF (including COG) and FITS, and more file formats will be available in the future.

Future Additions / Wishlist

This Pythia cookbook is a start, but there are many more details of Kerchunk that could be covered. If you have an idea of what to add or would like to contribute, please open up a PR or issue.

Some possible additions:

  • Diving into the details: The nitty-gritty on how Kerchunk works.
  • Kerchunk and Parquet: The benefits of using Parquet for reference file storage.
  • Appending to a Kerchunk dataset: How to schedule processing of newly added data files and how to add them to a Kerchunk dataset.

Running the Notebooks

You can run the notebooks either using Binder or on your local machine.

Running on Binder

The simplest way to interact with a Jupyter Notebook is through Binder, which enables the execution of a Jupyter Book in the cloud. The details of how this works are not important for now. All you need to know is how to launch a Pythia Cookbook chapter via Binder. Move your mouse to the top right corner of the book chapter you are viewing, click on the rocket ship icon, and select “launch Binder”. After a moment you should be presented with a notebook that you can interact with. You’ll be able to execute and even change the example programs. The code cells have no output at first; output appears once you execute them by pressing Shift+Enter. Complete details on how to interact with a live Jupyter notebook are described in Getting Started with Jupyter.

Running on Your Own Machine

If you are interested in running this material locally on your computer, you will need to follow this workflow:

  1. Install mambaforge/mamba

  2. Clone the https://github.com/ProjectPythia/kerchunk-cookbook repository:

     git clone https://github.com/ProjectPythia/kerchunk-cookbook.git

  3. Move into the kerchunk-cookbook directory:

     cd kerchunk-cookbook

  4. Create and activate your conda environment from the environment.yml file. Note: in the environment.yml file, Kerchunk is currently installed from source because development is happening rapidly.

     mamba env create -f environment.yml
     mamba activate kerchunk-cookbook

  5. Move into the notebooks directory and start up JupyterLab:

     cd notebooks/
     jupyter lab

License: Apache License 2.0