geocompx / docker

Dockerfiles for Geocomputation

Home Page: https://github.com/geocompx/docker/pkgs/container/docker

Refactor images, start from same base

Robinlovelace opened this issue

As discussed with @benz0li, we could use the b-data stack throughout, which would give us more consistency and good R/Python support. First we would need to think about similarities/differences and pros/cons, so I'm leaving this as an open question. I'm a bit out of my depth, so any advice / PRs welcome : )

Feedback regarding image size (uncompressed): glcr.b-data.ch/jupyterlab/r/tidyverse (3.18 GB) vs rocker/tidyverse (2.38 GB):

glcr.b-data.ch/jupyterlab/r/tidyverse is 800 MB bigger because:

  1. code-server is bigger than RStudio Server OSE [1]
  2. there is JupyterLab (incl. dependencies) installed
  3. some R packages have been moved up the build chain

Footnotes

  1. There are many Code extensions pre-installed.

Feedback regarding image size (uncompressed): glcr.b-data.ch/jupyterlab/python/scipy (4.35 GB) vs jupyter/scipy-notebook (4.14 GB).

There are too many differences (see below) and it is sheer coincidence that the images are almost the same size.
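For anyone who wants to redo the arithmetic behind the two size comparisons above, here is a minimal sketch. The sizes are the uncompressed figures quoted in this thread; the function name `size_gap` and the dictionary layout are only illustrative.

```python
# Uncompressed image sizes in GB, as quoted in the comments above.
sizes_gb = {
    # b-data R tidyverse image vs rocker/tidyverse
    "r/tidyverse": {"b_data": 3.18, "reference": 2.38},
    # b-data Python scipy image vs jupyter/scipy-notebook
    "python/scipy": {"b_data": 4.35, "reference": 4.14},
}


def size_gap(b_data, reference):
    """Return (absolute gap in GB, gap as a percentage of the reference image)."""
    diff = round(b_data - reference, 2)
    percent = round(100 * diff / reference, 1)
    return diff, percent


for name, sizes in sizes_gb.items():
    diff, percent = size_gap(sizes["b_data"], sizes["reference"])
    print(f"{name}: +{diff:.2f} GB ({percent:.1f}% larger than the reference image)")
```

The absolute gap for the R images matches the 800 MB mentioned above. The relative figures are only a rough guide, since the images differ in much more than size.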

JupyterLab R docker stack vs Rocker images (versioned stack):

Differences

  1. Base image: Debian instead of Ubuntu
    (CUDA-enabled images are Ubuntu-based)
    • Unminimized, i.e. including man pages
  2. IDE: JupyterLab + code-server instead of RStudio
  3. Source builds [1] installed for
    • Python
    • Git
    • Git LFS
  4. Shell: Zsh
  5. Additional image glcr.b-data.ch/jupyterlab/r/qgisprocess
  6. Missing images:
    • .../geospatial:dev-osgeo
    • .../shiny
    • .../shiny-verse

Similarities

  1. Image names
    Exceptions:
    • .../ver vs .../r-ver
    • .../base vs .../rstudio
  2. Installed R packages
    • some R packages have been moved up the build chain

Footnotes

  1. Newer versions than the distro's repository. Installed at /usr/local/bin.

JupyterLab Python docker stack vs Jupyter Docker Stacks:

Differences

  1. Base image: Debian instead of Ubuntu
    • Unminimized, i.e. including man pages
  2. IDE: code-server next to JupyterLab
  3. Just Python – no Conda / Mamba
  4. GPU accelerated images available
    (CUDA-enabled images are Ubuntu-based)
  5. Source builds [1] installed for
    • Python
    • Git
    • Git LFS
  6. Shell: Zsh
  7. The scipy image includes Quarto and TinyTeX

Similarities

  1. Image names
    • b-data's docker stack only offers base and scipy [2]
  2. Installed Python packages

Footnotes

  1. Newer versions than the distro's repository. Installed at /usr/local/bin.

  2. Any other Python package can be installed at user level.
    👉 Because my images allow (bind) mounting the whole home directory, user data is persisted – i.e. survives container restarts. (Cross reference: https://github.com/jupyter/docker-stacks/issues/1478)
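To make the bind-mount point in footnote 2 concrete, here is a rough sketch that only assembles a `docker run` command line; it does not start a container. The container home directory `/home/jovyan`, the port 8888 and the helper name `docker_run_command` are assumptions for illustration, not taken from this thread. Depending on the host, extra options for file ownership may also be needed, so check the image documentation before relying on it.

```python
import shlex
from pathlib import Path


def docker_run_command(image, host_home, container_home="/home/jovyan", port=8888):
    """Build (but do not execute) a docker run command with a bind-mounted home.

    container_home and port are assumed defaults, not confirmed by the thread.
    """
    host_home = Path(host_home).expanduser().resolve()
    return [
        "docker", "run", "-it", "--rm",
        "-p", f"{port}:{port}",                 # expose the assumed JupyterLab port
        "-v", f"{host_home}:{container_home}",  # bind mount: data outlives the container
        image,
    ]


cmd = docker_run_command(
    "glcr.b-data.ch/jupyterlab/python/scipy",  # image name as given in this thread
    "~/jupyterlab-home",                       # hypothetical host directory
)
print(shlex.join(cmd))
```

Packages installed at user level land inside the mounted home directory, which is why they survive a container restart.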

b-data's JupyterLab docker stack:

Pros

  1. All JupyterLab docker stacks include the same tool set
    • Integrated Development Environments: JupyterLab + code-server
    • Programming Language: Python (+ R or Julia)
    • GPU accelerated images available
  2. Data Science Dev Containers with similar features

ℹ️ The JupyterLab docker stacks and Data Science Dev Containers also support rootless mode.

Cons

  1. b-data is a one-man GmbH (LLC)
    • Time for image maintenance is limited
  2. No financial support (sponsorship)
    • Exception: JupyterLab R docker stack
      ℹ️ Work partially funded by Agroscope.
  3. No testing of docker images
  4. Using own Docker Registry
    • No plan to release on Docker Hub or Quay

Any other similarities/differences and pros/cons that come to your mind?

  1. @mathbunnyru: Regarding Jupyter Docker Stacks vs JupyterLab Python docker stack
  2. @eitsupi: Regarding Rocker images (versioned stack) vs JupyterLab R docker stack

I would like to give @Robinlovelace a solid base for his decision on how to refactor the images (possibly use b-data's image as a base).

I would also add as pros of https://github.com/jupyter/docker-stacks:

  • automatic updates (including security updates of Ubuntu base image and Python packages)
  • better image testing
  • better image tagging
  • readable build manifests for images

And cons:

  • building one image set at a time (this is a choice the project made, but still, it might be a downside for people who want an old Python version with modern packages at the same time)

@mathbunnyru Thank you for the feedback.

I would also add as pros of https://github.com/jupyter/docker-stacks:

  • automatic updates (including security updates of Ubuntu base image and Python packages)
  • better image testing
  • better image tagging
  • readable build manifests for images

I fully agree. Cross reference regarding image tagging and readable build manifests: https://github.com/jupyter/docker-stacks/wiki/2023-11

And cons:

  • building one image set at a time (this is a choice the project made, but still, it might be a downside for people who want an old Python version with modern packages at the same time)

Same for b-data's JupyterLab docker stacks. Rocker images (versioned stack) handle this differently.

Many thanks for all the detailed info + thoughts. At present I'm leaning towards b-data's JupyterLab docker stack with Data Science Dev Containers.

@eitsupi Any feedback from your side?

I'm sorry, but since I don't know the context, I don't think I can give any particular advice.

If I had to say, I'm very reluctant to support Python on https://github.com/rocker-org/rocker-versioned2, unlike @cboettig and @yuvipanda, and I think it's better for users to install their favorite version of Python using micromamba, rye, or something similar.
(Additionally, R can now be installed using rig, making it very easy to install your favorite R version.)

Thanks all, any other general thoughts welcome. Seeing as we're talking about Python, a Python-focussed image makes sense. Good idea re. rig, do you know of any example Dockerfiles that use it, @eitsupi?

re. rig, do you know of any example Dockerfiles that use it, @eitsupi?

The rig repository has it.
https://github.com/r-lib/rig/blob/1e335785f95c3669bf04a43bc6a0da4862e401db/containers/r/Dockerfile

Just install it using curl and run the rig add command.

Awesome. Many thanks!

@Robinlovelace Great discussion and tricky issues here. Love the work you are doing in this space and bridging between these communities.

For geospatial-focused Python Docker images, I would definitely look to the Pangeo stack: https://github.com/pangeo-data/pangeo-docker-images. @yuvipanda can correct me if I'm wrong, but I believe these are derived from the standard Jupyter stack linked above, and importantly (imo) they include prebuilt images with GPU support, which can be a common stumbling block. Note that the Microsoft Planetary Computer docker images are also derived from the Pangeo stack -- it's well maintained and widely used.

As @eitsupi mentions, @yuvipanda and I do want to see good native Python support in Rocker, especially when it comes to geospatial. As you know, there really isn't a clean separation of R from Python here -- the core geospatial C libraries like GDAL already have (technically optional but important) dependencies on Python. I also think there's a compelling case for users, instructors & platform providers to want a similar base that can work across Python and R -- especially when a lot of the heavy lifting is being done by the same OSGeo C libraries, it makes sense to have access to the same versions of those libraries from both ecosystems. Lastly, R users in particular may benefit from a more 'batteries included' approach to Python, rather than navigating the wide space of miniconda, conda, mamba, pyenv, system Python and so forth with reticulate, especially in areas like geospatial where these packages call system C libraries (or come prepackaged with binaries). It's relatively easy to get into a mess. rocker-org/rocker-versioned2#718 looks like a promising setup, but we're still testing.

Anyway, don't mean to say that you should definitely be using rocker for this, but only that it's a use case that we're also trying to address in the rocker/geospatial images already. The pangeo stack is already solid, and I'm glad @yuvipanda has started bringing his expertise from there over to us in rocker 🙏 .
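One way to check the "same library versions from both ecosystems" point inside a given image is to ask each side which GDAL release it sees. This is only a sketch under assumptions: it presumes the GDAL Python bindings (`osgeo`) and the R package sf are installed, and it returns `None` when either is missing. The helper names are made up for illustration.

```python
import shutil
import subprocess


def gdal_version_from_python():
    """Return the GDAL release seen by the Python bindings, or None if unavailable."""
    try:
        from osgeo import gdal  # GDAL's Python bindings, if installed
    except ImportError:
        return None
    return gdal.VersionInfo("RELEASE_NAME")


def gdal_version_from_r():
    """Return the GDAL release that the R package sf links against, or None."""
    if shutil.which("Rscript") is None:
        return None
    result = subprocess.run(
        ["Rscript", "-e", 'cat(sf::sf_extSoftVersion()[["GDAL"]])'],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:  # e.g. sf is not installed
        return None
    return result.stdout.strip() or None


def same_release(a, b):
    """True only if both versions are known and identical."""
    return a is not None and b is not None and a == b


if __name__ == "__main__":
    py_version = gdal_version_from_python()
    r_version = gdal_version_from_r()
    print(f"GDAL via Python: {py_version}, via R: {r_version}, "
          f"match: {same_release(py_version, r_version)}")
```

A mismatch does not necessarily mean something is broken, but it is a quick hint that R and Python are not sharing the same system library.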

For geospatial-focused Python Docker images, I would definitely look to the Pangeo stack: https://github.com/pangeo-data/pangeo-docker-images. @yuvipanda can correct me if I'm wrong, but I believe these are derived from the standard Jupyter stack linked above, and importantly (imo) they include prebuilt images with GPU support, which can be a common stumbling block. Note that the Microsoft Planetary Computer docker images are also derived from the Pangeo stack -- it's well maintained and widely used.

There is glcr.b-data.ch/jupyterlab/cuda/r/qgisprocess (CUDA-enabled JupyterLab R docker stack)


And there is glcr.b-data.ch/jupyterlab/cuda/qgis/base (CUDA-enabled JupyterLab QGIS docker stack):

[Image: CUDA screenshot]

QGIS in the browser with a full GUI. Wow!