This repository provides a common framework to build, test and deploy Docker container images at CSCS. The repository is organised as a collection of folders where each folder contains one or more of the Dockerfiles describing the build process of the containerized application and optional files needed for testing.
This repository is registered at CSCS CI/CD service and is using build farm infrastructure to run pipelines for any of the applications. For each pipeline three stages are defined:
stages:
- build
- test
- deploy
and each stage can define multiple jobs that GitLab runner will execute. The pipeline is triggered by the pull-request to the main
branch of repository (1). Then CI/CD middleware triggers the rebuild of all changed Dockerfiles on the build farm (2). In case of success build farm stores the resulting container in the temporary location inside CSCS container registry (3). After that the optional test
stage is executed on the HPC platform to test the container (4) and finally the container is pushed to a persistent location in the container registry. Please have a look at CI/CD template file for the definition of basic templates for each of the stages.
❗ Important: at the moment build farm runs on AMD zen2 architecture |
---|
New recipes are added via pull request to the main
branch. Each PR must contain:
- one or more of the Dockerfiles
ci.yml
- script that defines the CI/CD pipelineREADME.md
- short description of the application and Dockerfile(s)
The PR will be reviewed and a new pipeline will be added to the project. Minimalistic ci.yml
should contain the following:
include:
- local: '/ci/common.yml'
build my image:
extends: .build-image
variables:
DOCKERFILE: myapp/Dockerfile
DOCKER_BUILD_ARGS: '["VAR=value"]'
deploy my image:
extends: .deploy-image-jfrog
needs: ["build my image"]
variables:
APP: 'myappname'
ARCH: 'a100'
VERSION: '1.0'
with the the following directory structure for myapp
:
container-recipes/
| ci/
| myapp/
| | Dockerfile
| | ci.yml
| | README.md
The example pipeline above uses two stages - build my image
and deploy my image
. You can avoid deployment stage, but then your container will be stored in a temporary location in CSCS container registry and cleaned up at any time without notice. More information about variables used in each stage is available in the next section.
If the build stage is successful, the final image is pushed to $CSCS_REGISTRY/contbuild/apps/public/$ARCH/$APP:$VERSION
. At the moment CSCS_REGISTRY
variable points to https://jfrog.svc.cscs.ch/artifactory.
Run sarus pull https://jfrog.svc.cscs.ch/artifactory/contbuild/apps/public/a100/myappname:1.0
from the compute or login node to pull application's image to your local working directory.
We use GitLab runner to define and execute logic of the pipelines. The full documentation for GitLab ci.yml
keywords is available here. On top of it CSCS middleware defines several variables to configure slurm and container engine runners. On top of both our CI/CD template file defines the following templates to build, test and deploy images.
Template for building images. Main variables:
stage: build
variables:
# name of the Dockerfile
- DOCKERFILE: string
# (optional) list of arguments passed to `docker build`
# command via `--build-arg` argument
- DOCKER_BUILD_ARGS: string of the parameters list, for example '["VAR=value"]'
Example:
build nvhpc cuda qe71 image:
extends: .build-image
variables:
DOCKERFILE: QuantumESPRESSO/nvhpc/cuda/Dockerfile
DOCKER_BUILD_ARGS: '["CUDA_ARCH=80", "QE_VERSION=qe-7.1"]'
Templates for running ReFrame tests on cpu or a100 architectures. Reframe templates do not take any arguments, however a global variable REFRAME_ARGS
that defines ReFrame arguments to pass container image name and name of the test must be defined. For example:
# global variable definition
variables:
REFRAME_ARGS: '-S myapp_image="$BASE_IMAGE" -c cscs-reframe-tests/checks/apps/myapp/myapp_check.py'
Then the ReFrame test stage is defined simply as:
test myapp a100:
needs: ["build my image"]
extends: .run-reframe-test-a100
ReFrame command will be executed by a gitlab runner from the root folder of the repository (from ./container-recipes
). Inside Python test the following line will be relevant
myapp_image = variable(str, value='NULL')
Template for running Slurm jobs on a cpu or a100 partition of a TDS cluster. Main variables:
stage: run
variables:
# ('YES'/'NO') use MPI hook to inject Cray's optimised mpich library into container
- USE_MPI: string
# number of nodes for the slurm job (-N argument of srun)
- SLURM_JOB_NUM_NODES: integer
# number of CPU cores per MPI rank (-c argument of srun)
- SLURM_CPUS_PER_TASK: integer
# total number of MPI ranks (-b argument of srun)
- SLURM_NTASKS: integer
# a set of commands that will be executed inside sbatch submission script
script:
- command1
- command2
- ...
Execution starts from the root folder of the project. Example:
run myapp cpu:
extends: .run-test-hohgant-cpu
needs: ["build myapp"]
variables:
USE_MPI: 'YES'
SLURM_JOB_NUM_NODES: 1
SLURM_CPUS_PER_TASK: 8
OMP_NUM_THREADS: $SLURM_CPUS_PER_TASK
SLURM_NTASKS: 16
script:
- cd myapp/test
# run myapp inside container on 16 MPI ranks with 8 threads/rank
- /path/to/myapp [optional arguments]
Deploy image at persistent location in container registry. Main variables:
stage: deploy
variables:
# name of the application (no capitals and no special symbols)
- APP: string
# container's target architecture (can be one of a100, zen2, zen3)
- ARCH: string
# version of the application
- VERSION: string
Example:
deploy myapp:
extends: .deploy-image-jfrog
variables:
APP: 'myapp'
ARCH: 'a100'
VERSION: '1.0'