There are 10 repositories under the slurm topic.
Machine Learning Engineering Open Book
A DSL for data-driven computational pipelines
dstack is an open-source control plane for running development, training, and inference jobs on GPUs—across hyperscalers, neoclouds, or on-prem.
A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
Best practices & guides on how to write distributed PyTorch training code
Lightweight fast function pipeline (DAG) creation in pure Python for scientific (HPC) workflows 🕸️🧪
A Slurm cluster using docker-compose
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
A scheduler for GPU/CPU tasks
Create clusters of VMs on the cloud and configure them with Ansible.
An open-source toolkit for deploying and managing high performance clusters for HPC, AI, and data analytics workloads.
Simplify HPC and Batch workloads on Azure
A Cross-Platform, Multi-Cloud High-Performance Computing Platform
Prometheus exporter for performance metrics from Slurm.
Tools for computation on batch systems
Run Slurm on Kubernetes. A Slinky project.
A simple Snakemake profile for Slurm without --cluster-config
R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or any of these via SSH
A toolset for black-box hyperparameter optimisation.
Funnel is a toolkit for distributed task execution via a simple, standard API.
A collection of various resources, examples, and executables for the general NREL HPC user community's benefit. Use the following website for accessing documentation.
Slurm in Docker - Exploring Slurm using CentOS 7-based Docker images
Singularity implementation of k8s operator for interacting with SLURM.
Slurm-Mail is a drop-in replacement for Slurm's e-mails that gives users much more information about their jobs than the standard Slurm e-mails.
A template for starting reproducible Python machine-learning projects with hardware acceleration. Find an example at https://github.com/CLAIRE-Labo/no-representation-no-trust
A TUI application for monitoring and managing SLURM jobs.