This repository is a tutorial for the Kochi workflow management tool.
Kochi: https://github.com/s417-lama/kochi
Kochi assumes that the project for running experiments is managed as a git repository.
In the following tutorial, we use this repository, `kochi-tutorial`, as the target project.
This section explains how to run Kochi on a local computer without using login or compute nodes.
Install Kochi on the local computer:
```shell
pip3 install git+https://github.com/s417-lama/kochi.git
```
The default workspace for Kochi is created at `~/.kochi`.
You can change this location by setting the `KOCHI_ROOT` environment variable.
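For example, to keep the Kochi workspace on a larger (or shared) file system, you could set this variable before running any `kochi` command (the scratch path below is only an illustration):

```shell
# Relocate the Kochi workspace (default: ~/.kochi).
# The target path is only an example; any writable directory works.
export KOCHI_ROOT="$HOME/scratch/.kochi"
echo "Kochi workspace: $KOCHI_ROOT"
```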
To follow this tutorial, clone this repository:
```shell
git clone https://github.com/s417-lama/kochi-tutorial
cd kochi-tutorial/
```
In Kochi, workers must be created to run jobs.
To spawn a worker on the local computer, run:
```shell
kochi work -m local -q test -b
```
Options:
- `-m`: machine name (`local` is a special machine name for local execution)
- `-q`: job queue name
- `-b`: whether to block waiting for jobs
By running the above command, a new job queue `test` is created, and the new worker works on the jobs submitted to the queue `test`.
Without the `-b` option, the worker immediately exits when no job is found in the queue.
The `-m` option is mandatory, but you can omit it by setting the default machine to the local computer:
```shell
export KOCHI_DEFAULT_MACHINE=local
```
With the `-m local` option omitted, the command becomes:
```shell
kochi work -q test -b
```
(In the following, we omit the `-m` option.)
You will see the following output by running the above command:
```
Kochi worker 0 started on machine local.
================================================================================
```
You can even spawn multiple workers that work on the same job queue.
Jobs are atomically popped from the job queue (using `flock`).
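Kochi's internal queue format is not shown here, but the core idea of an atomic pop under `flock` can be sketched with a plain file-based queue (all file names below are illustrative):

```shell
# Sketch: each worker takes an exclusive lock on the queue file, reads the
# first line (one job), and deletes it, so no two workers can grab the
# same job even when they pop concurrently.
queue=$(mktemp)
printf 'job0\njob1\njob2\n' > "$queue"

pop_job() {
  # flock(1) serializes the read-and-delete pair across processes
  flock "$queue" sh -c "head -n 1 '$queue'; sed -i '1d' '$queue'"
}

pop_job   # prints job0
pop_job   # prints job1
```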
In Kochi, jobs are first submitted to job queues, and then they are popped by the workers and executed.
To submit a job to the queue named `test`, open a new terminal (without closing the worker's terminal) and run:
```shell
kochi enqueue -q test -n test_job -- echo foo
```
Then, you will see that the job `echo foo` is executed by the worker spawned above:
```
Kochi job test_job (ID=0) started.
--------------------------------------------------------------------------------
foo
```
You can specify a job name with the `-n` option, but it is not required.
By specifying the `-c` option, you can create a copy of the current git repository (`kochi-tutorial`) in the worker's workspace.
To check this, run:
```shell
kochi enqueue -q test -n test_job -c -- bash -c "pwd && ls"
```
You will see that the command is executed in a different location, with a copy of the current workspace:
```
Kochi job test_job (ID=1) started.
--------------------------------------------------------------------------------
$KOCHI_ROOT/workers/local/workspace_0/kochi-tutorial
fib.cpp
fib.yaml
kochi.yaml
README.md
run_fib.bash
```
Uncommitted changes in the current workspace are also copied to the worker's workspace. However, changes made in the worker's workspace are not reflected back in the original workspace.
To check this, run:
```shell
echo hello > foo
kochi enqueue -q test -n test_job -c -- bash -c "echo world >> foo && cat foo"
```
Then, you will see:
```
Kochi job test_job (ID=2) started.
--------------------------------------------------------------------------------
hello
world
```
After the job execution, run locally:
```
$ cat foo
hello
```
This shows that the job execution is isolated from the original workspace.
Note that the current workspace's state is packed into a job at submission time; later changes to the current workspace are not reflected in the job execution. This is particularly useful when you are checking performance results while actively modifying the code through trial and error.
You can submit an interactive job to the queue `test` by running:
```shell
kochi interact -q test -c
```
This will launch a shell on the worker's workspace and connect the local terminal to it. This feature is particularly useful when the worker is launched on a compute node.
To list currently enqueued jobs:
```shell
kochi show jobs
```
To include terminated jobs:
```shell
kochi show jobs -t
```
To increase the maximum number of jobs shown:
```shell
kochi show jobs -t -l 200
```
To show a job's execution status:
```shell
kochi show job <job_id>
```
To show a job's output:
```shell
kochi show log job <job_id>
```
To cancel a job:
```shell
kochi cancel <job_id>
```
To cancel all jobs:
```shell
kochi cancel -a
```
The project-specific configuration is managed in the `kochi.yaml` file.
In this tutorial, we will install MassiveThreads as a dependency of this project and run a task-parallel program built on top of MassiveThreads.
Each project dependency is identified by its name and recipe.
For example, the configuration of a dependency named `massivethreads` looks like:
```yaml
dependencies:
  massivethreads:
    git: git@github.com:massivethreads/massivethreads.git
    recipes:
      - name: release
        branch: master
        envs:
          CFLAGS: -O3 -Wall
        script:
          - gcc --version
          - ./configure --prefix=$KOCHI_INSTALL_PREFIX --disable-myth-ld --disable-myth-dl
          - make -j
          - make install
```
Settings:
- `git`: URL of the git repository
- `recipes`: the list of recipes
- `script`: shell script that installs the dependency into the `$KOCHI_INSTALL_PREFIX` directory
The settings in each recipe override the settings at the parent level.
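For instance, a parent-level `envs` setting could be overridden by a recipe like this (a hypothetical sketch for illustration, not taken from this repository's actual `kochi.yaml`):

```yaml
dependencies:
  massivethreads:
    git: git@github.com:massivethreads/massivethreads.git
    envs:
      CFLAGS: -O3 -Wall    # parent-level setting, inherited by "release"
    recipes:
      - name: release
      - name: debug
        envs:
          CFLAGS: -O0 -g   # overrides the parent-level CFLAGS for this recipe
```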
To install the `release` recipe, run:
```shell
kochi install -d massivethreads:release
```
The format for the dependency option (`-d`) is `<dependency_name>:<recipe_name>`.
To check the installation:
```
$ kochi show installs
Dependency      Recipe   State          Installed Time
--------------  -------  -------------  --------------------------
massivethreads  release  installed      2023-04-18 17:35:43.563509
massivethreads  debug    NOT installed
massivethreads  develop  NOT installed
jemalloc        v5.2.1   NOT installed
```
Let's install another recipe (`debug`), which is configured with `-O0` in `CFLAGS`:
```shell
kochi install -d massivethreads:debug
```
If you want to make some changes to the MassiveThreads code to check its behaviour (e.g., printf debugging), you can mirror the local directory status to the installation of the dependency.
For example, suppose that you clone MassiveThreads into the `../massivethreads` dir and make some changes to the local code.
Then, you can use that state of the project as a dependency of this `kochi-tutorial` project by setting a recipe as follows:
```yaml
recipes:
  - name: develop
    mirror: true
    mirror_dir: ../massivethreads
```
Note that you do not need to commit your changes to git in the `../massivethreads` dir.
The diff from the current HEAD will be copied and applied to the dependency build workspace.
If you install the `massivethreads:develop` recipe and specify it as a dependency for a job, the modified version of the code will run.
This feature is useful for debugging runtime systems or libraries.
Dependencies are not necessarily git repositories.
For example, you can install `jemalloc` from a web tarball:
```shell
kochi install -d jemalloc:v5.2.1
```
See `kochi.yaml` for the specific configuration.
To run a job with the dependency `massivethreads:release`, run:
```shell
kochi enqueue -q test -d massivethreads:release -c ./run_fib.bash
```
`run_fib.bash` will compile the `fib.cpp` program and run it.
The installation path of the dependency is passed to the job via the environment variable `$KOCHI_INSTALL_PREFIX_MASSIVETHREADS`.
The output will look like:
```
Kochi job ANON (ID=3) started.
--------------------------------------------------------------------------------
KOCHI_INSTALL_PREFIX_MASSIVETHREADS=$KOCHI_ROOT/projects/kochi-tutorial/install/local/massivethreads/release
fib(35) = 9227465
Execution Time: 151809458 ns
```
By specifying the `debug` recipe at job submission:
```shell
kochi enqueue -q test -d massivethreads:debug -c ./run_fib.bash
```
you will get a much longer execution time (because of the `-O0` build):
```
Kochi job ANON (ID=4) started.
--------------------------------------------------------------------------------
KOCHI_INSTALL_PREFIX_MASSIVETHREADS=$KOCHI_ROOT/projects/kochi-tutorial/install/local/massivethreads/debug
fib(35) = 9227465
Execution Time: 352041108 ns
```
Additionally, you can specify jemalloc as a dependency:
```shell
kochi enqueue -q test -d massivethreads:release -d jemalloc:v5.2.1 -c ./run_fib.bash
```
The `activate_script` in `kochi.yaml` is executed when the dependency is loaded.
In the case of jemalloc, the `LD_PRELOAD` environment variable is set in this field.
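A rough sketch of such a field (the exact entry in this repository's `kochi.yaml` may differ; the variable name assumes the `$KOCHI_INSTALL_PREFIX_<DEP>` pattern shown above):

```yaml
dependencies:
  jemalloc:
    activate_script:
      - export LD_PRELOAD=$KOCHI_INSTALL_PREFIX_JEMALLOC/lib/libjemalloc.so
```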
The above examples use a shell command or shell script for the job execution, but you can manage your jobs in a more structured way.
`fib.yaml` is an example config file for executing `fib.cpp`.
Fields in the job config file:
- `depends`: dependencies and default recipes on which the benchmark depends
- `default_params`: parameter names and their default values
- `default_name`: default job name (can be overwritten by the `-n` option)
- `default_queue`: default queue name (can be overwritten by the `-q` option)
- `default_duplicates`: default number of job duplications for batch job submission (explained later)
- `build`: how to build the benchmark
- `run`: how to run the benchmark
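To make the field list concrete, a job config using these fields might look roughly like the following (an illustrative sketch with assumed values and layout, not the actual `fib.yaml`):

```yaml
depends:
  - massivethreads:release    # dependency and its default recipe
default_name: fib
default_queue: test
default_duplicates: 1
default_params:
  n_input: 35
  n_workers: 2
build:
  script:
    - echo "Compiling fib..."
    - g++ -O3 fib.cpp -o fib
run:
  script:
    - ./fib $KOCHI_PARAM_n_input $KOCHI_PARAM_n_workers
```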
To execute the benchmark with the default configuration, run:
```shell
kochi enqueue -q test fib.yaml
```
Passing a YAML config file implies the `-c` option.
Then, you will see:
```
Kochi job fib (ID=5) started.
--------------------------------------------------------------------------------
Compiling fib...
fib(35) = 9227465
Execution Time: 149035659 ns
```
You can overwrite the default parameters when submitting a job.
Let's change `n_input` and `n_workers`:
```shell
kochi enqueue -q test fib.yaml n_input=38 n_workers=1
```
Then, `fib(38)` will be calculated by spawning only one MassiveThreads worker:
```
Kochi job fib (ID=6) started.
--------------------------------------------------------------------------------
fib(38) = 39088169
Execution Time: 5052224989 ns
```
The parameters are passed to the job script via `$KOCHI_PARAM_<param_name>` environment variables.
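For example, a job script can read these variables with ordinary shell parameter expansion (the fallback defaults are only an illustration so the script also runs outside Kochi):

```shell
#!/bin/bash
# Each job parameter <name> arrives as the variable KOCHI_PARAM_<name>.
n_input=${KOCHI_PARAM_n_input:-35}      # falls back to 35 outside Kochi
n_workers=${KOCHI_PARAM_n_workers:-2}   # falls back to 2 outside Kochi
echo "computing fib($n_input) with $n_workers workers"
```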
If you increase the number of MassiveThreads workers:
```shell
kochi enqueue -q test fib.yaml n_input=38 n_workers=4
```
you will see a speedup if the machine has multiple cores:
```
Kochi job fib (ID=7) started.
--------------------------------------------------------------------------------
fib(38) = 39088169
Execution Time: 1183553073 ns
```
You may notice that the compilation is skipped for the second and later job executions.
This is because of the `depend_params` field in `fib.yaml`, which indicates which phase (build or run) depends on which parameters.
The benchmark is compiled again only when the `depend_params` of the build phase have changed since the last execution (or the current workspace has been modified).
To check this behaviour, specify the `debug=1` parameter at job submission:
```shell
kochi enqueue -q test fib.yaml debug=1
```
You will see that the benchmark is compiled again:
```
Kochi job fib (ID=8) started.
--------------------------------------------------------------------------------
Compiling fib...
fib(35) = 9227465
Execution Time: 296137883 ns
```
You can also overwrite the dependency recipe. For example:
```shell
kochi enqueue -q test fib.yaml debug=1 -d massivethreads:debug
```
You may want to combine this with interactive job submission:
```shell
kochi interact -q test fib.yaml
```
This will execute the benchmark in a local shell, allowing for further interaction (e.g., running `gdb` on the compiled `fib` executable).
Kochi provides a mechanism to submit multiple jobs with different parameters at a time and record the job execution results. This is called a batch job.
`fib.yaml` includes a batch job configuration named `batch_test`, which this tutorial will execute.
Fields for the batch job config:
- `name`: overwrites the default job name `default_name`
- `queue`: overwrites the default queue name `default_queue`
- `duplicates`: overwrites the default number of job duplications `default_duplicates`. This number of jobs with the same parameters will be redundantly submitted to the queue.
- `params`: overwrites the default parameters. If a list of values is given, all combinations of them will be submitted as jobs.
- `artifacts`: specifies how to save the computational artifacts (e.g., job output and stats)
By default, all changes to the worker's workspace are discarded, but you can save the job output and files by explicitly specifying them as the job artifacts.
The types of artifacts can be:
- `stdout`: the job's standard output is saved as a file at path `dest`
- `stats`: the job's configuration and execution status are saved as a file at path `dest`
- `file`: when the job finishes, the file in the current workspace at path `src` is copied to the artifact path `dest`
Please be careful not to use the same file name for different jobs, as an artifact can be overwritten by another job's artifact.
You can use the following special substitutions to set different file names for each parameter, etc.:
- `${batch_name}`: substituted with the batch name (`batch_test` in this case)
- `${<param>}`: substituted with the parameter value
- `${duplicate}`: substituted with the index (0, 1, 2, ...) of the duplicated jobs
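Putting the artifact types and the substitutions together, an `artifacts` entry could be sketched as follows (field layout assumed for illustration; see `fib.yaml` for the real configuration):

```yaml
artifacts:
  - type: stdout
    dest: ${batch_name}/n${n_input}_w${n_workers}_${duplicate}.log
  - type: stats
    dest: ${batch_name}/n${n_input}_w${n_workers}_${duplicate}.stats
```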
Before submitting your first batch job, you must create a new git branch `kochi_artifacts` and a separate git worktree dir for it.
The following command automatically creates a git worktree in the parent directory of `kochi-tutorial`:
```shell
kochi artifact init ../kochi-tutorial-artifact
```
The artifacts of the batch job are pushed to a separate git branch.
The above command creates an orphan git branch (`kochi_artifacts`) for this purpose.
Then, you can submit a batch job:
```shell
kochi batch fib.yaml batch_test
```
You will see multiple jobs submitted to the queue `fib_batch_test`.
Since `batch_test` has two `n_input` values, three `n_workers` values, and three job duplications, it launches 18 jobs in total.
To execute these jobs, you need to launch a new worker working on the queue:
```shell
kochi work -q fib_batch_test
```
After all jobs are finished, the worker will automatically exit.
Now all artifacts should be saved in Kochi, so let's pull the artifacts:
```shell
cd ../kochi-tutorial-artifact/
kochi artifact sync -m local
```
Note that you need to move to the artifact worktree dir first.
You will see that the `local/fib/` directory in the `kochi_artifacts` branch (checked out in the `kochi-tutorial-artifact` worktree) contains the log output and stats.
This section explains how to run jobs on remote compute nodes via a login node from the local computer.
The basics of job management are the same as for local execution, but some extra machine settings are needed.
Install Kochi on the local computer, the login node, and the compute nodes:
```shell
pip3 install git+https://github.com/s417-lama/kochi.git
```
Kochi assumes that the login nodes and compute nodes all share a file system (e.g., NFS). If not, Kochi cannot be used for that machine.
If the home directory is not shared but some other directories are, then set `KOCHI_ROOT` to a location within the shared directories.
This `kochi-tutorial` project does not need to be cloned on the remote nodes.
`kochi.yaml` contains a machine configuration (the `spica` machine), but it is specific to my environment and should be modified.
Fields for machine configuration:
- `login_host`: hostname of the login node. This node should be reachable by the command `ssh <login_host>`. If not, please configure `~/.ssh/config` accordingly.
- `work_dir` (optional): working directory (`$HOME` by default)
- `kochi_root` (optional): `KOCHI_ROOT` on remote servers (`$HOME/.kochi` by default)
- `alloc_interact_script` (optional): how to submit an interactive job to the system's job manager
- `alloc_script` (optional): how to submit a noninteractive job to the system's job manager
- `load_env_script` (optional): scripts to be executed first on remote servers. This can be set separately for the login node (`on_login_node`) and compute nodes (`on_machine`).
The following is a template for Slurm:
```yaml
machines:
  <machine_name>:
    login_host: <login_host>
    alloc_interact_script: srun -w <compute_node_name> --pty bash
    alloc_script: sbatch --wrap="$KOCHI_WORKER_LAUNCH_CMD" -w <compute_node_name> -o /dev/null -e /dev/null -t ${KOCHI_ALLOC_TIME_LIMIT:-0}
```
Please substitute the following to match your system configuration:
- `<machine_name>`: an arbitrary name to identify the machine
- `<login_host>`: hostname of the login node, reachable from the local computer by the command `ssh <login_host>`
- `<compute_node_name>`: the name of the compute nodes managed by Slurm
The following environment variables are passed to the allocation scripts:
- `KOCHI_WORKER_LAUNCH_CMD`: script to launch a Kochi worker on a compute node
- `KOCHI_ALLOC_TIME_LIMIT`: time limit to be passed to the system's job manager, specified by the `-t` option of the `kochi alloc` command explained later
- `KOCHI_ALLOC_NODE_SPEC`: specification of the nodes to be allocated by the system's job manager (e.g., the number of nodes for MPI), specified by the `-n` option of the `kochi alloc` command explained later
As Kochi frequently accesses the login node via ssh, it is recommended to set the following configuration in `~/.ssh/config`:
```
Host <login_host>
    ...
    ControlPath ~/.ssh/mux-%r@%h:%p
    ControlMaster auto
    ControlPersist yes
```
This enables ssh multiplexing, which persists the connection shared by multiple concurrent ssh sessions.
If you specify `alloc_interact_script` in the machine configuration, you can easily allocate an interactive job on a compute node by running the following command on the local computer:
```shell
kochi alloc_interact -m <machine_name> -q test
```
This will log in to `<login_host>` via ssh and execute `alloc_interact_script` on the login node.
The commands to launch a Kochi worker are automatically sent to the interactive shell.
If a Kochi worker has started on a compute node, the setup succeeded. You can now submit a job from the local computer:
```shell
kochi enqueue -m <machine_name> -q test -- echo foo
```
Then `foo` will be output in the shell on the compute node.
Of course, you can omit `-m <machine_name>` by setting `export KOCHI_DEFAULT_MACHINE=<machine_name>`.
To allocate noninteractive jobs:
```shell
kochi alloc -m <machine_name> -q test -f
```
Specify the `-f` option to follow the standard output generated by the worker.
The above command will immediately exit because the job queue `test` has no jobs.
Please try submitting some jobs before launching the worker, and then allocate the worker again.
To see more options:
```shell
kochi alloc --help
```
You can also check the output of a specific worker:
```shell
kochi show log worker <worker_id>
```
Note that even if you do not provide `alloc_interact_script` or `alloc_script`, you can manually launch workers with the command `kochi work -m <machine_name> -q <queue_name>` on compute nodes.
Kochi is designed to deal with experimental results generated on multiple machines and handle them in a single repository.
In the `kochi_artifacts` branch, batch job artifacts are saved in a separate `<machine_name>` directory for each machine.
Internally, an artifact branch (`kochi_artifacts_<machine_name>`) is created for each machine, and all artifacts are first pushed to that branch.
Then, when `kochi artifact sync -m <machine_name>` is executed, the `kochi_artifacts_<machine_name>` branch is merged into the main `kochi_artifacts` branch.
Conflicts are very unlikely because changes are usually made individually within each `<machine_name>` directory.
You can log in to the compute node where a worker is running by executing the following command:
```shell
kochi inspect -m <machine_name> <worker_id>
```
A Kochi worker launches a user-level ssh daemon at the worker startup time, so you can login to the compute node while the worker is running. This feature is useful when interactive job submission is not allowed by the system's job manager but you want to run interactive commands on the compute nodes.
Some use cases are:
- watching the machine stats during job execution (e.g., with the `top` or `free` commands)
- attaching gdb to the executing program when a job gets stuck
You can also submit an interactive job with the command `kochi interact` to operate on compute nodes.
This differs from `kochi inspect` in that `kochi interact` submits a job, which is executed as a child shell process of the worker's process, whereas `kochi inspect` logs in to the compute node as a separate ssh session.
Thus, for example, if you want to debug a program in the current workspace on a compute node, `kochi interact` is recommended.