New option for specifying a hostfile for `sos run` and `sos execute`
BoPeng opened this issue
It is clear that a hostfile is needed for multi-node execution. Although a host file can be generated automatically by PBS systems and picked up automatically by commands such as `sos execute` and `sos run`, it is necessary to allow this option so that users can specify one manually, to enable multi-node execution of workflows and tasks.
This option should work like this:
- Without it, everything runs locally.
- With it, it should name a host file, similar to the `--hostfile` option of SCOOP, with a similar or identical format. The workers will be created on these hosts.
- Under a cluster system with appropriate environment variables, the hostfile will be picked up automatically, similar to what SCOOP does.
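For reference, a SCOOP-style hostfile lists one host per line, optionally followed by a worker count (e.g. `node1 4`). PBS exports the path of its own node file in the `PBS_NODEFILE` environment variable, listing one hostname per allocated core (so hosts repeat). A minimal sketch of picking that up — the helper name is illustrative, not actual SoS code:

```python
import os

def parse_nodefile(lines):
    """Collapse a PBS-style node file (one hostname per allocated core,
    so hosts repeat) into an ordered {host: ncores} mapping.  The first
    host listed would conventionally become the "master" node."""
    counts = {}
    for line in lines:
        host = line.strip()
        if host:
            counts[host] = counts.get(host, 0) + 1
    return counts

# Under PBS, the node file's path is exported in $PBS_NODEFILE:
if 'PBS_NODEFILE' in os.environ:
    with open(os.environ['PBS_NODEFILE']) as nf:
        print(parse_nodefile(nf))
```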
The problem is that `sos run` does not support `--` options, so we will have to reuse an existing option or find another one.
Once this option is specified, users can use `sos run -j hostfile` to run a workflow on multiple hosts, or

```
%PBS ...
sos run workflow
```

to run an entire workflow on a cluster system.
The same mechanism will be used for the execution of tasks, something like

```
%PBS
sos execute task
```
We could reuse `-j` and say:
- `-j 4` runs 4 processes on the local host
- `-j 4 some_machine:4` runs 4 on localhost and 4 on `some_machine`
- `-j @file`

For the last usage, we do not have to say it is a file; the `@` prefix follows the `fromfile_prefix_chars` syntax of argparse.
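To illustrate the argparse mechanism mentioned above: with `fromfile_prefix_chars='@'`, any argument starting with `@` is replaced by arguments read from the named file, one per line. A self-contained sketch (the `-j` option here is a stand-in, not SoS's actual parser setup):

```python
import argparse
import os
import tempfile

# A parser that expands @file arguments, as the proposed -j @file would rely on.
parser = argparse.ArgumentParser(fromfile_prefix_chars='@')
parser.add_argument('-j', nargs='+', default=['localhost'])

# Write a host list to a temporary file, one argument per line
# (argparse's default per-line splitting).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('node1:4\nnode2:4\n')
    path = f.name

# `@path` is expanded into the file's lines before parsing.
args = parser.parse_args(['-j', '@' + path])
print(args.j)  # ['node1:4', 'node2:4']

os.unlink(path)
```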
This interface reads intuitively. I am not sure I understand 2: when running remote tasks, the `-j` option specifies the resources needed to manage the remote tasks on the machine from which the tasks are submitted. Not sure why 2 is necessary -- don't we always manage them from localhost?
This interface mimics the execution model of SCOOP, namely non-cluster multi-node execution of workflows. It is added because PBS systems generate node files (albeit in different formats) to specify the nodes on which things are executed on a cluster, and `sos` is supposed to read the node files and start workers on the remote nodes.
That means there is no need to differentiate cluster and non-cluster multi-node execution, and we can say:
- `-j 4` is the same as `-j localhost:4`, for local execution with a specified number of workers
- `-j node1 node2 node3` starts workers on `node1`, `node2`, and `node3`, using a default number of workers depending on the cores of each node
- `-j node1:4 node2:4 node3:4` specifies the number of worker processes on each node
- `-j @file` uses the parameters in `file`
- Without `-j`, we use something like `ncores/2` (we have a more complex formula) processes on the local host, and use the nodes specified in the node file provided by the PBS system if we are on a cluster
In all cases, the first node should be the "master", where the master process will be executed.
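The `-j` forms above could be normalized into a single representation. A hypothetical helper sketching that logic — the function name, defaults, and `ncores/2` heuristic are illustrative, not SoS's actual implementation:

```python
import multiprocessing

def parse_j(specs=None):
    """Turn -j values like ['4'], ['node1:4', 'node2'], or None into an
    ordered list of (host, nworkers) pairs.  The first pair is the host
    where the master process would run."""
    ncores = multiprocessing.cpu_count()
    if not specs:
        # No -j: a heuristic number of local workers (simplified here).
        return [('localhost', max(1, ncores // 2))]
    if len(specs) == 1 and specs[0].isdigit():
        # -j 4 is the same as -j localhost:4
        return [('localhost', int(specs[0]))]
    result = []
    for spec in specs:
        host, _, n = spec.partition(':')
        # Without an explicit count, default to the node's cores
        # (approximated here by the local core count).
        result.append((host, int(n) if n else ncores))
    return result
```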
Now, this option will be used by both `sos run` and `sos execute`. Use

```
sos run script -j node1 node2
```

to execute an entire workflow on multiple nodes. We can also put this in a PBS script, in which case the syntax would most likely be

```
%PBS nodes=5:ppn=5
sos run -q none
```

where the `nodefile` is used implicitly.
The `-j` option of `sos execute` should mostly be kept hidden (perhaps for debugging purposes), and be used as

```
%PBS nodes=5:ppn=5
sos execute task_id
```

to execute a single multi-node task, or a single multi-task master task.