New option for specifying a hostfile for `sos run` and `sos execute`
BoPeng opened this issue
It is clear that a hostfile is needed for multi-node execution. Although a host file can be generated automatically by PBS systems and picked up automatically by commands such as `sos execute` and `sos run`, it is necessary to allow this option so that users can specify one manually, to enable multi-node execution of workflows and tasks.
This option should work like this:
- Without it, everything runs locally.
- With it, it should name a host file, similar to the `--hostfile` option of SCOOP, with a similar or identical format. The workers will be created on these hosts.
- Under a cluster system with appropriate environment variables, the hostfile will be picked up automatically, similar to what SCOOP does.
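For reference, a SCOOP-style hostfile lists one host per line, optionally followed by a worker count (e.g. `node1 4`). PBS exports the path of its own node file in the `PBS_NODEFILE` environment variable, listing one hostname per allocated core (so hosts repeat). A minimal sketch of picking that up — the helper name is illustrative, not actual SoS code:

```python
import os

def parse_nodefile(lines):
    """Collapse a PBS-style node file (one hostname per allocated core,
    so hosts repeat) into an ordered {host: ncores} mapping.  The first
    host listed would conventionally become the "master" node."""
    counts = {}
    for line in lines:
        host = line.strip()
        if host:
            counts[host] = counts.get(host, 0) + 1
    return counts

# Under PBS, the node file's path is exported in $PBS_NODEFILE:
if 'PBS_NODEFILE' in os.environ:
    with open(os.environ['PBS_NODEFILE']) as nf:
        print(parse_nodefile(nf))
```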
The problem is that `sos run` does not support `--` options, so we will have to reuse an existing option or find another one.
Once this option is specified, users can use `sos run -j hostfile` to run a workflow on multiple hosts, or

```
%PBS ...
sos run workflow
```

to run an entire workflow on a cluster system.
The same mechanism will be used for the execution of tasks, something like

```
%PBS
sos execute task
```
We could reuse `-j` and say:
- `-j 4` runs 4 processes on the local host
- `-j 4 some_machine:4` runs 4 on localhost and 4 on `some_machine`
- `-j @file`

For the last usage, we do not have to say it is a file; the `@` prefix follows the `fromfile_prefix_chars` syntax of argparse.
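To illustrate the argparse mechanism mentioned above: with `fromfile_prefix_chars='@'`, any argument starting with `@` is replaced by arguments read from the named file, one per line. A self-contained sketch (the `-j` option here is a stand-in, not SoS's actual parser setup):

```python
import argparse
import os
import tempfile

# A parser that expands @file arguments, as the proposed -j @file would rely on.
parser = argparse.ArgumentParser(fromfile_prefix_chars='@')
parser.add_argument('-j', nargs='+', default=['localhost'])

# Write a host list to a temporary file, one argument per line
# (argparse's default per-line splitting).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('node1:4\nnode2:4\n')
    path = f.name

# `@path` is expanded into the file's lines before parsing.
args = parser.parse_args(['-j', '@' + path])
print(args.j)  # ['node1:4', 'node2:4']

os.unlink(path)
```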
This interface reads intuitively. I am not sure I understand 2: when running remote tasks, the `-j` option specifies the resources needed to manage the remote tasks on the machine from which the tasks are submitted. Not sure why 2 is necessary -- don't we always manage them from localhost?
This interface mimics the execution model of SCOOP, namely non-cluster multi-node execution of workflows. It is added because PBS systems generate node files (albeit in different formats) to specify the nodes on which things are executed on a cluster, and `sos` is supposed to read the node files and start workers on the remote nodes.
That means there is no need to differentiate cluster and non-cluster multi-node execution, and we can say:
- `-j 4` is the same as `-j localhost:4`, for local execution with a specified number of workers
- `-j node1 node2 node3` starts workers on `node1`, `node2`, and `node3`, using a default number of workers depending on the cores of each node
- `-j node1:4 node2:4 node3:4` specifies the number of worker processes on each node
- `-j @file` uses the parameters in `file`
- Without `-j`, we use something like `ncores/2` (we have a more complex formula) processes on the local host, and use the nodes specified in the node file provided by the PBS system if we are on a cluster
In all cases, the first node should be the "master", where the master process will be executed.
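The `-j` forms above could be normalized into a single representation. A hypothetical helper sketching that logic — the function name, defaults, and `ncores/2` heuristic are illustrative, not SoS's actual implementation:

```python
import multiprocessing

def parse_j(specs=None):
    """Turn -j values like ['4'], ['node1:4', 'node2'], or None into an
    ordered list of (host, nworkers) pairs.  The first pair is the host
    where the master process would run."""
    ncores = multiprocessing.cpu_count()
    if not specs:
        # No -j: a heuristic number of local workers (simplified here).
        return [('localhost', max(1, ncores // 2))]
    if len(specs) == 1 and specs[0].isdigit():
        # -j 4 is the same as -j localhost:4
        return [('localhost', int(specs[0]))]
    result = []
    for spec in specs:
        host, _, n = spec.partition(':')
        # Without an explicit count, default to the node's cores
        # (approximated here by the local core count).
        result.append((host, int(n) if n else ncores))
    return result
```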
Now, this option will be used by both `sos run` and `sos execute`. Use

```
sos run script -j node1 node2
```

to execute an entire workflow on multiple nodes. We can also put this in a PBS script, in which case the syntax would most likely be

```
%PBS nodes=5:ppn=5
sos run -q none
```

where the `nodefile` is used implicitly.
The `-j` option of `sos execute` should mostly be kept hidden (perhaps for debugging purposes), and be used as

```
%PBS nodes=5:ppn=5
sos execute task_id
```

to execute a single multi-node task, or a single multi-task master task.