vatlab / sos

SoS workflow system for daily data analysis

Home Page:http://vatlab.github.io/sos-docs

Summary of new features for named input/output.

BoPeng opened this issue

commented

Persistent grouping of sos_targets

input: 'a.txt', 'b.txt', group_by=1

will be considered as equivalent to

input: sos_targets('a.txt', 'b.txt', group_by=1)

which creates a sos_targets with two targets and two groups. The groups are accessible through the property groups, which is a list of sos_targets that have no subgroups of their own.

sos_targets will keep its grouping information when it is passed around. That is to say

  • step_input will have groups that are essentially _input for substeps.
  • step_output will contain _output from each substep as its groups.
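As a rough illustration of what persistent grouping means, the following plain-Python sketch models a target list that carries its groups with it. The class GroupedTargets and the helper make_targets are hypothetical stand-ins written for this note, not the actual sos_targets implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GroupedTargets:
    """Toy stand-in for sos_targets: a flat list of targets plus its groups."""
    targets: List[str]
    groups: List["GroupedTargets"] = field(default_factory=list)


def make_targets(*names: str, group_by: int = 0) -> GroupedTargets:
    """Build a GroupedTargets, splitting names into chunks of `group_by` items."""
    result = GroupedTargets(list(names))
    if group_by > 0:
        # Each group is itself a GroupedTargets with no subgroups.
        result.groups = [
            GroupedTargets(list(names[k:k + group_by]))
            for k in range(0, len(names), group_by)
        ]
    return result


step_input = make_targets('a.txt', 'b.txt', group_by=1)
# The grouping travels with the object, so a later consumer can still
# iterate over the groups (the per-substep _input described above).
substep_inputs = [g.targets for g in step_input.groups]
print(substep_inputs)
```

The point of the sketch is only that groups are stored on the object itself rather than recomputed by whoever receives it.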

Keyword arguments in input and output

Keyword arguments are used to specify the sources of targets.

input: name=targets
output: name=targets

Named input and output can be accessed by _input['name'] and _output['name'].

Implementation-wise,

input: name=targets

creates step_input as sos_targets(name=targets), which records name as the source of these targets.

output_from(steps, **kwargs) to get output from other steps

Refers to the output of one or more steps. The parameter can be a step name or a number, and a number refers to a step in the same workflow (output_from(10) used in step_20 is equivalent to output_from('step_10')).

input: output_from('step')
input: output_from(1)
input: output_from([1, 2])

With named input and output, the syntax can be extended to

input: ref=output_from('get_ref')['ref']

A special step name -1 as in

input: output_from(-1)

is reserved for the output of the previous step, and is only valid in a numerically indexed step.
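The way numeric references map to step names can be sketched as follows. The helper resolve_step is hypothetical, and the rule that -1 picks the largest index below the current one is an assumption about what "previous step" means. Only the step_20 to step_10 equivalence comes from the description above.

```python
import re
from typing import List, Union


def resolve_step(ref: Union[int, str], current: str, steps: List[str]) -> str:
    """Translate an output_from()-style reference into a full step name."""
    if isinstance(ref, str):
        return ref  # names are used as given
    match = re.fullmatch(r'(.+)_(\d+)', current)
    if match is None:
        raise ValueError(f'{current!r} has no numeric index, cannot resolve {ref}')
    workflow, index = match.group(1), int(match.group(2))
    if ref == -1:
        # Assumption: "previous" means the largest index below the current one.
        earlier = []
        for name in steps:
            m = re.fullmatch(re.escape(workflow) + r'_(\d+)', name)
            if m and int(m.group(1)) < index:
                earlier.append(int(m.group(1)))
        if not earlier:
            raise ValueError(f'no step precedes {current!r}')
        ref = max(earlier)
    return f'{workflow}_{ref}'


workflow_steps = ['step_10', 'step_15', 'step_20']
print(resolve_step(10, 'step_20', workflow_steps))
print(resolve_step(-1, 'step_20', workflow_steps))
```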

Options group_by, paired_with, pattern, group_with, and for_each can be used to regroup or attach variables to the output. For example, group_by can be used to regroup the retrieved sos_targets,

input: output_from(10, group_by='all')

named_output('name', **kwargs) for data flow without step name

named_output('ref') in the following example refers to any step with ref in named output,

[A]
output: ref=targets

[B]
input: named_output('ref')

which has the same effect as output_from('A')['ref'] but does not require the step name to be specified.
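One way to picture the lookup is a search over the named outputs that each step declares. The function find_provider, the registry layout, and the file name below are hypothetical. How SoS actually resolves a name, including what happens when several steps offer it, is not specified here.

```python
from typing import Dict, List, Tuple

# Hypothetical registry: step name -> {output name -> targets}
StepOutputs = Dict[str, Dict[str, List[str]]]


def find_provider(name: str, step_outputs: StepOutputs) -> Tuple[str, List[str]]:
    """Return the first step that declares `name` among its named outputs."""
    for step, outputs in step_outputs.items():
        if name in outputs:
            return step, outputs[name]
    raise KeyError(f'no step provides a named output {name!r}')


declared = {
    'A': {'ref': ['ref.fa']},  # corresponds to `output: ref=targets` in step A
    'C': {'bam': ['s1.bam']},
}
step, targets = find_provider('ref', declared)
print(step, targets)
```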

As with output_from, the parameters group_by, paired_with, pattern, group_with, and for_each can be used to regroup the retrieved targets or attach variables to them.

Merging of multiple sos_targets

Multiple sos_targets can be specified in the input statement, either explicitly with sos_targets or implicitly with output_from and named_output. In this case, the targets and groups from all of them are merged. sos_targets objects with different numbers of groups can be merged only if one of them has no group information or has a single group containing all its targets, in which case that single group is replicated to match the number of groups of the other object before merging.

For example,

input: 'a.txt', 'b.txt', sos_targets('c.txt', 'd.txt', group_by=1)

will create a sos_targets with four targets 'a.txt', 'b.txt', 'c.txt', 'd.txt', and two groups

'a.txt', 'b.txt', 'c.txt'
'a.txt', 'b.txt', 'd.txt'
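The merging rule can be sketched in plain Python. Here groups are modeled as lists of lists and None stands for "no group information". The function merge is written for illustration only and is not taken from the SoS code base.

```python
from typing import List, Optional, Tuple

Groups = Optional[List[List[str]]]  # None means "no group information"


def merge(a_targets: List[str], a_groups: Groups,
          b_targets: List[str], b_groups: Groups) -> Tuple[List[str], Groups]:
    """Merge two target lists and their groups, replicating a lone group."""
    targets = a_targets + b_targets
    if a_groups is None and b_groups is None:
        return targets, None

    n = max(len(g) for g in (a_groups, b_groups) if g is not None)

    def expand(ts: List[str], gs: Groups) -> List[List[str]]:
        # No groups, or one group holding everything: replicate n times.
        if gs is None or (len(gs) == 1 and gs[0] == ts):
            return [list(ts) for _ in range(n)]
        return gs

    ga, gb = expand(a_targets, a_groups), expand(b_targets, b_groups)
    if len(ga) != len(gb):
        raise ValueError(f'cannot merge {len(ga)} groups with {len(gb)} groups')
    return targets, [x + y for x, y in zip(ga, gb)]


merged_targets, merged_groups = merge(
    ['a.txt', 'b.txt'], None,
    ['c.txt', 'd.txt'], [['c.txt'], ['d.txt']],
)
print(merged_groups)
```

With these inputs the sketch reproduces the two groups listed above.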

The same rule applies to sos_targets created by output_from() or output_from(group_by). However, if a global group_by option is present, all individual groups will be overridden. That is to say,

input: 'a.txt', 'b.txt', output_from(10), group_by=1

will regroup all targets by 1, regardless of original grouping information from output_from(10).

Setting and getting attributes of sos targets

Two new functions, BaseTarget.set() and BaseTarget.get(), are added.

A dictionary is now associated with each BaseTarget and can be accessed with the .set() and .get() functions, or as attributes of the target. .set() is usually called automatically by the parameters paired_with and group_with, but it can also be used directly. With

a = file_target('a.txt')
a.set('name', 'a')

it is usually easier to use

a.name

instead of

a.get('name')

but a.get('name', default=None) will return a default value instead of raising an AttributeError if name does not exist, which can be the safer choice when the attribute may be missing.
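The behavior described above can be mimicked with a small class. ToyTarget is a hypothetical stand-in, not BaseTarget itself, and the choice to return None from get() when no default is given is an assumption of this sketch.

```python
from typing import Any, Dict


class ToyTarget:
    """Toy target with a per-object dictionary exposed via set/get/attributes."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._dict: Dict[str, Any] = {}

    def set(self, key: str, value: Any) -> "ToyTarget":
        self._dict[key] = value
        return self

    def get(self, key: str, default: Any = None) -> Any:
        # Never raises: falls back to `default` when the key is absent.
        return self._dict.get(key, default)

    def __getattr__(self, key: str) -> Any:
        # Called only when normal attribute lookup fails.
        try:
            return self.__dict__['_dict'][key]
        except KeyError:
            raise AttributeError(key) from None


a = ToyTarget('a.txt')
a.set('name', 'a')
print(a.name, a.get('name'), a.get('missing', default=0))
```

Attribute access is the shorter form, while get() with a default avoids the AttributeError for keys that were never set.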

Changes to parameters paired_with, group_with and for_each

In addition to variables set to the global namespace, the paired values are written to _input as target or group properties. That is to say, with

sample = ['A',  'B']
files = ['a1', 'a2', 'a3', 'a4']
input: 'a1.txt', 'a2.txt', 'b1.txt', 'b2.txt', group_by=2, 
    paired_with='files', group_with='sample', for_each=dict(i=range(5))

you can access _sample, _files, and i both directly and as attributes of _input or its targets:

_input[0]._files
_input._sample
_input.i

So that

sample = ['A',  'B']
files = ['a1', 'a2', 'a3', 'a4']
input: 'a1.txt', 'a2.txt', 'b1.txt', 'b2.txt', group_by=2, 
    paired_with='files', group_with='sample', for_each=dict(i=range(5))

print(f'_input={_input}, _files={_files}, _sample={_sample}, i={i}')
print(f'_input[0]._files={_input[0]._files}, _input._sample={_input._sample}, _input.i={_input.i}')

would produce:

_input=a1.txt a2.txt, _files=['a1', 'a2'], _sample=A, i=0
_input[0]._files=a1, _input._sample=A, _input.i=0
_input=b1.txt b2.txt, _files=['a3', 'a4'], _sample=B, i=0
_input[0]._files=a3, _input._sample=B, _input.i=0
...
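To make the roles of the three parameters concrete, the following plain-Python simulation regenerates the lines above without using SoS. The generator substeps is hypothetical, and the iteration order (for_each values outermost, groups innermost) is inferred from the output shown rather than from the implementation.

```python
from typing import Any, Dict, Iterator, List

sample = ['A', 'B']
files = ['a1', 'a2', 'a3', 'a4']
inputs = ['a1.txt', 'a2.txt', 'b1.txt', 'b2.txt']
group_size = 2  # group_by=2


def substeps() -> Iterator[Dict[str, Any]]:
    """Yield the variables each substep would see in this toy model."""
    groups = [inputs[k:k + group_size] for k in range(0, len(inputs), group_size)]
    for i in range(5):  # for_each=dict(i=range(5))
        for g, group in enumerate(groups):
            start = g * group_size
            yield {
                '_input': group,
                # paired_with: one value per target, sliced like the targets
                '_files': files[start:start + group_size],
                # group_with: one value per group
                '_sample': sample[g],
                'i': i,
            }


lines: List[str] = [
    f"_input={' '.join(s['_input'])}, _files={s['_files']}, "
    f"_sample={s['_sample']}, i={s['i']}"
    for s in substeps()
]
print('\n'.join(lines[:2]))
```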
commented

Random thoughts on names:

output_from_step(1)
output_with_name('ref')
commented

3a4c41f has the first test case that fails due to incompatibility.

commented

An example for 'persistent' variables is

[10]
input: for_each=dict(i=range(5))
output: f'a_{i}.txt'
_output.touch()

[20]
print(i)

which produces

0
1
2
3
4

because i is first attached to the _output of each substep of step 10 and then carried in the groups of step_output of step 10, which becomes the step_input of step 20. There the groups are unpacked and the group variables are populated into the step namespace.
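The chain can be sketched with two ordinary functions. Both functions and the dictionary layout of a group are hypothetical, chosen only to show variables riding along with output groups.

```python
from typing import Any, Dict, List

Group = Dict[str, Any]


def step_10() -> List[Group]:
    """Each substep's output group remembers the for_each variable it ran with."""
    return [{'targets': [f'a_{i}.txt'], 'vars': {'i': i}} for i in range(5)]


def step_20(step_input_groups: List[Group]) -> List[int]:
    """Unpack each incoming group and read `i` from its variables."""
    seen = []
    for group in step_input_groups:
        namespace = dict(group['vars'])  # group variables -> step namespace
        seen.append(namespace['i'])
    return seen


printed = step_20(step_10())
print(printed)
```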