Generative Nextflow
A Proof-of-Concept for Dynamic Hierarchical Workflows via Nextflow
The Concept
A workflow built with Nextflow is typically made up of a set of pre-defined processes and an explicitly defined execution order. But what if the workflow steps depend on the data? Can we (or should we) make dynamic workflows that are generated based on the structure of the data?
Imagine a dataset of samples with a hierarchical structure:
    A
   / \
  /   B
 /   / \
C   D   E
We need to run some arbitrary process for each node in the tree; however, the calculation for each node also depends on the output of its parent. In this case, A needs to be processed first, then B and C can be computed in parallel. Once B is finished, D and E can be computed in parallel, and so on. If we needed to do this repeatedly for the same data and structure, we might define a workflow once. But what if we want to apply the same method to a dataset with a different structure? We would want to automate the generation of these workflows.
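To make the dependency rule concrete, here is a minimal sketch in plain Python (no Nextflow, no actual parallelism) of computing each node from its parent's output; the names are illustrative only.

# Toy version of the tree above: A -> (B, C), B -> (D, E)
tree = {"A": ["B", "C"], "B": ["D", "E"]}

def compute(node, parent_result=None):
    # Stand-in for an arbitrary per-node process that needs the parent's output
    result = node if parent_result is None else parent_result + "_" + node
    print(result)
    for child in tree.get(node, []):  # siblings are independent, so they could run in parallel
        compute(child, result)

compute("A")  # prints A, A_B, A_B_D, A_B_E, A_C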
Automating the workflow generation typically pays off when:
- Processing many datasets with different hierarchies
- Large hierarchies with many branch points
- Compute requires high-performance architectures
- Parallelization becomes important
Useful Applications
In genomics, data often has a hierarchical structure.
For example, consider the clustering of immune cells from scRNA-seq data of PBMCs. If modeling this data with a shared-learning approach, modeling each cell type might depend on information from higher up the hierarchy. Therefore it would make sense to build a workflow that starts at the root of the tree. But what if we want to apply the same modeling approach to a hierarchy of tissue-specific RNA-seq samples from GTEx? We could simply re-generate the workflow with the new structure and some minor adjustments to the file paths and parameters.
While this is a niche example, this idea can be applied to any dynamic hierarchical workflow. Here is a lightweight proof-of-concept in Python for implementing the described idea.
A Basic Example
    A
   / \
  /   B
 /   / \
C   D   E
Generating workflows dynamically requires two things: (1) a file defining the hierarchy of processes and (2) a representation of the individual workflow components that are pieced together. Workflow components can be simple multi-line strings that are not modified (e.g. the workflow header) or modules with placeholders (e.g. workflow processes). Modules are essentially reusable templates for Nextflow processes.
Hierarchy Definition
The hierarchy file defines the workflow structure (i.e. which processes depend on the output of which) as well as the processes themselves.
example.csv
process,module,params
-> A,echo,word=cat
A -> B,join,word=bird
A -> C,join,word=horse
B -> D,join,word=lion
B -> E,join,word=fish
- Outputs can be used as inputs to one or more processes
- Parent nodes can have multiple children
- Child nodes have a single parent
- The same module can be reused across nodes, or different nodes can use different modules
- Modules can take one or more keyword arguments
Based on the above example, we need to process A with the echo module, which takes the keyword argument word, while B, C, D, and E are processed with the join module, which takes the same argument. Computing A depends on nothing, while computing B depends on A, and so on.
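Before reaching for gnf's helpers, it may help to see how a row of this file decomposes; this is a hedged sketch, since the actual parsing inside gnf.read_data and gnf.build_tree isn't shown here.

import csv

def parse_row(row):
    # '-> A' has an empty parent (the root); 'A -> B' has parent A, child B
    parent, child = (s.strip() for s in row["process"].split("->"))
    # Params are key=value pairs, pipe-separated when there are several
    kwargs = dict(p.split("=") for p in row["params"].split("|"))
    kwargs.update(parent=parent or None, child=child)
    return row["module"], kwargs

with open("example.csv") as f:
    for row in csv.DictReader(f):
        print(parse_row(row))
# ('echo', {'word': 'cat', 'parent': None, 'child': 'A'})
# ('join', {'word': 'bird', 'parent': 'A', 'child': 'B'})
# ...

In practice, gnf does this for us: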
import gnf
df = gnf.read_data("example.csv")
tree = gnf.build_tree(df)
Here we read in the data and create a tree to represent the workflow. The tree starts at the root and can be traversed.
print(tree.children[0]) # The first child of the root
# Generative_Node('->A->B',
#     label='B',
#     module='join',
#     params='word=bird',
#     kwargs={'word': 'bird', 'child': 'B', 'parent': 'A'})
# View tree
gnf.print_tree(tree)
# A [echo]
#   word: cat
# |
# |-- B [join]
# |     word: bird
# |   |
# |   |-- D [join]
# |   |     word: lion
# |   |
# |   +-- E [join]
# |         word: fish
# |
# +-- C [join]
#       word: horse
Here the dependency structure is defined, along with the module required for each process and the keyword arguments that module may take. All of this information is stored as properties of the node object. As we traverse the tree to build the workflow, we can keep track of the dependencies as well as the properties of each process.
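The node objects themselves can be simple. Here is a minimal sketch of what a node like Generative_Node might hold; the real class in gnf may differ.

class Generative_Node:
    def __init__(self, label, module, params, parent=None):
        self.label = label      # e.g. 'B'
        self.module = module    # name of the Modules method to call, e.g. 'join'
        self.params = params    # raw params string, e.g. 'word=bird'
        self.parent = parent    # parent node, or None at the root
        self.children = []
        if parent:
            parent.children.append(self)
        # kwargs drive the template filling: user params plus parent/child labels
        self.kwargs = dict(p.split("=") for p in params.split("|"))
        self.kwargs.update(child=label, parent=parent.label if parent else None)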
Module Definition
Workflow modules are templates of Nextflow processes that are populated with the node properties (kwargs).
class Modules(gnf.Components):
    @gnf.pretty_format
    def echo(self, **kwargs):
        return('''\
process {child} {{
    output:
    stdout into {child}
    """
    printf {word}
    """
}}
'''.format(**kwargs))

    @gnf.pretty_format
    def join(self, **kwargs):
        return('''\
process {child} {{
    input:
    val x from {parent}
    output:
    stdout into {child}
    """
    printf "${{x}}_{word}"
    """
}}
'''.format(**kwargs))
Modules take any number of keyword arguments through Python's built-in string formatting. When traversing the tree, for each node we look up the method corresponding to its module, pass the parent, child, and remaining keyword arguments to it, and write the filled-in template to the workflow file.
Each module is defined as a method of a larger Modules class, which inherits from a Components class holding basic snippets that common workflows and configuration files need. Both can be modified, but you'll mostly only need to define the Modules class.
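As a quick sanity check, a single template can be filled by hand with the kwargs shown earlier for node B; the output matches process B in the generated workflow below.

print(Modules().join(child='B', parent='A', word='bird'))
# process B {
#     input:
#     val x from A
#     ...
# }

Generating the complete workflow file then just repeats this for every node: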
m = Modules()
with open('workflow.nf', 'w') as workflow:
    # Can put arbitrary code before workflow processes
    workflow.write(m.workflow_shebang())
    # Traverse the tree with a level-order strategy
    for node in gnf.traverse_tree(tree):
        # Find the module specified by the node
        module = getattr(m, node.module)
        # Fill in the module template with the kwargs
        workflow.write(module(**node.kwargs))
    # Can put arbitrary code after the workflow processes
    workflow.write(m.workflow_view())
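gnf.traverse_tree isn't shown here, but a level-order (breadth-first) generator like the following would produce the required ordering, guaranteeing each process is written after the process it reads from (a hedged sketch):

from collections import deque

def traverse_tree(root):
    # Yield nodes level by level so parents always precede their children
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)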
The result is a workflow file:
#!/usr/bin/env nextflow

process A {
    output:
    stdout into A
    """
    printf cat
    """
}

process B {
    input:
    val x from A
    output:
    stdout into B
    """
    printf "${x}_bird"
    """
}

process C {
    input:
    val x from A
    output:
    stdout into C
    """
    printf "${x}_horse"
    """
}

process D {
    input:
    val x from B
    output:
    stdout into D
    """
    printf "${x}_lion"
    """
}

process E {
    input:
    val x from B
    output:
    stdout into E
    """
    printf "${x}_fish"
    """
}

A.view { it }
B.view { it }
C.view { it }
D.view { it }
E.view { it }
Running it looks like this:
# cat
# cat_bird
# cat_horse
# cat_bird_fish
# cat_bird_lion
This is a very basic example, but this simple syntax scales to very large workflows with little additional code. If you get the point by now, feel free to stop reading. Below, we scale up to a slightly larger toy example that calls scripts, and then to a larger example showing why this is useful for genomics.
Slightly More Complicated
example.csv
process,module,params
-> A,multiply,value=5|multiplier=5
A -> B,add,value=10
A -> C,add,value=2
B -> D,add,value=4
B -> E,add,value=9
Here the modules actually call scripts and save their output to files, which is more realistic.
class Modules(gnf.Components):
    @gnf.pretty_format
    def multiply(self, **kwargs):
        return('''\
process {child} {{
    publishDir "$params.output/values"
    output:
    file '*.txt' into {child}
    script:
    """
    python $PWD/scripts/multiply.py {value} {multiplier} {child}.txt
    """
}}
'''.format(**kwargs))

    @gnf.pretty_format
    def add(self, **kwargs):
        return('''\
process {child} {{
    publishDir "$params.output/values"
    input:
    file prior from {parent}
    output:
    file '*.txt' into {child}
    script:
    """
    prior=\\$(cat $prior)
    python $PWD/scripts/add.py \\$prior {value} {child}.txt
    """
}}
'''.format(**kwargs))
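The helper scripts aren't part of the generator and aren't shown here; something as small as the following would satisfy the command lines in the templates (hypothetical sketches):

# scripts/multiply.py -- usage: python multiply.py VALUE MULTIPLIER OUT.txt
import sys

value, multiplier, outfile = sys.argv[1], sys.argv[2], sys.argv[3]
with open(outfile, 'w') as out:
    out.write(str(float(value) * float(multiplier)))

# scripts/add.py -- usage: python add.py PRIOR VALUE OUT.txt
import sys

prior, value, outfile = sys.argv[1], sys.argv[2], sys.argv[3]
with open(outfile, 'w') as out:
    out.write(str(float(prior) + float(value)))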
We’ll also generate a configuration file and add some more components to the workflow.
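The Components snippets aren't shown in this post either; as a rough idea of their shape, config_params might emit a default params block (hypothetical, though the generated processes do reference params.output):

class Components:
    def config_params(self):
        # Hypothetical default: the output directory used by publishDir
        return('''\
params {
    output = "results"
}
''')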
m = Modules()
with open('workflow.config', 'w') as config:
    config.write(m.config_manifest())
    config.write(m.config_profiles())
    config.write(m.config_params())
with open('workflow.nf', 'w') as workflow:
    workflow.write(m.workflow_shebang())
    workflow.write(m.workflow_version())
    workflow.write(m.workflow_header(tree))
    for node in gnf.traverse_tree(tree):
        module = getattr(m, node.module)
        workflow.write(module(**node.kwargs))
The beginning of the workflow looks like this:
#!/usr/bin/env nextflow

VERSION="1.0"

log.info """
W O R K F L O W ~ Configuration
===============================
output : ${params.output}
-------------------------------
Hierarchy
A
|-- B
|   |-- D
|   +-- E
+-- C
"""

process A {
    publishDir "$params.output/values"
    output:
    file '*.txt' into A
    script:
    """
    python $PWD/scripts/multiply.py 5 5 A.txt
    """
}

process B {
    publishDir "$params.output/values"
    input:
    file prior from A
    output:
    file '*.txt' into B
    script:
    """
    prior=\$(cat $prior)
    python $PWD/scripts/add.py \$prior 10 B.txt
    """
}
A Real Example
In this example we'll use the same hierarchy, except nodes now represent groups of samples that must be processed together, similar to the genomics application mentioned earlier. We'll apply a model to the groups (leaf nodes) that can optionally take prior information from models of group supersets in the hierarchy (internal and root nodes).
process,module,params
-> A,model,cores=1|iter=10000|data=/Users/anthonyfederico/dat.rds|groups=C_D_E
A -> B,model_prior,cores=1|iter=10000|data=/Users/anthonyfederico/dat.rds|groups=D_E
A -> C,model_prior,cores=1|iter=10000|data=/Users/anthonyfederico/dat.rds|groups=C
B -> D,model_prior,cores=1|iter=10000|data=/Users/anthonyfederico/dat.rds|groups=D
B -> E,model_prior,cores=1|iter=10000|data=/Users/anthonyfederico/dat.rds|groups=E
import gnf
df = gnf.read_data("example.csv")
tree = gnf.build_tree(df)
gnf.print_tree(tree)
# A [model]
#   cores: 1
#   iter: 10000
#   data: /Users/anthonyfederico/dat.rds
#   groups: C_D_E
# |
# |-- B [model_prior]
# |     cores: 1
# |     iter: 10000
# |     data: /Users/anthonyfederico/dat.rds
# |     groups: D_E
# |   |
# |   |-- D [model_prior]
# |   |     cores: 1
# |   |     iter: 10000
# |   |     data: /Users/anthonyfederico/dat.rds
# |   |     groups: D
# |   |
# |   +-- E [model_prior]
# |         cores: 1
# |         iter: 10000
# |         data: /Users/anthonyfederico/dat.rds
# |         groups: E
# |
# +-- C [model_prior]
#       cores: 1
#       iter: 10000
#       data: /Users/anthonyfederico/dat.rds
#       groups: C
class Modules(gnf.Components):
    @gnf.pretty_format
    def model(self, **kwargs):
        return('''\
process {child} {{
    cache "deep"
    publishDir "$params.output/models", pattern: "*.rds", mode: "copy"
    publishDir "$params.output/logs", pattern: "*.log", mode: "copy"
    output:
    file '*.rds' into {child}_rds
    file '*.log'
    script:
    """
    Rscript $PWD/scripts/model.R \\\\
        --data {data} \\\\
        --name {child} \\\\
        --groups {groups} \\\\
        --cores {cores} \\\\
        --iter {iter}
    """
}}
'''.format(**kwargs))

    @gnf.pretty_format
    def model_prior(self, **kwargs):
        return('''\
process {child} {{
    cache "deep"
    publishDir "$params.output/models", pattern: "*.rds", mode: "copy"
    publishDir "$params.output/logs", pattern: "*.log", mode: "copy"
    input:
    file prior from {parent}_rds
    output:
    file '*.rds' into {child}_rds
    file '*.log'
    script:
    """
    Rscript $PWD/scripts/model.R \\\\
        --data {data} \\\\
        --prior ${{prior}} \\\\
        --name {child} \\\\
        --groups {groups} \\\\
        --cores {cores} \\\\
        --iter {iter}
    """
}}
'''.format(**kwargs))
m = Modules()
with open('workflow.config', 'w') as config:
    config.write(m.config_manifest())
    config.write(m.config_profiles())
    config.write(m.config_params())
with open('workflow.nf', 'w') as workflow:
    workflow.write(m.workflow_shebang())
    workflow.write(m.workflow_version())
    workflow.write(m.workflow_header(tree))
    for node in gnf.traverse_tree(tree):
        module = getattr(m, node.module)
        workflow.write(module(**node.kwargs))
    workflow.write(m.workflow_complete())
The beginning of the workflow looks like this:
#!/usr/bin/env nextflow

VERSION="1.0"

log.info """
W O R K F L O W ~ Configuration
===============================
output : ${params.output}
-------------------------------
Hierarchy
A
|-- B
|   |-- D
|   +-- E
+-- C
"""

process A {
    cache "deep"
    publishDir "$params.output/models", pattern: "*.rds", mode: "copy"
    publishDir "$params.output/logs", pattern: "*.log", mode: "copy"
    output:
    file '*.rds' into A_rds
    file '*.log'
    script:
    """
    Rscript $PWD/scripts/model.R \\
        --data /Users/anthonyfederico/dat.rds \\
        --name A \\
        --groups C_D_E \\
        --cores 1 \\
        --iter 10000
    """
}

process B {
    cache "deep"
    publishDir "$params.output/models", pattern: "*.rds", mode: "copy"
    publishDir "$params.output/logs", pattern: "*.log", mode: "copy"
    input:
    file prior from A_rds
    output:
    file '*.rds' into B_rds
    file '*.log'
    script:
    """
    Rscript $PWD/scripts/model.R \\
        --data /Users/anthonyfederico/dat.rds \\
        --prior ${prior} \\
        --name B \\
        --groups D_E \\
        --cores 1 \\
        --iter 10000
    """
}
Contributing
If you find this prototype useful or have thoughts or suggestions, please let me know. Any feedback is welcome.