clicumu / doepipeline

A python package for optimizing processing pipelines using statistical design of experiments (DoE).

Break out or change the scheduler part

druvus opened this issue:

We discussed earlier finding a better way of submitting/distributing jobs. I have used snakemake quite a lot, and I think using its API instead could be an option for a more stable solution. It supports several schedulers plus Kubernetes.

I've included a small example I found below.

#!/usr/bin/env python3
"""
rule all:
    input:
        "reads.counts"
rule unpack_fastq:
    '''Unpack a FASTQ file'''
    output: "{file}.fastq"
    input: "{file}.fastq.gz"
    resources: time=60, mem=100
    params: "{file}.params"
    threads: 8
    log: 'unpack.log'
    shell:
        '''zcat {input} > {output}
        echo finished 1>&2 > {log}
        '''
rule count:
    '''Count reads in a FASTQ file'''
    output: counts="{file}.counts"
    input: fastq="{file}.fastq"
    run:
        n = 0
        with open(input.fastq) as f:
            for _ in f:
                n += 1
        with open(output.counts, 'w') as f:
            print(n / 4, file=f)
"""

In pure Python, this is equivalent to the following code (the output of snakemake --print-compilation):

workflow.include("pipeline.conf")

shell.prefix("set -euo pipefail;")

@workflow.rule(name='all', lineno=6, snakefile='.../Snakefile')
@workflow.input("reads.counts")
@workflow.norun()
@workflow.run
def __all(input, output, params, wildcards, threads, resources, log, version):
    pass


@workflow.rule(name='unpack_fastq', lineno=17, snakefile='.../Snakefile')
@workflow.docstring("""Unpack a FASTQ file""")
@workflow.output("{file}.fastq")
@workflow.input("{file}.fastq.gz")

@workflow.resources(time=60, mem=100)
@workflow.params("{file}.params")
@workflow.threads(8)
@workflow.log('unpack.log')
@workflow.shellcmd(
    """zcat {input} > {output}
        echo finished 1>&2 > {log}
        """
)
@workflow.run
def __unpack_fastq(input, output, params, wildcards, threads, resources, log, version):
    shell("""zcat {input} > {output}
        echo finished 1>&2 > {log}
        """
)


@workflow.rule(name='count', lineno=52, snakefile='.../Snakefile')
@workflow.docstring("""Count reads in a FASTQ file""")
@workflow.output(counts = "{file}.counts")
@workflow.input(fastq = "{file}.fastq")
@workflow.run
def __count(input, output, params, wildcards, threads, resources, log, version):
    n = 0
    with open(input.fastq) as f:
        for _ in f:
            n += 1
    with open(output.counts, 'w') as f:
        print(n / 4, file=f)


### End of output from snakemake --print-compilation


workflow.check()
print("Dry run first ...")
workflow.execute(dryrun=True, updated_files=[])
print("And now for real")
workflow.execute(dryrun=False, updated_files=[], resources=dict())
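
For the scheduler part specifically, the same workflow could also be driven through snakemake's public API entry point rather than the low-level workflow object. A rough, untested sketch (argument names are from memory and may differ between snakemake versions; the sbatch template is only an illustration):

# Hedged sketch: running a generated Snakefile via snakemake's public API.
# Argument names follow the snakemake.snakemake() entry point as I remember it
# and may differ between snakemake versions; the sbatch template is illustrative.
import snakemake

success = snakemake.snakemake(
    "Snakefile",
    cores=8,                                      # local cores
    nodes=10,                                     # max concurrent cluster jobs
    cluster="sbatch --cpus-per-task {threads}",   # snakemake also has DRMAA/Kubernetes options
    dryrun=False,
)
if not success:
    raise RuntimeError("snakemake workflow failed")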

Another option that I have used earlier is ipython-cluster-helper, but there are probably other options available.
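
For reference, it is used roughly like this (the scheduler and queue names are placeholders for whatever the local cluster provides, and run_step is a made-up helper, not doepipeline code):

# Rough sketch of ipython-cluster-helper usage; scheduler and queue names are
# placeholders, and run_step is a hypothetical helper.
from cluster_helper.cluster import cluster_view

def run_step(command):
    # Run one pipeline step as a shell command on a worker engine.
    import subprocess
    return subprocess.call(command, shell=True)

with cluster_view(scheduler="slurm", queue="core", num_jobs=4) as view:
    exit_codes = view.map(run_step, ["bash step_A.sh", "bash step_B.sh"])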

My suggestion is to sub-class or change the PipelineGenerator class. That class parses the config files and should contain all the information needed to output a snakemake workflow. Currently, new_pipeline_collection outputs a dict of script strings, so an alternative to that function should be enough. The workflow DAGs from our work will be very simple.
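
As a rough sketch of what that alternative could return (the steps argument, an ordered mapping of step name to shell command, is my assumption about what the class holds after parsing the config, not existing doepipeline code):

# Hedged sketch only: render a linear chain of steps as a Snakefile string
# instead of a dict of script strings. The `steps` structure is an assumption
# about what PipelineGenerator holds after config parsing.
def new_snakemake_workflow(steps):
    """Render an ordered mapping of step name -> shell command as a Snakefile."""
    names = list(steps)
    rules = ['rule all:\n    input: "{}.done"\n'.format(names[-1])]
    previous = None
    for name, command in steps.items():
        input_line = '    input: "{}.done"\n'.format(previous) if previous else ""
        rules.append(
            'rule {name}:\n'
            '{input_line}'
            '    output: "{name}.done"\n'
            "    shell: '{command} && touch {name}.done'\n".format(
                name=name, input_line=input_line, command=command
            )
        )
        previous = name
    return "\n".join(rules)

Since the DAGs are simple chains, hanging each step on the previous step's marker file should cover our cases.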

A simple executor class is also needed to start the pipeline and format the output into something the optimizer can use.
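
Something along these lines should be sufficient; the class name and the results.csv convention are made up here, and the command-line flags are the standard snakemake options at the time of writing:

# Placeholder sketch of an executor that runs a generated Snakefile and hands
# the response values back to the optimizer. The results.csv convention is an
# assumption, not existing doepipeline code.
import csv
import subprocess

class SnakemakeExecutor:

    def __init__(self, jobs=1, cluster_command=None):
        self.jobs = jobs
        self.cluster_command = cluster_command  # e.g. an sbatch template, or None for local runs

    def run(self, snakefile, workdir):
        cmd = [
            "snakemake",
            "--snakefile", snakefile,
            "--directory", workdir,
            "--jobs", str(self.jobs),
        ]
        if self.cluster_command:
            cmd += ["--cluster", self.cluster_command]
        subprocess.check_call(cmd)
        return self._read_results(workdir)

    def _read_results(self, workdir):
        # Assumed convention: each run writes its response value(s) to results.csv.
        with open("{}/results.csv".format(workdir)) as handle:
            return list(csv.DictReader(handle))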