nextflow-io / nextflow

A DSL for data-driven computational pipelines

Home Page: http://nextflow.io


Syntax enhancement aka DSL-2

pditommaso opened this issue

This is a request for comments on the implementation of the modules feature for Nextflow.

This feature allows NF processes to be defined in the main script or in a separate library file, and then invoked, one or more times, like any other routine, passing the required input channels as arguments.

Process definition

The syntax for defining a process is nearly identical to the usual one; it only requires the use of processDef instead of process and the omission of the from/into declarations. For example:

processDef index {
    tag "$transcriptome_file.simpleName"

    input:
    file transcriptome 

    output:
    file 'index' 

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}

The semantics and supported features remain identical to the current process. See a complete example here.

Process invocation

Once a process is defined it can be invoked like any other function in the pipeline script. For example:

transcriptome = file(params.transcriptome)
index(transcriptome)

Since index defines an output channel, its return value can be assigned to a channel variable that can be used as usual, e.g.:

transcriptome = file(params.transcriptome)
index_ch = index(transcriptome)
index_ch.println()

If the process produces two (or more) output channels, the multiple-assignment syntax can be used to get a reference to each of them.
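For illustration, a minimal sketch of what that could look like, assuming a hypothetical process splitter that declares two output channels (the exact multiple-assignment form isn't spelled out in this draft):

// hypothetical: splitter declares two output channels
def (first_ch, second_ch) = splitter(input_ch)
first_ch.println()
second_ch.println()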

Process composition

The result of a process invocation can be passed to another process like any other function, eg:

processDef foo {
  input: 
    val alpha
  output: 
    val delta
    val gamma
  script:
    delta = alpha
    gamma = 'world'
    "some_command_here"
}

processDef bar {
  input:
    val xx
    val yy 
  output:
    stdout()
  script:
    "another_command_here"        
}

bar(foo('Hello'))

Process chaining

Processes can also be invoked as custom operators. For example, a process foo taking one input channel can be invoked as:

ch_input1.foo()

and when taking two channels, as:

ch_input1.foo(ch_input2)

This allows the chaining of built-in operators and processes together eg:

Channel
    .fromFilePairs( params.reads, checkIfExists: true )
    .into { read_pairs_ch; read_pairs2_ch }

index(transcriptome_file)
    .quant(read_pairs_ch)
    .mix(fastqc(read_pairs2_ch))
    .collect()
    .multiqc(multiqc_file)

See the complete script here.

Library file

A library is just an NF script containing one or more processDef declarations. The library can then be imported using the importLibrary statement, e.g.:

importLibrary 'path/to/script.nf'

Relative paths are resolved against the project baseDir variable.
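For illustration, a minimal sketch of a library file and its import under this proposal (the file name and the align process are hypothetical):

// lib/align.nf -- hypothetical library file
processDef align {
    input:
    file reads

    output:
    file 'aligned.bam'

    script:
    """
    your_aligner --in $reads --out aligned.bam
    """
}

// main.nf
importLibrary 'lib/align.nf'

reads_ch = Channel.fromPath(params.reads)
bam_ch = align(reads_ch)
bam_ch.println()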

Test it

You can try the current implementation using version 19.0.0.modules-draft2-SNAPSHOT, e.g.:

NXF_VER=19.0.0.modules-draft2-SNAPSHOT nextflow run rnaseq-nf -r modules

Open points

  1. When a process is defined in a library file, should it be possible to access the params values? Currently it's possible, but I think this is not a good idea because it makes the library depend on the script params, making it very fragile.

  2. How to pass parameters to a process defined in a library file, for example memory and cpus settings? It could be done using the config file as usual, but I expect there could be a need to parametrise the process definition and specify the parameters at invocation time.

  3. Should a namespace be used when defining the processes in a library? What if two or more processes have the same name in different library files?

  4. One or many processes per library file? Currently any number of processes can be defined; I'm starting to think it would be better to allow the definition of only one process per file. This would simplify reuse across different pipelines and the import into tools such as Dockstore, and it would make the dependencies of the pipeline more intelligible.

  5. Remote library files? I'm not sure it's a good idea to be able to import remotely hosted files, e.g. http://somewhere/script.nf. Remote paths tend to change over time.

  6. Should a version number be associated with the process definition? How to use or enforce it?

  7. How to test process components? Ideally it should be possible to include the required container in the process definition and unit test each process independently.

  8. How to chain a process returning multiple channels?

Fantastic stuff, Paolo! I've tried it out and played with having set-based inputs and outputs and it works nicely so far. I also note that this will make unit testing individual processes far easier!

My opinions on the points you raise:

  1. Imported processes should be entirely isolated from other code -- i.e. no access to mutable globals like params (is workflow mutable?) -- to prevent long-range, unintended effects. However, it'd be useful to use the params global within the imported processes. Perhaps the params variable could be set at process invocation. E.g.:

     index(transcriptome_file)
         .quant(read_pairs_ch)
         .mix(fastqc(read_pairs2_ch, params: [outdir: 'my_out_dir']))
         .collect()
         .multiqc(multiqc_file)
    

    Personally, I'd always want the params object to be null unless otherwise specified, and to use params: params if I need to pass the global parameters, but perhaps a config value could specify whether it should take the global params value or null by default?

  2. I would favour the config file options being inherited from the importing workflow, and other variables set at process invocation as described for params above.

  3. Absolutely, we need different namespaces - I can imagine there being multiple processes from different packages sharing the same name. Importing each individual process would be onerous (see answer to q4 below), so namespacing will be essential. Perhaps we can declare something analogous to a package at the head of each library file, and then call package.namespace?

  4. I think it would be very burdensome to have to import each individual process separately: we have many, many processes, and specifying each of them would be tiresome and prone to error. Much better would be to have namespacing and then have users specify the namespace and process name - process names could much more easily be unique within a single namespace.

  5. I would never use remote file loading, but it is very convenient for one-off scripts. The more stable solution would be to have a package repo, or to be able to import an entire git repo's nextflow scripts. E.g.:

     importPackageFromGithub 'nf-core/nextflow-common'
    
  6. I would version code at the package level rather than script level. As with my above answers, this reduces the amount of repeated code. Therefore, within a single project/repo, the user wouldn't specify version numbers for importing individual scripts. I also wouldn't apply version numbers within scripts (again, to reduce duplication) but only at the package level.

  7. Unit testing might be out-of-scope here. However, the approach you've implemented so far means that it is easy to call individual processes with arbitrary inputs and act on outputs in any way desired. I would therefore hope to be able to write JUnit (or similar) tests for individual processes (or sets of processes) and be able to run them multiple times, with different parameters and configuration settings.

  8. I would favour having an additional parameter to the process call specifying the destination of each output channel. The first null value indicates the channel should be used in the current chain. Unhandled channels should raise an exception. E.g.:

     myProcess(inputChannelA, inputChannelB, outputs: [outputChannel1, null, OutputChannel3])
         .subscribe { println "outputChannel2: ${it}" }
    

Hello! Tried this new feature and it looks amazing, thank you!

Coming to your points:

  1. I think params values should not be accessible, but on the other hand I'd second the idea of defining the needed params values at import time and for the current session. Without it, library re-usability will be hampered imo.

  2. I think the ideal would be to have something like

 index(transcriptome_file)
     .quant(read_pairs_ch, task: [cpus: 4, memory: '8 GB'])

where the task specific parameters can be defined at execution time, similarly to what could be done with params.

  3. Yes, absolutely.

  4. One process per library will definitely lead to a jungle of files to be imported/managed either locally or from a remote repository. I see the point of re-usability but it will make much more sense for the end-users to have a library which is scope specific (i.e. QC, or Salmon, or even ChIP-Seq or Metagenomics) and then import and combine single processes at run-time using namespaces.

  5. That could be interesting, but I would only allow a few "trusted" repositories to pull from, where code is checked and verified. It could be on GitHub or under nf-core and Nextflow URLs.

  6. Only on the library itself, and versioning should be linked to a repository in my opinion. It should be something like Conda for instance, so no version specified means take the most recent version. If thinking about a Git repository, then library versions could be the tags (I find commit hashes cumbersome to use, but maybe it's just me).

  7. I would not enforce unit testing here but hopefully, as already stated, this new feature will provide a much simpler common ground to implement testing for both libraries and pipelines using one of the many testing libraries available in Java or Groovy.

  8. Unsure here: on one side I think Luke's idea is interesting, as flagging one specific output channel to be passed to the next process is very useful. On the other side, I think processes having multiple output channels can also be branching points in the DAG, so you need to deal explicitly with the remaining output channels too, and this will break the "chain" of processes anyway.

Great stuff indeed!

In regards to point 3: I also think that namespacing will be invaluable. I really like Python's semantics in this regard (from qctools import fastqc and import qctools). However, using the dot as suggested earlier (for example qctools.fastqc) would conflict with chaining. Perhaps the double-colon semantics could work in that case instead? (qctools::fastqc)

Conversion from a monolithic script to a slim main.nf with imported processes is perfect!! Barriers were minimal.

I would second not having access to params without passing them explicitly, but I would need some way of accessing them since many of my processes have a conditional that executes a different variant of a script depending on a param.
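For illustration, a sketch of the kind of param-driven conditional this refers to (process, tool and param names are hypothetical); the conditional script block itself is standard Nextflow syntax:

process align {
    input:
    file reads

    script:
    // hypothetical param switching between two script variants
    if( params.aligner == 'star' )
        """
        STAR --readFilesIn $reads
        """
    else
        """
        hisat2 -U $reads
        """
}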

If it were possible to use process rather than processDef it would be cleaner but I can live with that difference. Perhaps the keyword moduleProcess would be more explicit.

First, this looks awesome. I'm working with a few people to build some pretty complex NF stuff and this type of thing should make our lives much much easier. 🎉

As for RFC:

  1. I don't think the modules should be given any access to the params object. It just encourages bad habits. If the user really wants globals then they could just define them via the config file.

  2. Would it be possible to expose an API for the object/class (or create one) that actual config files get boiled down to? Then each process could work out its config in the usual way, or we could do something like this:

// define the process (assuming that param ordering under 'input:' matches the ordering used when calling)
processDef assemble_minia {
    input:
    file $reads from reads
    val $prefix from prefix

    output:
    file "${prefix}asm.gfa" into gfa

    script:
    """
    minia -kmer-size 31 -in $reads -out ${prefix}asm
    """
}

And then when we use it:

// load a config file -> all values in this file override any previously set values
reads = file(params.reads)
assemble_minia.load_config("/path/to/file/or/similar")
assem_ch = assemble_minia(reads, "with_custom_config")

Or just update config values individually:

// change the container only - specifically override one value
// NOTE: accessing "params" values outside of processDef
assemble_minia.set_container("${params.docker_repository}company/minia:${params.minia_old_commit}")
old_assem_ch = assemble_minia(reads, "old_version")

assemble_minia.set_container("${params.docker_repository}company/minia:${params.minia_new_commit}")
new_assem_ch = assemble_minia(reads, "new_version")

Thanks a lot for all your comments! I've uploaded another snapshot introducing some refinements and suggestions you provided:

NXF_VER=19.0.0.modules-draft3-SNAPSHOT nextflow info 

Main changes:

  1. I've realised that adding the processDef keyword could be confusing and, above all, not strictly necessary. In this version, when process is used in the main script it works as usual; instead, when it's used in a module definition file, it allows defining a process and therefore from/into should not be used.

  2. importLibrary has been replaced by require, which is a bit more readable.

  3. Parameters. I agree with you that modules should be isolated from command line parameters. At the same time I think there should be a way to inject options into a module component when it's referenced; this would allow the parametrisation of the inner tasks. In the last snapshot I've added the possibility to specify a map of values when the module is referenced via the require statement, e.g.

    require 'module-file.nf', params: [ foo: val1, bar: val2 ]
    

Then in module-file.nf we can have the usual syntax for params as in the main script:

   params.foo = 'x'
   params.bar = 'y'

   process something {
    ''' 
    your_command --here
    '''
   }
  4. Namespace. It can be useful, but I don't think it's dramatically urgent. I think we can add it in a separate iteration.

  5. Remote module repository. The idea is tempting; it could work along the same lines as the nextflow pull command. The module is downloaded from a Git repository and a commit ID or tag can be used to identify a specific version. For example:

    require from: 'nf-core/common-lib', revision: <tag-name>
    

These are the main points. In the next iteration I will try to extend the module concept to allow the definition also of custom functions that can be imported both in the config and script context.

Thanks for the update @pditommaso.

To clarify on the injection of modules: if you wanted to inject params that have been passed as arguments to the nextflow run command, would you do something like below to have default values that could be overridden by args on the nextflow run command line and then passed on to the module?

params.foo = false
params.bar = 50

require 'module-file.nf', params: [ foo: params.foo , bar: params.bar ]

Yes, exactly like that, you can even do

require 'module-file.nf', params: params

Tho both ways are the only thing that I don't like in this approach.

Of course you release this feature after I can't use nextflow anymore. Sigh. :)

I think this feature looks great. Reading through this it seems like this only lets you separate and reuse the definition of single processes, but it doesn't have a way of collecting or aggregating multiple processes into single entity (like a subworkflow). Is that right? Have you given any thought to that or is that still future work?

Regardless, I think this is awesome and I'll continue to wish I was using nextflow instead of what I'm using now...

@mes5k Ah-ah, you have to come back to NF!

but it doesn't have a way of collecting or aggregating multiple processes into single entity (like a subworkflow)

This approach is extremely flexible and the idea is to use a similar mechanism also for sub-workflows.

Awesome! So happy to hear that you're working on this. Will definitely make the job of selling nextflow internally easier!

Uploaded 19.0.0.modules-draft4-SNAPSHOT, which allows the definition of custom functions and nested require inclusions. You can see it in action in this pipeline CRG-CNAG/CalliNGS-NF@1cad86b

However I'm still not happy; I'll try experimenting with the ability to define subworkflows.

@pditommaso does this feature relate to #238 and also #777, #844? I guess, yes.
Please, keep in mind and consider also the following features:

  • dry-run or plan to see the end graph structure;
  • print output channels(variables) if the value can be inferred and doesn't have dependencies;
  • execution of specified file, module or process to be able to run isolated part;
  • syntax checking for *.nf files;

It makes sense to allow running a target process or module of a very large script separately, as a portion of the work. Just look at the definition of targeting for the Terraform tool. It makes it possible to uniquely refer to each module, resource or data source within any provider context by a fully qualified item name. So, examples of the CLI for NF could be written as:

nextflow run -target=process.1A_prepare_genome_samtools
nextflow run -target=module.'rnaseq.nf'.fastqc
nextflow plan -target=process.1A_prepare_genome_samtools
nextflow plan -target=module.'rnaseq.nf'.fastqc

Besides introducing the modules feature to extract common code into a separate file, I hope it will lead to the implementation of the features described above because they are useful and desired.

#238 yes, the others are out of the scope of this enhancement.

@pditommaso let's assume that the feature is done and can be released as experimental.
Let's simply add an extra -enable-modules option which will enable the new module feature. It will preserve backward compatibility and allow end users to test this feature. It's a compromise when you need a new release and feedback. For example, see the experimental -XX:+UnlockExperimentalVMOptions flag for Java 11 in the release notes.

That's the plan.

It is worth adding a version designation to the nf script to help end users identify the version and produce clear error descriptions. For example:

apiVersion: "nextflow.io/v19.0.0-M4-modules"
   or
dslVersion: "nextflow.io/v19.0.0-M4-modules"

where M stands for milestone.

Ok, just uploaded 19.0.0.modules-draft5-SNAPSHOT. Things start to become exciting: it's now possible to define a subworkflow, either in the module script or in the main script, composing the defined processes, e.g.

process foo {
   /your_command/ 
}

process bar {
  /another_command/
}

workflow sub1 {
  foo()
  bar()  
}

Then invoke it as a function, i.e. sub1(). Sub-workflows can have parameters like a regular function, e.g.

 workflow sub1(ch_x, ch_y) {
  foo(ch_x)
  bar(ch_y)  
}

The output of the last invoked process (bar) is implicitly the output of the sub-workflow and it can be referenced in the outer scope as sub1.output.
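A minimal sketch of that outer-scope access, reusing the parametrised sub1 above (hedged: the accessor follows the wording of this draft and may change; the channel contents are invented):

sub1( Channel.from('a'), Channel.from('b') )
sub1.output.println()   // the output of `bar`, the last process invoked inside sub1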

In the main script an anonymous workflow can be defined; it's supposed to be the application entry point and is therefore implicitly executed, e.g.

fasta  = Channel.fromPath(...)
reads = Channel.fromFilePairs(...)
workflow {
  sub1( fasta, reads )
}

Bonus (big one): within a workflow scope the same channel can be used as input in different processes (finally!)
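For illustration, a minimal sketch of that channel reuse (assuming hypothetical processes foo and bar that each declare one input channel):

workflow {
  ch = Channel.from('a', 'b', 'c')
  foo(ch)   // the same channel...
  bar(ch)   // ...feeds two processes, no into{} duplication needed
}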

Hi @pditommaso I've started experimenting and I'm having a hard time getting something working. I'm getting this error:

[master]$ NXF_VER=19.0.0.modules-draft5-SNAPSHOT nextflow run main.nf
N E X T F L O W  ~  version 19.0.0.modules-draft5-SNAPSHOT
Launching `main.nf` [boring_kare] - revision: 66747d681c
ERROR ~ No such variable: x

 -- Check script 'main.nf' at line: 8 or see '.nextflow.log' file for more details

With this code: https://github.com/mes5k/new_school_nf

Can you point me in the right direction?

The processes can only be defined in the module script (to keep compatibility with existing code).

In the main script there must be a workflow to enable the new syntax. Finally, the operator-like syntax was removed because I realised it was useful only in a restricted set of examples and generated confusion in most cases. Your example should be written as:

   to_psv(to_tsv(gen_csv(ch1)))

or

gen_csv(ch1)
to_tsv(gen_csv.output)
to_psv(to_tsv.output)

Awesome, thanks! My first example is now working.

My next experiment was to see if I could import an entire workflow. I can't tell from your comments whether that's something that's supported or whether I've just got a mistake in my code.

Is it possible to assign module process outputs to a variable so that you can do something like

modules.nf

process foo {
    input:
    file(x)

    output:
    file(y)

   script:
    .....
}

process bar {
    input:
    file(a)

    output:
    file(b)

   script:
    .....
}

main.nf

require 'modules.nf'

workflow {
  Channel
    .from('1.txt', '2.txt', '3.txt')
    .set{ ch1 }

  foo_output = foo(ch1)
  bar_output = bar(foo_output)

  bar_output.view()
}

Yes, but it's not necessary. The process can be accessed as a variable to retrieve the output value, i.e.

workflow {
  Channel
    .from('1.txt', '2.txt', '3.txt')
    .set{ ch1 }

  foo(ch1)
  bar(foo.output)
  bar.output.view()
}

My next experiment was to see if I could import an entire workflow

You can define the workflow logic as a sub-workflow, then invoke it ie.

workflow my_pipeline(ch) {
  gen_csv(ch)
  to_tsv(gen_csv.output)
  to_psv(to_tsv.output)
}

workflow {
  ch1 = Channel.fromPath(params.something)
  my_pipeline( ch1 )
}

OK cool thanks. Also you mentioned that you can reuse a channel. Can you therefore do

workflow {
  Channel
    .from('1.txt', '2.txt', '3.txt')
    .set{ ch1 }

  foo(ch1)
  bar(foo.output)
  baz(foo.output)

  bar.output.view()
  baz.output.view()
}

I'm also playing with this idea:

process foo {
   /your_command/ 
}

process bar {
  /another_command/
}

workflow sub1 {
  Channel.from(something) | foo | bar | view()
}

remind you of something? 😆😆

Also you mentioned that you can reuse a channel. Can you therefore do

YES!

Was gonna suggest considering a pipe operator! Railway oriented programming demonstrates this nicely. I think it's worth thinking about chaining processes vs. chaining operators and how the two might mix and match.

Oh! Railway oriented programming .. didn't know! The | as pipe operator would be nice because everybody knows its meaning in Bash. Tho in some contexts it's used to express parallel execution, so it could even be done as something like:

channel >> foo >> (bar | baz) 

Where >> means pipe and | the parallel execution ..

Yup, and with a pipe operator you're getting dangerously close to monads and bind (>>=) operators. I think it's fantastic and am super excited about it, but I also think it's worth taking a long time to think about this because it's worth getting right.

another couple of years :D

Let's say we have a parallel execution as (foo | bar | baz): what is the resulting output supposed to be? An array of three channels corresponding to the respective processes?

Although I see the expressiveness of

workflow {
  channel >> foo >> (bar | baz)
}

I fear that this will end up like some of the Perl one-liners that are knocking around - powerful yet nearly impossible to decode

I prefer

workflow {
  foo(channel)
  bar(foo.output)
  baz(foo.output)
}

IMO that will be easier to read than a long chain of processes, particularly for workflows with many processes.

If there was an option for a chaining/piping syntax such as (foo | bar | baz) I would prefer that we access the outputs explicitly as a dictionary with keys foo, bar and baz

Yes, I agree that potentially it could become too cryptic a syntax, but it's worth experimenting with it. It could also be very expressive.

If there was an option for a chaining/piping syntax such as (foo | bar | baz) I would prefer that we access the outputs explicitly as a dictionary with keys foo, bar and baz

Actually there *is*: being a mere invocation of the process, it's always possible to access the process output as foo.output, etc.

I've uploaded a new iteration, 19.0.0.modules-draft6-SNAPSHOT. The most important change is that from now on the modules feature needs to be activated by adding the following statement at the beginning of the script.

nextflow.enable.modules = true

This allows process modules to be declared also in the main script without breaking existing code.
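For example, a minimal main-script sketch with the new flag (the process body is invented for illustration):

nextflow.enable.modules = true

process foo {
  input:
  val x
  script:
  "echo $x"
}

workflow {
  foo( Channel.from(1, 2, 3) )
}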

It also implements a very experimental pipe operator as sketched above, e.g.

 workflow {
  channel >> foo >> (bar | baz)
}

I've also started to draft the documentation to help you evaluate this feature. You can find it at this link.

What's next

  • Namespace: now that it's possible to define process components in the main script, I'm more convinced that the ones defined in a separate script should be included and referenced using a separate namespace.

  • Workflow inputs/outputs definition: I'm still not so happy with the current function-like schema for (sub)workflow input definitions. Also, there's no definition for outputs.

  • Extend the support for the pipe syntax to channel operators, i.e. map, collect, etc.

I love the pipe idea, but wouldn't it be better to use |> like it already exists in Elixir and currently in proposal stage for javascript?

>> is bitwise shift in a lot of languages (and redirection in bash 😛).

We are restricted to the operators provided by groovy http://groovy-lang.org/operators.html#Operator-Overloading

channel | foo | (bar & baz)

sounds better? 😂

I didn't know you couldn't define your own operator in groovy, my bad!

Then I really have no opinion in the matter, >> or | would do the job fine I guess 😉

We are restricted to the operators provided by groovy http://groovy-lang.org/operators.html#Operator-Overloading

channel | foo | (bar & baz)

sounds better? 😂

I certainly prefer | to >> only because of the common usage in bash and IMO bar & baz is a bit more intuitive

I probably prefer | too given the operator overloading restriction. However, as I think about it I'm having trouble figuring out why we can't treat processes as special operators. A process just transforms data on a channel, it just happens in a separate thread, whereas operators run in the main thread.

@pditommaso I know that you removed the ability to treat these processes as operators, but can you explain the logic behind that? It seems that something like:

channel.foo().into{ x, y }
x.bar()
y.baz()

Would be a little closer to how things have happened in nextflow in the past. Just trying to wrap my head around things!

Mostly because the syntax would clash with the namespace declaration. In the next iteration a process foo declared in the module file x needs to be invoked as x.foo(). Therefore that syntax would not make much sense ..

But here is where the pipe operator comes in, e.g.:

channel | x.foo | ( x.bar & x.baz )

Would Groovy allow you to use :: as the namespace separator? We could pretend we're C++ programmers.

LOL. Currently no, but likely it will in the future as :: is now also a Java operator. However | and & are a good compromise IMO, as NF targets more Bash programmers than C++ ones 😉

Cool, let us know when there's a build with namespaces available so we can experiment!

Just uploaded a new snapshot, 19.0.0.modules-draft7-SNAPSHOT. I think we are approaching a stable implementation. The most notable thing is the ability to define a module namespace. Also, require has been replaced by include to be consistent with the existing includeConfig.

Also there's a preliminary implementation of the process piping operator. More details on the docs page.

One question related to this emerging feature. One idea, e.g. in the nf-core project, would be that we build up a "module library" that can be used and shared by all projects in general, thus making all these small pipeline modules available to the wider community. This would make fixing issues much easier, however I see some issues/potential issues arising as well:

What happens when we e.g. define a fastqc module, but this is changed in an upstream module?

Are there "interfaces" implemented that define how the module has to look like and then producing an error message when this doesn't fit (anymore?). How would one test this when updating the module? Might be that this is already there and/or I'm missing something....

Regarding this, the idea is to add the ability to import modules from Git repos providing the commit id or tag name. This should solve the problem about changing versions.

I wonder if it's a good idea to release a collection of "module scripts" in phases like Bioconductor, or whether it's easier to commit and reference each module separately. The former approach means you would have to provide a single commit/release id for the entire pipeline. The latter approach is more flexible but will require a bit more tracking.

I think we can have both. It would be enough to organise the repo as a collection of scripts, then NF could include one or many (with the same version id)

I agree - would be possible both :-)

I'm struggling with how much package management is needed. If the nextflow files explicitly specify git repos with tags/hashes in the include statements, then do we really need an additional package manager? Nextflow already has support for cloning and tracking repos. Is there an advantage to tracking files and versions using a separate mechanism?

Tend to agree. I would exclude any dependency with an external tool/package manager.

Cromwell just lets you import from a URL, which supports referencing specific GitHub commits. Probably no reason to over-engineer beyond that functionality.

Allowing deps from plain URL is just evil because it can change or break at any time.

Regarding module inclusion with namespaces, I've made some tests but I'm not convinced by this solution. The problems are:

  1. it makes the implementation much more complex for a feature that most people won't use (read: over-engineering)
  2. it makes the NF script more verbose because you need to prefix each imported process or workflow with the module name .. boring
  3. it breaks the configuration for processes whose config is defined by name, because the name is supposed to be module_name.process_name.

For these reasons I'm thinking of using a different model that allows the inclusion of specific processes from a module file, to which it is optionally possible to give an alias, for example:

include FOO from 'module/path'
include FOO from 'module/path' as BAR 

The first imports the FOO process into the current script from the specified module file; the second syntax imports the same process but it will be referenced as BAR.

include 'module/path'

The above syntax allows the import of all components defined in the module.
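For example, a hypothetical usage of the aliased form (module path, input channel and output access are invented for illustration):

include FOO from 'modules/tools.nf' as BAR

workflow {
  ch = Channel.fromPath(params.input)
  BAR(ch)
  BAR.output.view()
}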

Thoughts ?

@minillinim my gut reaction is yeah, you're on the hook for updating versions if you need a global refresh like that. I also think it's a bit of an unlikely scenario, but OK, we can use it as an example. Can you articulate how you'd like it to work? Maybe consider opening a new feature request so that the conversation can be tracked separately.

Just registering my genuine interest here. The modules feature creates an abstraction layer; this can be very convenient (re-use) but raises the question of scope, as perhaps indicated by the discussions relating to the hooks for params, channels, config settings (and directives such as when and publishDir). If one were to add a few of these into this framework, then it gets very close to the actual process definition itself - with only the script section missing. What I like a lot about NF is its expressiveness; a NF process is already very neatly and tidily packaged in a declarative way with not much ballast. In some processes we may also have parameters that will lead to different script settings (i.e. set/unset/change a command line parameter). I have in-depth experience with just a single pipeline, so my perspective is probably very limited. From this perspective, the module feature will be useful for pervasive tasks (say fastqc) that need very little context. Could the when directive be incorporated into this feature?

These are good points. However, since the syntax changed over time, it may not be clear that the new process syntax does not impose the use of a separate module file.

You will still be able to use a single script approach, in which processes will be declared and then invoked. eg.

process foo {
  input:
  file x 
  script:
  """
  your_command --in $x
  """
} 

Channel.fromPath('/some/data/*') | foo()

The when won't change, but I agree with you that the use of some directives such as when and publishDir can be problematic when using separate module files, and may even no longer be recommended.

If we shouldn't be using publishDir in modules (and I can understand why), then perhaps we should add a publishDir operator that does the same thing?

Yes, was thinking something along these lines. Tho I'm starting to think the (sub)workflow should have its own directives to declare inputs/outputs/publish etc. This remains an open point.

I'm just going to chime in here and say that I would love to see a publishDir operator.

With a publishDir operator, you'd need to expose file naming conventions used by the process I assume, or introduce a declarative layer describing its outputs that internally maps to file name globs.

Tried modules with draft7 - it's awesome, built a whole pipeline with them, flawlessly (hopefully specs won't change much now!! :-P). Only noticed that errors tend to be more cryptic than previously.

A few questions (some asked above):

  • how to publish final results?
  • is when going to be dropped entirely or replaced with an equivalent construct?
  • can you call the same process from a module twice? (e.g. by importing twice the same module with different aliases)

Finally, I noticed that if one imports like this:

include 'modules/fastqc.nf' as fastqc

and then in a workflow invokes like this:

fastqc.fastqc(reads)

Then the output is available from fastqc.output (not fastqc.fastqc.output).

Also, how to define a nextflow.config file with profiles affecting e.g. cpu allocations to tasks that may be imported?

Tried modules with draft7

The latest is draft10 and it actually changes quite a bit. Have a look here

how to publish final results?

This remains an open point to be decided

is when going to be dropped entirely or replaced with an equivalent construct?

For now, it won't change, but likely it will be less useful. The use of if should cover most use cases

can you call the same process from a module twice? (e.g. by importing twice the same module with different aliases)

Yes, you won't need to import it with a different alias. Each invocation returns its own instance that can be safely accessed.

Regarding the last point, the include syntax also changed to avoid messing up the process naming in the config file. See here

What is the version that we should use to try the draft10?
I tried 19.0.0.modules-draft10-SNAPSHOT, and it doesn't seem to work.

Unfortunately I didn't upload this version. You need to check out the modules-draft10 branch and compile it.

Playing with modules-draft10 I notice that this code fails:

some_process(in_chan) 
     | map { x -> x*2 }
     | some_other_process

While this code works:

some_process(in_chan) | map { x -> x*2 } | some_other_process

as does this:

some_process(in_chan) \
     | map { x -> x*2 } \
     | some_other_process

It would be nice if | could handle newlines.

I'm also noticing that some errors don't propagate to the UI. For example this code

my_process(in_chan) \
    | another_proc \
    | set { out_chan }

Results in an exception in .nextflow.log, but nothing for the user. The pipeline prints nothing and then exits. I've seen this a bunch as I'm trying to figure out the syntax.

@mes5k Not sure if it helps, but with draft7 the following (with pipes before newline breaks, rather than after) worked:

Channel.fromPath("input/*.fastq.gz") |
    filter{ it.exists() } | 
    ifEmpty { error "No reads found!" }

@pditommaso Thanks for the update, luckily draft7 is not so different from draft10. One follow up question though: if you don't alias two instances (or two imports) of the same process, how do you handle output using the dot notation? E.g. if I call fastqc twice, how does Nextflow know which fastqc.output I am referring to later?

It would be nice if | could handle newlines.

I have little control over the lexer level of the parser, unfortunately.

I'm also noticing that some errors don't propagate to the UI. For example this code

It would be very useful to have a snippet to replicate it and the resulting log file

if you don't alias two instances (or two imports) of the same process, how do you handle output using the dot notation?

Well, you are supposed to access the output before the following invocation.

@maxulysse I've finally uploaded 19.0.0.modules-draft10-SNAPSHOT.

Well, you are supposed to use access the output before the following invocation

There are cases where this might not be possible. In that case, would importing twice under different aliases work? If not, could aliasing at invocation time be supported?

There are cases where this might not be possible.

Don't forget you can also assign a process output to a variable. The processName.output is supposed to be a shortcut to avoid variable proliferation.
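For instance, a minimal sketch of the variable form (channel contents invented; as the follow-up comments show, whether this works may depend on the snapshot in use):

fastqc_out = fastqc( Channel.fromPath('reads/*.fastq.gz') )
fastqc_out.view()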

Good point, sounds like this is the best way to achieve that as it is truly unambiguous. Thanks @pditommaso !

@pditommaso assigning the processName.output to variables doesn't work for me with draft10.

Tried the following:

fastqc(reads | flatMap { x -> x[2] })
fastqc_raw_output = fastqc.output

// more stuff here

fastqc(trimmomatic.output[0] | flatMap { x -> x[2] })
fastqc_trimmed_output = fastqc.output

as well as variations, such as direct assignment in only one line rather than two, with or without .output appended. One call to fastqc works fine, two calls result in an error elsewhere in the code, completely unrelated.

Like @mes5k I am also finding errors in .nextflow.log that do not propagate to the UI, such as:

Apr-09 10:42:10.058 [main] DEBUG nextflow.Session - Session aborted -- Cause: No signature of method: groovyx.gpars.dataflow.DataflowBroadcast.getAt() is applicable for argument types: (Integer) values: [0]

when using draft10

@rspreafico-vir I need some hints on the issue, what's the code causing the error and the complete stack trace.

Thanks @pditommaso ! Here is the stack trace prior to the error posted in my previous comment:

Apr-09 10:41:55.373 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 19.0.0.modules-draft10-SNAPSHOT
Apr-09 10:41:55.392 [main] INFO  nextflow.cli.CmdRun - Launching `main.nf` [thirsty_gilbert] - revision: a765aedefd
Apr-09 10:41:55.405 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /Users/rspreafico/workspace/nf-rnaseq/nextflow.config
Apr-09 10:41:55.406 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /Users/rspreafico/workspace/nf-rnaseq/nextflow.config
Apr-09 10:41:55.474 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Apr-09 10:41:56.096 [main] DEBUG nextflow.Session - Session uuid: b38b6eb5-e329-48e5-bfab-e110d505470b
Apr-09 10:41:56.096 [main] DEBUG nextflow.Session - Run name: thirsty_gilbert
Apr-09 10:41:56.097 [main] DEBUG nextflow.Session - Executor pool size: 4
Apr-09 10:41:56.121 [main] DEBUG nextflow.file.FileHelper - Creating a file system instance for provider: S3FileSystemProvider
Apr-09 10:41:56.127 [main] DEBUG nextflow.Global - Using AWS credentials defined in nextflow config file
Apr-09 10:41:56.128 [main] DEBUG nextflow.file.FileHelper - AWS S3 config details: {secret_key=REDACTED, region=us-west-2, access_key=REDACTED}
Apr-09 10:42:06.686 [main] DEBUG nextflow.cli.CmdRun - 
  Version: 19.0.0.modules-draft10-SNAPSHOT build 5059
  Modified: 01-04-2019 22:59 UTC (15:59 PDT)
  System: Mac OS X 10.14.4
  Runtime: Groovy 2.5.6 on OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12
  Encoding: UTF-8 (UTF-8)
  Process: 44666@rspreafico-vir.local [10.184.235.215]
  CPUs: 4 - Mem: 16 GB (191.2 MB) - Swap: 3 GB (768.8 MB)
Apr-09 10:42:07.097 [main] DEBUG nextflow.Session - Work-dir: s3://vir-nf-batch/work [Mac OS X]
Apr-09 10:42:07.097 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /Users/rspreafico/workspace/nf-rnaseq/bin
Apr-09 10:42:07.218 [main] DEBUG nextflow.Session - Session start invoked
Apr-09 10:42:07.222 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /Users/rspreafico/workspace/nf-rnaseq/trace.tsv
Apr-09 10:42:07.498 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Apr-09 10:42:07.505 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:07.665 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:07.747 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:07.856 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:07.964 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:08.066 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:08.136 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:08.232 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:08.299 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:08.447 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:08.540 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Apr-09 10:42:09.830 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:09.830 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:09.835 [main] INFO  nextflow.executor.Executor - [warm up] executor > awsbatch
Apr-09 10:42:09.850 [main] DEBUG nextflow.util.ThrottlingExecutor - Creating throttling executor with opts: nextflow.util.ThrottlingExecutor$Options(poolName:AWSBatch-executor, limiter:RateLimiter[stableRate=50.0qps], poolSize:20, maxPoolSize:20, queueSize:5000, maxRetries:10, keepAlive:1m, autoThrottle:true, errorBurstDelay:1s, rampUpInterval:100, rampUpFactor:1.2, rampUpMaxRate:1.7976931348623157E308, backOffFactor:2.0, backOffMinRate:0.0166666667, retryDelay:1s)
Apr-09 10:42:09.856 [main] DEBUG nextflow.util.ThrottlingExecutor - Creating throttling executor with opts: nextflow.util.ThrottlingExecutor$Options(poolName:AWSBatch-reaper, limiter:RateLimiter[stableRate=50.0qps], poolSize:20, maxPoolSize:20, queueSize:5000, maxRetries:10, keepAlive:1m, autoThrottle:true, errorBurstDelay:1s, rampUpInterval:100, rampUpFactor:1.2, rampUpMaxRate:1.7976931348623157E308, backOffFactor:2.0, backOffMinRate:0.0166666667, retryDelay:1s)
Apr-09 10:42:09.856 [main] DEBUG n.cloud.aws.batch.AwsBatchExecutor - Creating parallel monitor for executor 'awsbatch' > pollInterval=10s; dumpInterval=5m
Apr-09 10:42:09.859 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: awsbatch)
Apr-09 10:42:09.878 [main] DEBUG nextflow.Global - Using AWS credentials defined in nextflow config file
Apr-09 10:42:09.926 [main] DEBUG nextflow.Session - >>> barrier register (process: gtf2genePred)
Apr-09 10:42:09.928 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > gtf2genePred -- maxForks: 4
Apr-09 10:42:09.948 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:09.948 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:09.949 [main] DEBUG nextflow.Session - >>> barrier register (process: genePred2bed)
Apr-09 10:42:09.949 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > genePred2bed -- maxForks: 4
Apr-09 10:42:09.952 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:09.952 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:09.952 [main] DEBUG nextflow.Session - >>> barrier register (process: gtf2refFlat)
Apr-09 10:42:09.953 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > gtf2refFlat -- maxForks: 4
Apr-09 10:42:09.958 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:09.958 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:09.959 [main] DEBUG nextflow.Session - >>> barrier register (process: fasta2chromSizes)
Apr-09 10:42:09.959 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > fasta2chromSizes -- maxForks: 4
Apr-09 10:42:09.964 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:09.964 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:09.965 [main] DEBUG nextflow.Session - >>> barrier register (process: gtf2intervalList)
Apr-09 10:42:09.965 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > gtf2intervalList -- maxForks: 4
Apr-09 10:42:09.975 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:09.975 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:09.976 [main] DEBUG nextflow.Session - >>> barrier register (process: fastqc)
Apr-09 10:42:09.977 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > fastqc -- maxForks: 4
Apr-09 10:42:09.992 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:09.992 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:09.993 [main] DEBUG nextflow.Session - >>> barrier register (process: trimmomatic)
Apr-09 10:42:09.994 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > trimmomatic -- maxForks: 4
Apr-09 10:42:10.005 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:10.005 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:10.006 [main] DEBUG nextflow.Session - >>> barrier register (process: salmon_index)
Apr-09 10:42:10.007 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > salmon_index -- maxForks: 1
Apr-09 10:42:10.014 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:10.015 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:10.016 [main] DEBUG nextflow.Session - >>> barrier register (process: salmon_quant)
Apr-09 10:42:10.016 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > salmon_quant -- maxForks: 8
Apr-09 10:42:10.020 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:10.020 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:10.021 [main] DEBUG nextflow.Session - >>> barrier register (process: star_index)
Apr-09 10:42:10.021 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > star_index -- maxForks: 1
Apr-09 10:42:10.028 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:10.028 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:10.028 [main] DEBUG nextflow.Session - >>> barrier register (process: star_align)
Apr-09 10:42:10.028 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > star_align -- maxForks: 1
Apr-09 10:42:10.033 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:10.033 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:10.034 [main] DEBUG nextflow.Session - >>> barrier register (process: picard_markduplicates)
Apr-09 10:42:10.034 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > picard_markduplicates -- maxForks: 2
Apr-09 10:42:10.039 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:10.039 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:10.040 [main] DEBUG nextflow.Session - >>> barrier register (process: samtools_index)
Apr-09 10:42:10.040 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > samtools_index -- maxForks: 4
Apr-09 10:42:10.043 [main] WARN  nextflow.extension.OperatorEx - The operator `first` is useless when applied to a value channel which returns a single value by definition
Apr-09 10:42:10.046 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:10.047 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:10.047 [main] DEBUG nextflow.Session - >>> barrier register (process: wig2bigwig)
Apr-09 10:42:10.047 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > wig2bigwig -- maxForks: 4
Apr-09 10:42:10.051 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: awsbatch
Apr-09 10:42:10.052 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'awsbatch'
Apr-09 10:42:10.052 [main] DEBUG nextflow.Session - >>> barrier register (process: dupradar)
Apr-09 10:42:10.052 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > dupradar -- maxForks: 4
Apr-09 10:42:10.058 [main] DEBUG nextflow.Session - Session aborted -- Cause: No signature of method: groovyx.gpars.dataflow.DataflowBroadcast.getAt() is applicable for argument types: (Integer) values: [0]
Possible solutions: getAt(java.lang.String), putAt(java.lang.String, java.lang.Object), set(groovy.lang.Closure), wait(), grep(), tap(groovy.lang.Closure)

As for the code causing the issue, I'm not sure since the UI does not report the error, and there is no indication of the line number or code snippet potentially causing the problem. I do have a bunch of calls like this that seem related to the error message:

picard_markduplicates(star_align.output[0])
samtools_index(picard_markduplicates.output[0])

but I am puzzled because the same pipeline with the same .output[0] syntax worked before...

Ok found the issue, there was a missing .output. So the log got it right, however the issue is that no error was printed on the console.

Tried to call fastqc twice using draft10. Works fine with just first call, but not with two calls.

First attempt:

fastqc(reads | flatMap { x -> x[2] })
fastqc_raw_output = fastqc.output

trimmomatic(reads)
 
fastqc(trimmomatic.output[0] | flatMap { x -> x[2] })
fastqc_trimmed_output = fastqc.output 

Nextflow exits without printing errors to the console, but .nextflow.log reports:

Apr-17 16:42:17.226 [main] DEBUG nextflow.Session - Session aborted -- Cause: Channel `fastqc_raw_output` has been used twice as an output by process `fastqc` and process `fastqc`

Second attempt:

fastqc_raw_output = fastqc(reads | flatMap { x -> x[2] }).output
trimmomatic(reads)
fastqc_trimmed_output = fastqc(trimmomatic.output[0] | flatMap { x -> x[2] }).output

No error on console by Nextflow, but exits. .nextflow.log reads:

Apr-17 16:44:59.654 [main] DEBUG nextflow.Session - Session aborted -- Cause: No such property: output for class: groovyx.gpars.dataflow.DataflowBroadcast

Third attempt:

fastqc_raw_output = fastqc(reads | flatMap { x -> x[2] })
trimmomatic(reads)
fastqc_trimmed_output = fastqc(trimmomatic.output[0] | flatMap { x -> x[2] })

Got an unrelated error from Nextflow console. .nextflow.log reads:

Apr-17 16:52:26.144 [main] DEBUG nextflow.Session - Session aborted -- Cause: Channel `fastqc_raw_output` has been used twice as an output by process `fastqc` and process `fastqc`

(fastqc_raw_output and fastqc_trimmed_output are passed to MultiQC, but only once each)

What is the right way to call the same module twice and store the output from each call?

Next I will merge on master and start to debug this

Looking forward! Other than somewhat cryptic error messages (or no error messages at all on the console) and the inability to call the same module twice, it has been working like a charm, locally and on AWS Batch, super-excited about this. Also looking forward to the publishDir operator ;-)

I'm quite impressed by the expressiveness of the syntax you managed to put together!

It's all thanks to DSL-2! ;-)

The missing console error reporting should be fixed now. Instead, I was too optimistic regarding multiple invocations of the same process. There are still some problems with name conflicts in legacy structures. For now, it's only possible to include the same process with a different alias, i.e. include x as y from z.

These changes have been committed to the master branch. I'm closing this issue because it's becoming too complicated to follow.

I'll open other issues to follow up on specific enhancements. If you find any error/malfunctions please report as a separate issue including the .nextflow.log file and a snippet to replicate the problem.

If the aliasing strategy works, that is perfect for me. Thanks for addressing it!

The aliasing should be supported by draft10 already, correct? 'cause I am trying it but turning this

include fastqc from 'modules/fastqc'

into this

include fastqc as fastqc_raw from 'modules/fastqc'

produces the following error

ERROR ~ Unexpected error [NullPointerException]

 -- Check script 'main.nf' at line: 6 or see '.nextflow.log' file for more details

@rspreafico-vir You need to clone and to build the master branch or use the 19.05.0-SNAPSHOT version.

This works great with 19.05.0-SNAPSHOT. Thank you!!

Is there any current plan for when this might be officially released?

@pditommaso Is this available on the 19.04.1 release?

Nope, kindly see a few comments up. Requires 19.05.0-SNAPSHOT

Thanks @rspreafico-vir. I am on the point of submitting something to nf-core and would dearly love it to be using DSL-2!

DSL-2 will be great for nf-core! It is easy to envision a carefully crafted library of modules, one per tool, in nf-core. In addition to being great for nf-core pipelines, such nf-core modules would be useful per se for end users.

@aunderwo Finally! I remember talking to you about this at the NF conference last year 😄 Really looking forward to this functionality being added to nf-core. Be nice to create a standardised set of modules for the community.

Should we re-open this issue ? Very excited about the new release!