ssadedin / bpipe

Bpipe - a tool for running and managing bioinformatics pipelines

Home Page: http://docs.bpipe.org/


branching + extra steps based on filename/extension?

gdevenyi opened this issue · comments

I'm reading through the language specification and I can't quite determine if this is feasible:

Suppose I have a pipeline that should accept input files of two different filetypes. The pipeline itself can only handle one of the types, but there's a convenient converter available to go between them.

Is it possible to get bpipe to run an optional stage based on the filename, and then pass things on further down the pipeline?

I've thought for a while about adding a feature to support this more naturally, though I've gone back and forth on the right way to do it. There's a fairly simple workaround: catch the exception that is thrown when an input is missing:

hello = {
    try {
        forward input.csv
    }
    catch(bpipe.InputMissingError e) {
        exec """
            cp -v $input.txt $output.csv
        """
    }
}

world = {
    exec """
        cp -v $input.csv $output.xls
    """
}

run {
    hello + world
}

So this pipeline accepts either a .txt file or a .csv file. If you pass a .txt file, the hello stage performs a "conversion" to CSV format; otherwise that step is skipped.
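For reference, invoking that pipeline could look like this (hypothetical script and file names; the behaviour follows directly from the try/catch above):

```shell
# Hypothetical invocations of the pipeline above, saved as pipeline.groovy:
bpipe run pipeline.groovy data.csv   # hello just forwards data.csv onward
bpipe run pipeline.groovy data.txt   # hello copies data.txt to data.csv first
```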

Does this address your case?

Yup, this will handle the case fine!

Hi,

I'm trying this out now and it doesn't seem to work like I'm expecting:

convert = {
  branch.name = "${branch.name}.convert"
    try {
        exec """
            mincconvert -2 -clobber -compress 9 $input.mnc $output.mnc
        """
    }
    catch(bpipe.InputMissingError e) {
        try {
          exec """
              nii2mnc $input.nii.gz $output.mnc
          """
        }
        catch(bpipe.InputMissingError e2) {
          exec """
              nii2mnc $input.nii $output.mnc
          """
        }
    }
}

preprocess = segment {
	//Default best-practices preprocessing pipeline to run on all data
	convert + n4correct + linear_bestlinreg + cutneckapplyautocrop + beast + QC + resample_to_lsq6_space
}

run {
 "%" * [preprocess]
}

When I run this, the convert stage does no actual conversion; the stage is run 3 times, each time with subject3.mnc:

$ cat commandlog.txt 
####################################################################################################
# Starting pipeline at Mon Mar 25 16:20:58 EDT 2019
# Input files:  [../subject1.nii.gz, ../subject2.nii, ../subject3.mnc]
# Output Log:  .bpipe/logs/4891.log
# Stage convert (subject1.nii)
# Stage convert (subject2)
# Stage convert (subject3)
mincconvert -2 -clobber -compress 9 ../subject3.mnc subject3.convert.mnc
mincconvert -2 -clobber -compress 9 ../subject3.mnc subject3.convert.mnc
mincconvert -2 -clobber -compress 9 ../subject3.mnc subject3.convert.mnc

Did I mess something up here?

Scratching my head about how that could happen - just to confirm, you're supplying exactly one input, which is ../subject3.mnc?

I made a dummy version of it myself and wasn't able to reproduce that:

test.groovy:

convert = {
  branch.name = "${branch.name}.convert"
    try {
        exec """
            cat $input.mnc > $output.mnc
        """
    }
    catch(bpipe.InputMissingError e) {
        try {
          exec """
              nii2mnc $input.nii.gz $output.mnc
          """
        }
        catch(bpipe.InputMissingError e2) {
          exec """
              nii2mnc $input.nii $output.mnc
          """
        }
    }
}

preprocess = segment {
    //Default best-practices preprocessing pipeline to run on all data
    convert 
}

run {
 "%" * [preprocess]
}

Which I run using:

touch test.mnc
bpipe run test.groovy  test.mnc

And the result is:

####################################################################################################
# Starting pipeline at Wed Mar 27 09:04:39 AEDT 2019
# Input files:  test.mnc
# Output Log:  .bpipe/logs/29493.log
# Stage convert (test)
cat test.mnc > test.convert.mnc
# ################ Finished at Wed Mar 27 09:04:40 AEDT 2019 Duration = 1.338 seconds ################

I did run this using the 0.9.8-beta version, though I would have thought it should not make a difference in this regard.

No, I'm providing three inputs:
../subject1.nii.gz ../subject2.nii ../subject3.mnc

Ah, now I can reproduce the problem - thanks!

I can see what is happening. The problem is that when you ask for $input.mnc it is actually always found, because the downstream branches have access to the original inputs, which include that input (for all 3 branches).

I will have to think about what the right solution / behavior should be here. Will follow up!
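A minimal sketch of what is happening (hypothetical stage and file names; this just mirrors the behaviour shown in the commandlog above): each `%` branch receives one file, but `$input.mnc` is resolved against the full set of original pipeline inputs, so every branch finds the single `.mnc` file and the try block never throws `InputMissingError`.

```groovy
// Sketch only: with inputs a.nii.gz, b.nii and c.mnc, each branch
// is given one file, but $input.mnc falls back to the original
// pipeline inputs - so all three branches resolve it to c.mnc.
show = {
    exec """
        echo "branch $branch.name resolved: $input.mnc"
    """
}

run {
    "%" * [ show ]
}
```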

Here's a slightly awkward workaround (but arguably, less awkward than the original):

convert = {
    branch.name = "${branch.name}.convert"

    if(file(input).name.endsWith('.mnc')) {
        exec """
            cp -v $input.mnc $output.mnc
        """
    }
    else
    if(file(input).name.endsWith('.gz')) {
        exec """
            cp -v $input.nii.gz $output.mnc
        """
    }
    else
    if(file(input).name.endsWith('.nii')) {
        exec """
            cp -v $input.nii $output.mnc
        """
    }
}

Wonderful. It's sometimes hard to wrap my head around the fact that the pipeline is also, in effect, Java/Groovy code, so clever things like that are available.

So, this works as expected except for one point: downstream, these are the new basenames of the files:
subject1.nii.convert.mnc subject2.convert.mnc subject3.convert.mnc

subject1, which was originally subject1.nii.gz, only got the .gz stripped off, even though the command that processed it was:

nii2mnc $input.nii.gz $output.mnc

I thought that referencing ".nii.gz" hinted to bpipe that it should strip that whole extension from the filename to get the basename.

Unfortunately no, it only replaces the last extension. You can get it to do that by specifying the transform explicitly - something like this:

transform('.nii.gz') to('.mnc') {
  ...
}
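Combining this explicit transform with the extension-checking workaround from earlier in the thread, the convert stage might look something like the sketch below. I have not run this; the command arguments are copied from the examples above, and the nesting of `transform` inside the `if` is an assumption about how the pieces fit together.

```groovy
convert = {
    branch.name = "${branch.name}.convert"

    if(file(input).name.endsWith('.nii.gz')) {
        // Explicit transform so the double extension .nii.gz
        // is replaced by .mnc, not just the trailing .gz
        transform('.nii.gz') to('.mnc') {
            exec """
                nii2mnc $input.nii.gz $output.mnc
            """
        }
    }
    else
    if(file(input).name.endsWith('.nii')) {
        exec """
            nii2mnc $input.nii $output.mnc
        """
    }
    else {
        exec """
            mincconvert -2 -clobber -compress 9 $input.mnc $output.mnc
        """
    }
}
```

Note the `.nii.gz` check must come before the `.nii` check, since `endsWith('.nii')` would never match a `.nii.gz` file but the reverse ordering would misroute nothing only by luck of the suffixes involved here.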

So I guess I've been treating the $input(.ext) and $output(.ext) features as implicit produce and transform statements.

It seems a bit inconsistent that $input.nii.gz will match the input with that extension, but at the same time will not strip everything after $input to get the basename.

A followup, after adding the transform, the pipeline still considers subject1.nii to be the basename:

####################################################################################################
# Starting pipeline at Mon Apr 01 10:16:27 EDT 2019
# Input files:  [../subject1.nii.gz, ../subject2.nii, ../subject3.mnc]
# Output Log:  .bpipe/logs/25089.log
# Stage convert (subject1.nii)
# Stage convert (subject3)
# Stage convert (subject2)
nii2mnc -clobber ../subject2.nii subject2.convert.mnc
nii2mnc -clobber ../subject1.nii.gz subject1.mnc
mincconvert -2 -clobber -compress 9 ../subject3.mnc subject3.convert.mnc

Closing this as the original question was answered.