ssadedin / bpipe

Bpipe - a tool for running and managing bioinformatics pipelines

Home Page: http://docs.bpipe.org/


branching + extra steps based on filename/extension?

gdevenyi opened this issue · comments

I'm reading through the language specification and I can't quite determine if this is feasible:

Suppose I have a pipeline that should accept input files of two different filetypes. The pipeline itself can only handle one of the types, but there's a convenient converter available to go between them.

Is it possible to get bpipe to run an optional stage based on the filename, and then pass things on further down the pipeline?

I've thought for a while about adding a feature to support this more naturally, though I've gone back and forth on the right way to do it. There's a fairly simple workaround: catch the exception that is thrown when an input is missing:

hello = {
    try {
        forward input.csv
    }
    catch(bpipe.InputMissingError e) {
        exec """
            cp -v $input.txt $output.csv
        """
    }
}

world = {
    exec """
        cp -v $input.csv $output.xls
    """
}

run {
    hello + world
}

So this pipeline accepts either a .txt file or a .csv file. If you pass a .txt file, the hello stage performs a "conversion" to CSV format; otherwise that step is skipped.
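For reference, invoking that pipeline could look like this (hypothetical script and file names; the behaviour follows directly from the try/catch above):

```shell
# Hypothetical invocations of the pipeline above, saved as pipeline.groovy:
bpipe run pipeline.groovy data.csv   # hello just forwards data.csv onward
bpipe run pipeline.groovy data.txt   # hello copies data.txt to data.csv first
```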

Does this address your case?

Yup, this will handle the case fine!

Hi,

I'm trying this out now and it doesn't seem to work like I'm expecting:

convert = {
  branch.name = "${branch.name}.convert"
    try {
        exec """
            mincconvert -2 -clobber -compress 9 $input.mnc $output.mnc
        """
    }
    catch(bpipe.InputMissingError e) {
        try {
          exec """
              nii2mnc $input.nii.gz $output.mnc
          """
        }
        catch(bpipe.InputMissingError e2) {
          exec """
              nii2mnc $input.nii $output.mnc
          """
        }
    }
}

preprocess = segment {
	//Default best-practices preprocessing pipeline to run on all data
	convert + n4correct + linear_bestlinreg + cutneckapplyautocrop + beast + QC + resample_to_lsq6_space
}

run {
 "%" * [preprocess]
}

When I run this, the convert stage does no actual conversion; the stage is run 3 times, each time with subject3.mnc:

$ cat commandlog.txt 
####################################################################################################
# Starting pipeline at Mon Mar 25 16:20:58 EDT 2019
# Input files:  [../subject1.nii.gz, ../subject2.nii, ../subject3.mnc]
# Output Log:  .bpipe/logs/4891.log
# Stage convert (subject1.nii)
# Stage convert (subject2)
# Stage convert (subject3)
mincconvert -2 -clobber -compress 9 ../subject3.mnc subject3.convert.mnc
mincconvert -2 -clobber -compress 9 ../subject3.mnc subject3.convert.mnc
mincconvert -2 -clobber -compress 9 ../subject3.mnc subject3.convert.mnc

Did I mess something up here?

Scratching my head about how that could happen - just to confirm, you're supplying exactly one input, which is ../subject3.mnc?

I made a dummy version of it myself and wasn't able to reproduce that:

test.groovy:

convert = {
  branch.name = "${branch.name}.convert"
    try {
        exec """
            cat $input.mnc > $output.mnc
        """
    }
    catch(bpipe.InputMissingError e) {
        try {
          exec """
              nii2mnc $input.nii.gz $output.mnc
          """
        }
        catch(bpipe.InputMissingError e2) {
          exec """
              nii2mnc $input.nii $output.mnc
          """
        }
    }
}

preprocess = segment {
    //Default best-practices preprocessing pipeline to run on all data
    convert 
}

run {
 "%" * [preprocess]
}

Which I run using:

touch test.mnc
bpipe run test.groovy  test.mnc

And the result is:

####################################################################################################
# Starting pipeline at Wed Mar 27 09:04:39 AEDT 2019
# Input files:  test.mnc
# Output Log:  .bpipe/logs/29493.log
# Stage convert (test)
cat test.mnc > test.convert.mnc
# ################ Finished at Wed Mar 27 09:04:40 AEDT 2019 Duration = 1.338 seconds ################

I did run this using the 0.9.8-beta version, though I would have thought it should not make a difference in this regard.

No, I'm providing three inputs:
../subject1.nii.gz ../subject2.nii ../subject3.mnc

Ah, now I can reproduce the problem - thanks!

I can see what is happening. The problem is that when you ask for $input.mnc it is actually always found, because the downstream branches have access to the original inputs, which include that input (for all 3 branches).

I will have to think about what the right solution / behavior should be here. Will follow up!
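A minimal sketch of what is happening (hypothetical stage and file names; this just mirrors the behaviour shown in the commandlog above): each `%` branch receives one file, but `$input.mnc` is resolved against the full set of original pipeline inputs, so every branch finds the single `.mnc` file and the try block never throws `InputMissingError`.

```groovy
// Sketch only: with inputs a.nii.gz, b.nii and c.mnc, each branch
// is given one file, but $input.mnc falls back to the original
// pipeline inputs - so all three branches resolve it to c.mnc.
show = {
    exec """
        echo "branch $branch.name resolved: $input.mnc"
    """
}

run {
    "%" * [ show ]
}
```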

Here's a slightly awkward workaround (but arguably, less awkward than the original):

convert = {
    branch.name = "${branch.name}.convert"

    if(file(input).name.endsWith('.mnc')) {
        exec """
            cp -v $input.mnc $output.mnc
        """
    }
    else
    if(file(input).name.endsWith('.gz')) {
        exec """
            cp -v $input.nii.gz $output.mnc
        """
    }
    else
    if(file(input).name.endsWith('.nii')) {
        exec """
            cp -v $input.nii $output.mnc
        """
    }
}

Wonderful. It's sometimes hard to wrap my head around the fact that the pipeline is also, in effect, Java/Groovy code, so clever things like that are available.

So, this works as expected except for one point: downstream, these are the new basenames of the files:
subject1.nii.convert.mnc subject2.convert.mnc subject3.convert.mnc

subject1, which was originally subject1.nii.gz, only got the .gz stripped off, even though the command that processed it was:

nii2mnc $input.nii.gz $output.mnc

I thought that referencing ".nii.gz" hinted to bpipe that it should strip that whole extension from the filename to get the basename.

Unfortunately no, it only replaces the last extension. You can get it to do that by specifying the transform explicitly - something like this:

transform('.nii.gz') to('.mnc') {
  ...
}
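Combining this explicit transform with the extension-checking workaround from earlier in the thread, the convert stage might look something like the sketch below. I have not run this; the command arguments are copied from the examples above, and the nesting of `transform` inside the `if` is an assumption about how the pieces fit together.

```groovy
convert = {
    branch.name = "${branch.name}.convert"

    if(file(input).name.endsWith('.nii.gz')) {
        // Explicit transform so the double extension .nii.gz
        // is replaced by .mnc, not just the trailing .gz
        transform('.nii.gz') to('.mnc') {
            exec """
                nii2mnc $input.nii.gz $output.mnc
            """
        }
    }
    else
    if(file(input).name.endsWith('.nii')) {
        exec """
            nii2mnc $input.nii $output.mnc
        """
    }
    else {
        exec """
            mincconvert -2 -clobber -compress 9 $input.mnc $output.mnc
        """
    }
}
```

Note the `.nii.gz` check must come before the `.nii` check, since `endsWith('.nii')` would never match a `.nii.gz` file but the reverse ordering would misroute nothing only by luck of the suffixes involved here.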

So I guess I've been treating the $input(.ext) and $output(.ext) features as implicit produce and transform statements.

It seems a bit inconsistent that $input.nii.gz will match the input with that extension, but at the same time will not strip everything after $input to get the basename.

A followup, after adding the transform, the pipeline still considers subject1.nii to be the basename:

####################################################################################################
# Starting pipeline at Mon Apr 01 10:16:27 EDT 2019
# Input files:  [../subject1.nii.gz, ../subject2.nii, ../subject3.mnc]
# Output Log:  .bpipe/logs/25089.log
# Stage convert (subject1.nii)
# Stage convert (subject3)
# Stage convert (subject2)
nii2mnc -clobber ../subject2.nii subject2.convert.mnc
nii2mnc -clobber ../subject1.nii.gz subject1.mnc
mincconvert -2 -clobber -compress 9 ../subject3.mnc subject3.convert.mnc

Closing this as the original question was answered.