Inputs when pooling results
hh1985 opened this issue
Hi @ssadedin ,
I used a Bpipe script to process fastq files, but got unexpected inputs in the final pooling step. The workflow looks like this:
check_input + "%_[rR]*.fastq.gz" * [ kneaddata + concatenate + humann2] + mergeMetaphlan
The humann2 stage produces a tsv file, so I assumed the inputs for the mergeMetaphlan stage would be the collection of tsv files. Instead, inputs resolves to the fastq.gz files (the files matched by "%_[rR]*.fastq.gz"). This is not the behavior I expected.
The log looks like:
bpipe.PipelineCategory [1] INFO |12:44:10 There [[id:null, stageName:humann2, startMs:1530031449788, endMs:1530031449801, branch:p136C, threadId:40, succeeded:true], [id:null, stageName:humann2, startMs:1530031449788, endMs:1530031449811, branch:p136N, threadId:41, succeeded:true]] parallel paths in final stage
bpipe.PipelineCategory [1] INFO |12:44:10 Last merged outputs are [/home/hanh/projects/xbiome_pipeline_16s/test/meta_profiling/humann2Prof/p136C_humann2_out/p136C_humann2_temp/p136C_metaphlan_bugs_list.tsv, /home/hanh/projects/xbiome_pipeline_16s/test/meta_profiling/humann2Prof/p136N_humann2_out/p136N_humann2_temp/p136N_metaphlan_bugs_list.tsv]
bpipe.Utils [1] INFO |12:44:10 Setting output [/home/hanh/projects/xbiome_pipeline_16s/test/meta_profiling/humann2Prof/p136C_humann2_out/p136C_humann2_temp/p136C_metaphlan_bugs_list.tsv, /home/hanh/projects/xbiome_pipeline_16s/test/meta_profiling/humann2Prof/p136N_humann2_out/p136N_humann2_temp/p136N_metaphlan_bugs_list.tsv] on context 1221981006 in thread 1
bpipe.PipelineCategory [1] INFO |12:44:10 Merged stage name is humann2_humann2_bpipe_merge
bpipe.PipelineStage [1] INFO |12:44:10 Stage 2 returned null as default inputs for next stage
bpipe.PipelineStage [1] INFO |12:44:10 Inputs are NOT being inferred from context.output (context.nextInputs=null)
bpipe.PipelineStage [1] INFO |12:44:10 Inferring nextInputs from inputs bpipe.PipelineContext@cd1d761.@input
bpipe.PipelineStage [1] INFO |12:44:10 No explicit output on stage 1914108708 context 215078753 so output is nextInputs [/home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136C_R1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136N_R1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136N_R2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136C_R2.fastq.gz]
bpipe.Utils [1] INFO |12:44:10 Setting output [/home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136C_R1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136N_R1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136N_R2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136C_R2.fastq.gz] on context 215078753 in thread 1
bpipe.PipelineStage [1] INFO |12:44:10 Setting next inputs [/home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136C_R1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136N_R1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136N_R2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/data/meta_tutorial/mgs_tutorial_Oct2017/raw_data/run/p136C_R2.fastq.gz] on stage 1914108708, context 215078753 in thread 1
I have to use inputs.tsv in mergeMetaphlan to force the inputs to resolve to the tsv files. This could be a problem if a split stage X also produces fastq.gz files and I want to combine the fastq.gz outputs from all branches of stage X.
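The workaround above might look roughly like the following. It is only a sketch, not the actual stage from this pipeline: the output file name merged_metaphlan.tsv is a placeholder, and the call to merge_metaphlan_tables.py (the table-merging script distributed with MetaPhlAn) is an assumption about what the merge step runs.

```groovy
// Sketch only: the output name and the merge command are assumptions,
// not taken from the original pipeline.
mergeMetaphlan = {
    // produce declares the single pooled output of this stage
    produce("merged_metaphlan.tsv") {
        // $inputs.tsv resolves to the available inputs with a .tsv extension,
        // so the fastq.gz files forwarded by default are not used here
        exec "merge_metaphlan_tables.py $inputs.tsv > $output.tsv"
    }
}
```

Filtering by extension only helps while the pooled files have a different extension from the raw inputs, which is exactly the limitation raised above for a stage that emits fastq.gz files.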
Any suggestions of the best practice for pooling results?
Thanks,
-hh1985
I've actually just committed a change which may be relevant to this.
What is happening is that Bpipe sees the last stage (humann2) as not declaring an output. In that case it automatically forwards the previous input as the default input for downstream stages, and here it incorrectly resolves that input from the overall parallel block rather than from the last stage. The commit fixes this problem. It would be really helpful if you could build from source off master and see whether that corrects the behavior.
I'm curious, though, whether the humann2 stage actually declares an output. Or is it one of the prior stages that creates the .tsv files?
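For reference, a stage usually declares its output through a transform or produce block. The following is a minimal sketch with a hypothetical stage name and a placeholder command (my_profiler is not a real tool); it is only meant to show what a declared output looks like, not how humann2 is wrapped in this pipeline.

```groovy
// Hypothetical stage: the name, command and extensions are placeholders.
profile = {
    // transform ... to declares that an input ending in .fastq.gz
    // yields an output with the same base name ending in .tsv
    transform(".fastq.gz") to(".tsv") {
        // referencing $output.tsv ties the command to the declared output
        exec "my_profiler --in $input.gz --out $output.tsv"
    }
}
```

If the command writes its files somewhere Bpipe is not told about, the stage can look as though it produced nothing, which would match the "returned null as default inputs" line in the log.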
Hi @ssadedin,
The new code doesn't fix my problem:(
I tried to reproduce the error with a simple script, but there Bpipe works as expected.
My Bpipe script uses a custom library (.jar) that maps host paths to paths inside Docker, which might be causing the trouble.
In the following example,
cutprimer = {
    requires outdir: "The directory for storing trimmed fastq"
    output.dir = output.dir + '/' + outdir
    def fprimer = REGISTER.locateParams('workflow', 'data').forward_primer
    def rprimer = REGISTER.locateParams('workflow', 'data').reverse_primer
    transform("*.fastq.gz") to(".cutP.fastq.gz") {
        def obj = configEnv(stageName)
        // mapPathIn / mapPathOut translate host paths to paths inside Docker
        def cmd = "cutadapt -g $fprimer -G $rprimer -o ${obj.mapPathOut(file(output1).getAbsolutePath())} -p ${obj.mapPathOut(file(output2).getAbsolutePath())} ${REGISTER.rewireParams('stage', stageName)} ${obj.mapPathIn(input1 as String)} ${obj.mapPathIn(input2 as String)}"
        exec obj.run2(cmd)
    }
    println outputs
    forward outputs
}
adapter3 = {
    println inputs  // first println: before the transform block
    transform("*.cutP.fastq.gz") to(".txt") {
        println inputs  // second println: inside the transform block
        exec "touch $outputs"
    }
}
Bpipe.run {
    // stage name matches the definition above (cutprimer, as shown in the log)
    "%_*.fastq.gz" * [cutprimer.using(outdir: "cutPrimer")] + adapter3
}
outputs in cutprimer are as expected: ***.cutP.fastq.gz. In stage adapter3, the first println of inputs prints ***.fastq.gz, while the second (inside the transform block) prints ***.cutP.fastq.gz.
The log looks like this:
bpipe.PipelineCategory [1] INFO |11:25:57 There [[id:null, stageName:cutprimer, startMs:1533828357385, endMs:1533828357481, branch:hrk20180713-015-355-224, threadId:36, succeeded:true], [id:null, stageName:cutprimer, startMs:1533828357385, endMs:1533828357482, branch:hrk20180713-015-131-634, threadId:37, succeeded:true], [id:null, stageName:cutprimer, startMs:1533828357385, endMs:1533828357482, branch:hrk20180713-015-141-926, threadId:38, succeeded:true]] parallel paths in final stage
bpipe.PipelineCategory [1] INFO |11:25:57 Last merged outputs are [/home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_2.cutP.fastq.gz]
bpipe.Utils [1] INFO |11:25:57 Setting output [/home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_2.cutP.fastq.gz] on context 260084831 in thread 1
bpipe.PipelineCategory [1] INFO |11:25:57 Merged stage name is cutprimer_cutprimer_cutprimer_bpipe_merge
bpipe.PipelineStage [1] INFO |11:25:57 Stage 2 returned null as default inputs for next stage
bpipe.PipelineStage [1] INFO |11:25:57 Inputs are NOT being inferred from context.output (context.nextInputs=null)
bpipe.PipelineStage [1] INFO |11:25:57 Inferring nextInputs from inputs bpipe.PipelineContext@43f82e78.@input
bpipe.PipelineStage [1] INFO |11:25:57 No explicit output on stage 460570271 context 1140338296
bpipe.PipelineStage [1] INFO |11:25:57 Setting next inputs [/home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-355-224_S50_L001_2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-131-634_S48_L001_2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-141-926_S45_L001_2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-141-926_S45_L001_1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-131-634_S48_L001_1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-355-224_S50_L001_1.fastq.gz] on stage 460570271, context 1140338296 in thread 1
I notice that the id is null ...
A simple script that works normally produces a log like this:
bpipe.PipelineCategory [1] INFO |9:26:05 There [[id:0_0-0, stageName:step1, startMs:1533821165525, endMs:1533821165616, branch:abc, threadId:33, succeeded:true], [id:0_0-0, stageName:step1, startMs:1533821165525, endMs:1533821165616, branch:xyz, threadId:34, succeeded:true]] parallel paths in final stage
bpipe.PipelineCategory [1] INFO |9:26:05 Last merged outputs are [/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt]
bpipe.Utils [1] INFO |9:26:05 Setting output [/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt] on context 127702987 in thread 1
bpipe.PipelineCategory [1] INFO |9:26:05 Merged stage name is step1_step1_bpipe_merge
bpipe.PipelineStage [1] INFO |9:26:05 Inputs are NOT being inferred from context.output (context.nextInputs=[/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt])
bpipe.PipelineStage [1] INFO |9:26:05 No explicit output on stage 1884155890 context 237344028
bpipe.PipelineStage [1] INFO |9:26:05 Setting next inputs [/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt] on stage 1884155890, context 237344028 in thread 1