snakemake-workflows / docs

Documentation of the Snakemake-Workflows project

dropSeqPipe - Single cell data preprocessing snakemake workflow

Hoohm opened this issue · comments

Hello,

with the latest version of my pipeline, I am trying to turn it into a Snakemake workflow.
I'm kindly asking for a review.

I have not yet worked on specific envs for each rule but this can be done in the future without too much effort.

Please tell me if there is anything else that I need to implement to pass the review.

Best wishes

Great, this looks very promising! I have only a few points that should be solved before inclusion here:

  1. I know there is no strict formatting guide yet, but I try to establish a certain standard in this organization, namely: (a) input, output, ... items on new lines and indented; (b) threads only specified if != 1; (c) no whitespace around the = operator in input, output, ... (a hypothetical example follows this list).
  2. Use a wrapper when possible (e.g., for star, fastqc and multiqc), see here.
  3. Add a conda directive to every other rule. It is fine if multiple rules point to the same conda environment file.
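
For illustration, a minimal rule following these conventions might look like this (the rule name, paths and environment file are made up):

rule sort_bam:
    input:
        'data/{sample}.bam'
    output:
        bam='data/{sample}.sorted.bam'
    conda:
        'envs/samtools.yaml'
    shell:
        'samtools sort -o {output.bam} {input}'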

Great, I'll work on those issues soon :)

@johanneskoester I have a question.
I'm not sure how I can use the fastqc wrapper, since I want to process two files at the same time and use 2 threads. Should I split them?
I'm also using the --extract option, which is not offered by the wrapper.
Any ideas?
EDIT: found out how to do it. Had to split R1 and R2 though.

Might be that it is not possible. Then, feel free to not use it for now.

@johanneskoester
I'm almost done implementing the changes. I have an issue with trimmomatic.
I'm using the log as an input for a multiqc rule, and hence I need it listed as an output. The problem is that the wrapper adds all entries from the output directive to the command, which leads to an error.

rule trim_single:
	input:
		'data/{sample}_trimmed_unmapped.fastq.gz'
	output:
		data='data/{sample}_filtered.fastq.gz',
		log='logs/{sample}_trimlog.txt'
	log:
		'logs/{sample}_trimlog.txt'
	params:
		trimmer=['LEADING:3','TRAILING:3','SLIDINGWINDOW:4:20','MINLEN:15', 'ILLUMINACLIP:$CONDA_PREFIX/share/trimmomatic/adapters/{}:2:30:10'.format(config['FILTER']['IlluminaClip'])],
		extra='-threads 2'
	threads: 2
	wrapper:
		'0.21.0/bio/trimmomatic/se'

Here is the command produced:
trimmomatic SE -threads 2 data/sample1_trimmed_unmapped.fastq.gz data/sample1_filtered.fastq.gz logs/sample1_trimlog.txt LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:15 ILLUMINACLIP:$CONDA_PREFIX/share/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 > logs/sample1_trimlog.txt 2>&1

One way to solve this would be to change the wrapper to:

shell("trimmomatic SE {snakemake.params.extra} "
      "{snakemake.input} {snakemake.output[0]} "
      "{trimmer} "
      "{log}")

I tried to make a pull request on Bitbucket, but it failed; I'm not sure why (there was no error message).

Well, if you really need it as an input, you can simply omit it in the log directive. However, this is discouraged, since Snakemake will then delete the log upon error (which is usually not what you want). I guess the problem is the multiqc wrapper?
Maybe you can simply omit that file from the input files of multiqc and add the log path as a param instead.

I'm not sure how to do this. If I don't have the log files from trimmomatic as input, how would the multiqc rule know when to run?

You are right, this is not convincing. I just modified Snakemake to allow log files as input. This will solve your problem, as you don't need to specify it as additional output file anymore.

I will release a new version today.

Well, I guess it makes sense! Cool, I'll finish the requested changes after the Snakemake update.

Snakemake 4.6.0 has been released. You can now use the log file as input to the multiqc rule. Make sure that it is only defined as a log file, not as an output file.
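
For illustration, a minimal multiqc rule along those lines might look like this (the paths, the expand over samples.index and the wrapper version are assumptions based on the rules above; the trimmomatic log stays only under the log directive of trim_single):

rule multiqc:
    input:
        # the trimmomatic logs, declared only under log: in trim_single
        expand('logs/{sample}_trimlog.txt', sample=samples.index)
    output:
        'reports/multiqc.html'
    wrapper:
        '0.21.0/bio/multiqc'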

Hello,

coming back with an issue for a wrapper, this time the STAR wrapper.

This is what my rule looks like:

rule STAR_align:
	input:
		fq1="data/{sample}_filtered.fastq.gz",
		index=lambda wildcards: star_index_prefix + '_' + str(samples.loc[wildcards.sample,'read_length']) + '/SA'
	output:
		data=temp('logs/{sample}.Aligned.out.bam')
	log:
		log_out='logs/{sample}.Log.final.out'
	params:
		extra="""--outReadsUnmapped Fatsx\
			 	--outFilterMismatchNmax {}\
			 	--outFilterMismatchNoverLmax {}\
			 	--outFilterMismatchNoverReadLmax {}\
			 	--outFilterMatchNmin {}""".format(
				config['STAR_PARAMETERS']['outFilterMismatchNmax'],
				config['STAR_PARAMETERS']['outFilterMismatchNoverLmax'],
				config['STAR_PARAMETERS']['outFilterMismatchNoverReadLmax'],
				config['STAR_PARAMETERS']['outFilterMatchNmin']),
		index=lambda wildcards: star_index_prefix + '_' + str(samples.loc[wildcards.sample,'read_length']) + '/'	
	threads: 24
	wrapper:
		'0.21.0/bio/star/align'

I get some warnings

/share/big2/Test_data/DropSeq/.snakemake/scripts/j8re728z.wrapper.py:18: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert(fq1 is not None, "input-> fq1 is a required input parameter")
/share/big2/Test_data/DropSeq/.snakemake/scripts/j8re728z.wrapper.py:23: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert(len(fq1) == len(fq2), "input-> equal number of files required for fq1 and fq2")
STAR --outReadsUnmapped Fatsx			 	--outFilterMismatchNmax 10			 	--outFilterMismatchNoverLmax 0.3			 	--outFilterMismatchNoverReadLmax 1 --outFilterMatchNmin 0 --runThreadN 8 --genomeDir /home/patrick/big/references/mouse_91/STAR_INDEX/SA_100/ --readFilesIn data/sample1_filtered.fastq.gz  --readFilesCommand zcat --outSAMtype BAM Unsorted --outFileNamePrefix logs/ --outStd Log  > logs/sample1.Log.final.out 2>&1

It still runs fine, but I'm not sure this is intended.

The warnings are fixed now, but the next wrapper release is waiting on another issue. For you, it is fine to keep ignoring them for now. How close are you to inclusion here?

The changes to the code have been made. The wrappers are now in place and working. I'm now testing the few datasets I have to check that everything runs fine; I'll probably push it early next week.

It's a detail, but I'd like to know whether you have any naming convention for config files. I went with camelCase for "end variables" and CAPITALS for subsections.

SUBSECTION:
    SUBSECTION2:
        variableOne:

I would use lower case, and foo-bar for composed words. Camel case does not fit nicely with the rest of Python, where it is only used for classes. Instead of foo-bar, you can also use snake_case. But lower case is, I think, preferred by most, because it is easier on the eyes.

I would definitely like a different naming convention between sections and variables.

Would this be OK with you?

FILTER:
    cell-barcode:
        start: 1
        end: 6
        min-quality: 20
        num-below-quality: 1

The main reason is that it fits the different major steps of the pipeline. It's easy to read/understand which part of the config file has an influence on which part of the pipeline.

Yeah, that's OK. To me, all lowercase would still be sufficient, because you can also add comments or empty lines to highlight the sections, but your approach is also fine.

So there seems to be a bug regarding the use of logfiles as input.
I'm running a rule that depends on the STAR logs and it wants to rerun STAR although the files exist.
The reason for rerunning is "missing output files"

rule STAR_align:
	input:
		fq1="data/{sample}_filtered.fastq.gz",
		index=lambda wildcards: star_index_prefix + '_' + str(samples.loc[wildcards.sample,'read_length']) + '/SA'
	output:
		data=temp('data/{sample}/Aligned.out.bam')
	log:
		log_out='data/{sample}/Log.final.out'
	params:
		extra="""--outReadsUnmapped Fatsx\
			 	--outFilterMismatchNmax {}\
			 	--outFilterMismatchNoverLmax {}\
			 	--outFilterMismatchNoverReadLmax {}\
			 	--outFilterMatchNmin {}""".format(
				config['STAR_PARAMETERS']['outFilterMismatchNmax'],
				config['STAR_PARAMETERS']['outFilterMismatchNoverLmax'],
				config['STAR_PARAMETERS']['outFilterMismatchNoverReadLmax'],
				config['STAR_PARAMETERS']['outFilterMatchNmin']),
		index=lambda wildcards: star_index_prefix + '_' + str(samples.loc[wildcards.sample,'read_length']) + '/'	
	threads: 24
	wrapper:
		'0.21.0/bio/star/align'

There is one thing that might be related. I'm not sure how Snakemake works internally, but since the actual output file doesn't exist (because it's temp), could it be that while checking for the existing log file it decides it should also have those bam files and reruns the rule for that reason, even though the reason message points at the wrong file?

I have just checked it locally, and there is no such problem. It must be another job in your DAG that has to run and needs the temp bam file.

Odd, maybe I'm missing something. Here is the list of rules to run

rule STAR_align:
    input: data/L2-SCRB-Opt-2-1C_filtered.fastq.gz, /naslx/projects/pr62lo/di49qar/reference/mouse_91/STAR_INDEX/SA_100/SA
    output: data/L2-SCRB-Opt-2-1C/Aligned.out.bam
    log: data/L2-SCRB-Opt-2-1C/Log.final.out
    jobid: 3
    reason: Missing output files: data/L2-SCRB-Opt-2-1C/Log.final.out
    wildcards: sample=L2-SCRB-Opt-2-1C


localrule plot_yield:
    input: logs/L2-SCRB-Opt-2-1C_CELL_barcode.txt, logs/L2-SCRB-Opt-2-1C_UMI_barcode.txt, logs/L2-SCRB-Opt-2-1C_reads_left.txt, data/L2-SCRB-Opt-2-1C/Log.final.out, logs/L2-SCRB-Opt-2-1C_reads_left_trim.txt
    output: plots/yield.pdf
    jobid: 0
    reason: Missing output files: plots/yield.pdf; Input files updated by another job: data/L2-SCRB-Opt-2-1C/Log.final.out

Shutting down, this might take some time.
Job counts:
	count	jobs
	1	STAR_align
	1	plot_yield
	2

this is the command I run: snakemake plot_yield --dryrun -r

There are a few steps between them, but normally when I delete the plot from the plot_yield rule, it just runs the plot again based on the old logfiles.

I'll look more into it, but if you have an idea of what to look for, it would help a lot.

This is indeed a bit weird. It also explicitly lists the log file as missing... are you sure it is there? What is the output of ls -l data/L2-SCRB-Opt-2-1C/Log.final.out?

Yeah the file is there, no issue there.
-rw-r--r-- 1 di49qar pr62lo 1857 Feb 16 12:01 data/L2-SCRB-Opt-2-1C/Log.final.out
I just swapped to the old version of the STAR_align rule, not using the wrapper, and now it works again. Only plot_yield is invoked.

I don't see how this could be related to the wrapper. Could you post both versions of the rule?

The non-wrapper version:

rule STAR_align:
	input:
		data="data/{sample}_filtered.fastq.gz",
		index=lambda wildcards: star_index_prefix + '_' + str(samples.loc[wildcards.sample,'read_length']) + '/SA'
	output:
		sam=temp('logs/{sample}/Aligned.out.bam'),
		log_out='logs/{sample}/Log.final.out'
	params:
		prefix='logs/{sample}/',
		outFilterMismatchNmax=config['STAR_PARAMETERS']['outFilterMismatchNmax'],
		outFilterMismatchNoverLmax=config['STAR_PARAMETERS']['outFilterMismatchNoverLmax'],
		outFilterMismatchNoverReadLmax=config['STAR_PARAMETERS']['outFilterMismatchNoverReadLmax'],
		outFilterMatchNmin=config['STAR_PARAMETERS']['outFilterMatchNmin'],
		read_length=lambda wildcards: int(samples.loc[wildcards.sample,'read_length'])
	threads: 24
	shell:
		"""
			--genomeDir {star_index_prefix}_{params.read_length}/\
			--readFilesCommand zcat\
			--runThreadN {threads}\
			--readFilesIn {input.data}\
			--outSAMtype BAM Unsorted\
			--outReadsUnmapped Fatsx\
			--outFileNamePrefix {params.prefix}\
			--outFilterMismatchNmax {params.outFilterMismatchNmax}\
			--outFilterMismatchNoverLmax {params.outFilterMismatchNoverLmax}\
			--outFilterMismatchNoverReadLmax {params.outFilterMismatchNoverReadLmax}\
			--outFilterMatchNmin {params.outFilterMatchNmin}"""

I tried another rule that depends on other logfiles; I get the same issue.
I also tried adding the logfile as an output to the wrapper-based STAR_align rule; it doesn't help.

I think I might have fixed it for the STAR_align. I added the log file to the output.
Trying it on the trimmomatic now.

Yeah, but that should not be necessary. Could you try to create a minimal example, so that I can debug it on my side?

Wait, I think I have found the problem and am now also able to reproduce. Working on a fix now.

Ok, fixed in the master branch. Thanks for reporting. I will create a new release next week.

Cool! Glad I could help.
I hope this is the last fix. I'll be able to push the new release just after yours.

EDIT: I have the minimal example ready if it is still needed.

I tried the fixed version, it works properly.

New version is released as 4.7.0.

I'm running the last tests today, should be able to push the new version today.

Cool! Looking forward to moving it here!

:( an old bug came back with the wrapper modification on STAR.
It's kind of tricky to explain.

Basically, I have a rule split_species that groups species for each sample. I think it comes from plot_barnyard: it tries to match a combination of SAMPLE_SPECIES to the sample wildcard instead of just SAMPLE. Of course, this doesn't work with the index lookup in the mapping step:

rule STAR_align:
	input:
		fq1="data/{sample}_filtered.fastq.gz",
		index=lambda wildcards: star_index_prefix + '_' + str(samples.loc[wildcards.sample,'read_length']) + '/SA'
	output:
		temp('data/{sample}/Aligned.out.bam')
	log:
		'data/{sample}/Log.final.out'
	params:
		extra="""--outReadsUnmapped Fatsx\
			 	--outFilterMismatchNmax {}\
			 	--outFilterMismatchNoverLmax {}\
			 	--outFilterMismatchNoverReadLmax {}\
			 	--outFilterMatchNmin {}""".format(
				config['STAR_PARAMETERS']['out-filter-mismatch-nmax'],
				config['STAR_PARAMETERS']['out-filter-mismatch-nover-lmax'],
				config['STAR_PARAMETERS']['out-filter-mismatch-nover-read-lmax'],
				config['STAR_PARAMETERS']['out-filter-match-nmin']),
		index=lambda wildcards: star_index_prefix + '_' + str(samples.loc[wildcards.sample,'read_length']) + '/'	
	threads: 24
	wrapper:
		"0.22.0/bio/star/align"

Here is the error:

InputFunctionException in line 10 of /share/big2/Test_data/DropSeq_mixed/rules/map.smk:
KeyError: 'the label [Experiment_MOUSE] is not in the [index]'
Wildcards:
sample=Experiment_MOUSE

I'm looking into it and might create a minimal example to illustrate the issue.

Maybe it is a good idea to put a global wildcard_constraint on sample. E.g.:

wildcard_constraints:
    sample="({})".format("|".join(samples.index))

Then, you should be better able to see where the problem is.

I added the constraint and it seems to work now. Could you explain to me what this does?

To me, it looks like I constrain the sample wildcard to values taken only from samples.index.

Yes, exactly. It generates a regular expression that only matches those values. Sometimes this is necessary because matching can be ambiguous if you have multiple wildcards.
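
For instance, with hypothetical sample names the constraint expands to a plain regex alternation:

# assuming samples.index == ['sample1', 'sample2']
wildcard_constraints:
    sample="(sample1|sample2)"  # the sample wildcard can now only match these two values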

Do you want to move the repo here now? Or are there remaining issues? Do you already have Travis CI-based tests enabled?

I'm not familiar with Travis CI. I have subscribed and I'll look into it, but before that I will push the new release: there are a lot of modifications that I want to push and make available.
I hope this was the last big issue.

So how am I supposed to "move" the repo?

There is a transfer ownership section in the settings.

For an example Travis setup have a look at the rnaseq workflow.
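
For orientation, such a setup usually boils down to installing Miniconda plus Snakemake and then running the workflow on the small test data; a rough .travis.yml sketch (URLs, versions and the .test directory are assumptions, not the exact rnaseq config):

language: python
python:
    - "3.6"
install:
    # install Miniconda and Snakemake (assumed setup)
    - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    - bash miniconda.sh -b -p $HOME/miniconda
    - export PATH="$HOME/miniconda/bin:$PATH"
    - conda install -y -c conda-forge -c bioconda snakemake
script:
    # assumes the test config and data live in .test, as discussed further below
    - snakemake --use-conda --directory .test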

Ok, change of plans: we fork it from here and set up Backstroke to automatically create a PR from each of your commits to the main repo. Before I can fork, I still need Travis tests to be set up. Any progress on this? Test data is available here: https://github.com/snakemake-workflows/ngs-test-data; you can include it as a submodule, analogous to https://github.com/snakemake-workflows/rna-seq-star-deseq2.

Ok, I just pushed the latest version (0.31) and I can now go on and use travis.

I need some test data that is specific for single cell. I'll try to make one similar to the one you have for bulk NGS.

Ok! Note that the testing is basically only for checking that the tools and steps work. It is not a benchmark, so everything can be very small.
Also, I have just seen that the main Snakefile still contains a lot of redundant code. If you really want all these subtargets, try to share the expand invocations between them and the rule all. Usually, it is also sufficient to just list the final plots and tables in the all rule, and definitely not intermediate files like the bam files.

Not sure what you mean by "share the expand invocations between them".

Travis seems pretty straightforward to use. The only mystery to me is the submodule for the test data.
Any tips for doing it ASAP?

I mean that if you have the same expand statement twice (in two rules), just define a variable at the top, and refer to that variable from the rules:

bams = expand("...", ...)

rule all:
    input:
        bams, plots, ...
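
As a concrete (hypothetical) sketch with made-up target patterns, the same lists can then feed both the all rule and a subtarget rule:

plots = expand('plots/{sample}_yield.pdf', sample=samples.index)
reports = expand('reports/{sample}_multiqc.html', sample=samples.index)

rule all:
    input:
        plots,
        reports

rule plots_only:
    input:
        plots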

The idea with the git submodule is that the test data becomes reusable between different repositories. Further, when people checkout your repo, they don't need to check it out together with the test data, because that is in a submodule which is only checked out on request. See here: https://github.com/snakemake-workflows/rna-seq-star-deseq2/tree/master/.test
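
For completeness, checking the test data out on request is then just the standard submodule commands (the clone URL is a placeholder):

git clone https://github.com/<user>/dropSeqPipe.git
cd dropSeqPipe
git submodule update --init --recursive  # fetches the test-data submodule only now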

Any update? Can I help?

Yes, I have made a repo for the test data here
Just forked yours and changed it up a bit.

Those past two weeks were difficult and I didn't have much time to work on including the test data.

If you can help me out in terms of how I should add the submodule, I think it should not be that much more work to run it properly.

I don't think you need to fork it if you don't need any real changes. All you need to do is to issue

mkdir -p .test/data
git submodule add https://github.com/snakemake-workflows/ngs-test-data.git

in your repo clone.
Then, you commit the changes and push. This should be enough. For details, see here.

I needed to change the reads because I need some UMI and cell barcodes in read1.
That seems super easy, gonna try it out shortly.

I see. Maybe you can add the modified reads as additional samples in the NGS test data repo (as a pull request)?

I think there is still a problem. You have fastq file paths in units.tsv; I just have sample names in samples.csv. dropSeqPipe will look for files in data/ as data/{sample_name}_R1.fastq.gz
That's what I already did on the forked one.
And this is just one test case. There will also be cases for whitelisted barcodes and double species.

It would be really nice to work on one single repo for testing data. Would be useful for other people as well.

A "simple" fix is that I add a default path for data in the config.yaml. This would let it be flexible enough for the whole pipeline and allow a default path for testing purposes.

Ok, I'm trying to build on Travis, but it doesn't "see" the travis_integration branch. I did use a safelist and I did run a "failed" build on the master branch.
Have you had similar problems? Beyond the safelist and the zero builds on the GitHub project, I haven't found any new clues.

The build is a bit messy with moving files from .test/data, but it should work once the build starts.

Once you have registered the repo in travis, it is usually sufficient to push a new commit to the branch. This should then trigger the build.

Ok, travis implemented!

Nice! I have added some comments to your commit.

Ok, I've made the requests and it ran through :)

But this leaves me with a question.

Are there two different relative paths for data and rules/scripts?

Since only the .test folder is redirected, it will take this path as the default path for the pipeline to run. But all the rules and scripts are in a folder above .test. Hence my question above.

When you use --directory, it will work in the given path, but still take the Snakefile and any additional source files relative to the directory from where you invoke snakemake.
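
So a test invocation from the repository root could look like this (a sketch, assuming the test config and data live in .test):

# Snakefile, rules/ and scripts/ are taken relative to the repo root,
# while all input/output paths are resolved inside .test
snakemake --use-conda --directory .test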

Apart from that, there seem to be only two items missing for inclusion that I did not notice before (sorry that this takes so long). First, we need drop-seq-tools in Bioconda, so that no additional setup is required. Are there any particular problems that make this impossible? Second, it would be nice to have a README that directly contains stepwise usage instructions following the other workflows we have here. Your approach is in principle fine as well, but it is easier for users to browse snakemake-workflows if all pipelines follow the same pattern. Would that be agreeable to you?

I'm here to learn!

This might take some time then. The developer has no problem with Drop-seq tools being on Conda; I already asked him.

I just don't know Java and hence have no clue how to do this properly.
I will try to find someone who could help me out. If you know someone, please let me know :)

You can take the picard recipe as a blueprint: https://github.com/bioconda/bioconda-recipes/blob/f5eb63e30a76fd13c28663786d219c9f7750267c/recipes/picard/meta.yaml
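
Very roughly, the meta.yaml follows the usual conda-build layout; here is a sketch with placeholder values (not the actual Drop-seq recipe; the build.sh that installs the jar and wrapper scripts, and the test section, are omitted):

package:
  name: dropseq-tools      # placeholder name
  version: "1.13"          # placeholder version

source:
  url: https://example.org/Drop-seq_tools-1.13.zip  # placeholder URL
  sha256: <checksum of the archive>

build:
  number: 0

requirements:
  run:
    - openjdk >=8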

Let me know if problems occur. I will try to help as fast as possible.

Did you have success with the conda recipe?

I have started going down the rabbit hole, yes, but I'm clearly not there yet. Maybe you have a hint for me: I would like to use conda skeleton, but there is none for Java code. Which one should I use?

Sorry for the late reply. You can follow the Bioconda docs for Java: http://bioconda.github.io/guidelines.html#java

Hey, someone uploaded Drop-seq tools to Conda for us!!
Is there anything left to do to validate dropSeqPipe as a workflow?

Sorry for the late reply. The last weeks were pretty busy.

  1. It would be nice if you could move the changelog to a separate file and try to harmonize the README with the rest of the workflows in snakemake-workflows. This makes it easier for users to understand how to use the workflow.
  2. Please ensure that all Snakefiles are indented with 4 spaces, not 8.
  3. Add a .gitattributes file like this for syntax highlighting on GitHub (a sketch follows below).
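
For reference, the .gitattributes used in other snakemake-workflows repositories is essentially just (sketch):

# tell GitHub's linguist to highlight Snakemake files as Python
Snakefile linguist-language=Python
*.smk linguist-language=Python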

Afterwards, I think we are ready to fork.

It's been a long time, sorry. I got busy with other features for the pipeline.
I've changed a few things though: the docs are now generated via MkDocs directly on gh-pages.
The branch I'm working on for the formatting and workflow standards is this one

Do you think it is time to fork now?

Yes. Release 0.4 just came out! Very happy about it.

Thank you! Forked and announced on Twitter. Really great work!

Thank you too! This wouldn't be possible without your many contributions!

Could you also put TUM in my affiliations? If only one is possible, then TUM, because that is where my main PI is.

Sure, done. Sorry, I did not see that. Feel free to create a PR to update your affiliations, maybe also adding your preferred homepage link.