NationalGenomicsInfrastructure / piper

A genomics pipeline built on top of the GATK Queue framework

Add snpEff

ewels opened this issue

It would be great if we could get GATK to run snpEff by adding the -o gatk option (see the snpEff documentation).

This is fairly urgent as the report scripts look for the snpEff output. Thanks!

Yes - this was already on the todo-list, but I'll get on to all of this stuff asap.

Great, thanks! For now we're just running this manually. If you have any suggestions for where we should run it or what we should call stuff, let us know :)

I've started to look into this now @ewels. A question: do you think that adding the --snpEffFile <snp_eff_output_file> option to the VariantAnnotator would be enough, as compared to running SnpEff in stand-alone mode? The docs for this are available here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php

The main drawback seems to be that this only reports "the most biologically significant effect listed". Is this enough? Or do you think it's better to report as much as possible?

Hmm, have I misunderstood how this works? I was thinking that we could add a command line argument and this would trigger GATK to run SnpEff, but re-reading the docs now it sounds more like it's about getting GATK to annotate VCF files with SnpEff annotation? So you still have to run SnpEff on its own anyway?

Anyway, in general I think it's best to report as much as possible for this; the only advantage of the parameter thing was to cut down on workload. How much work is it to add SnpEff to the pipeline?

If I understand correctly you generate a resource file with SnpEff for GATK to use. So you have to run it once - and then you use that file.

However, it shouldn't be too much work to add SnpEff to the pipeline itself. I'm just doing my research at the moment. 😄
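The two-step workflow described above (run SnpEff once, then point GATK at the resulting file) could be sketched roughly as below. This is only an illustration, not the Piper implementation: the class name, jar paths, file names and the genome database name are all hypothetical, and the exact SnpEff invocation may differ between versions. Only -o gatk and --snpEffFile come from this thread; the remaining GATK arguments are assumed from the VariantAnnotator documentation. Note that SnpEff writes its annotated output to standard output, so a caller would redirect it to the file that is later passed as --snpEffFile.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: builds the two command lines discussed in this thread.
// All file names, jar paths and the genome database name are hypothetical.
class SnpEffThenGatk {

    // Step 1: stand-alone SnpEff run producing GATK-compatible output (-o gatk).
    // SnpEff prints the result to stdout; the caller is expected to redirect it.
    static List<String> snpEffCommand(String snpEffJar, String genomeDb, String inputVcf) {
        return new ArrayList<>(Arrays.asList(
            "java", "-jar", snpEffJar,
            "-o", "gatk",
            genomeDb, inputVcf));
    }

    // Step 2: GATK VariantAnnotator reading the SnpEff output via --snpEffFile.
    static List<String> variantAnnotatorCommand(String gatkJar, String reference,
                                                String inputVcf, String snpEffVcf,
                                                String outputVcf) {
        return new ArrayList<>(Arrays.asList(
            "java", "-jar", gatkJar,
            "-T", "VariantAnnotator",
            "-R", reference,
            "-A", "SnpEff",
            "--variant", inputVcf,
            "--snpEffFile", snpEffVcf,
            "-o", outputVcf));
    }

    public static void main(String[] args) {
        // Print both command lines so they can be inspected before running anything.
        System.out.println(String.join(" ",
            snpEffCommand("snpEff.jar", "GRCh37.75", "calls.vcf")));
        System.out.println(String.join(" ",
            variantAnnotatorCommand("GenomeAnalysisTK.jar", "reference.fasta",
                "calls.vcf", "calls.snpeff.vcf", "calls.annotated.vcf")));
    }
}
```

Keeping the commands as argument lists rather than single strings avoids quoting problems if they are later handed to something like ProcessBuilder.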

Yeah - so if we have to run SnpEff anyway, then I guess we can just leave the GATK step alone?

Even better! Thanks..

Have you been using the snpEff_summary.html to extract info for the reports or do you want me to run with the -csvStats option?

I have been running with the -csvStats option. And yes, we are using snpEff_summary.html to extract info.

Ok - there is one problem with that. If I run with the -csvStats option there is no snpEff_summary.html generated. So unfortunately it does not seem possible to get both. What would you guys prefer?

I see my answer was a bit unclear. When running snpEff, run it without the -csvStats option so that the HTML file is generated. Sorry for the confusion. I have also tried to generate the PDF file, but it seems it's not possible to generate the PDF and HTML files in the same run.

Cool! Then I get it. 😄

We actually scrape both currently as there's one bit of info in the HTML which isn't in the CSV. Very annoying :/

On 8 Jan 2015, at 17:01, Johan Dahlberg notifications@github.com wrote:

Cool! Then I get it.

I'm not sure I follow. Am I to understand that you use both files? If so, which one is the most critical for you to get? It doesn't feel very neat to run the program twice with different arguments to get both, but if that's the only way around it (short of patching snpEff), I guess we might have to do that. What's your preference, @ewels and @parlundin?

@ewels knows this best. I thought the HTML was enough for that, but when I look back at my last snpEff run I have both the CSV file and the HTML. So, my bad. At least snpEff is fairly fast to run.

I'm experimenting with different options now to see if I can get it to generate both files at the same time.

Well I've now looked at the snpEff code and concluded that it's not possible to get both summaries at the same time:

        if (createSummary && (summaryFile != null)) {
            // Creates a summary output file
            if (verbose) Timer.showStdErr("Creating summary file: " + summaryFile);
            if (createCsvSummary) ok &= summary(SUMMARY_CSV_TEMPLATE, summaryFile, true);
            else ok &= summary(SUMMARY_TEMPLATE, summaryFile, false);

            // Creates genes output file
            if (verbose) Timer.showStdErr("Creating genes file: " + summaryGenesFile);
            ok &= summary(SUMMARY_GENES_TEMPLATE, summaryGenesFile, true);
        }

So my preference would be that you guys settle for either the CSV or the HTML and tell me which one you prefer. Then I will start adding snpEff to the WGS pipeline.

@parlundin and @ewels - I've now written (I think) most of the code necessary to add snpEff into the WGS pipeline. However, I need feedback from you guys on whether you want the CSV or the HTML file before I can proceed. I hope to be able to test this next week and have a release out the week after that (but considering that the next few weeks will be extremely busy I'm not 100% sure I will make that deadline).
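Since the branch quoted above writes either the CSV or the HTML summary in a given run, getting both means invoking snpEff once per format. A minimal sketch of that fallback follows; it is not what Piper ended up doing. The class and method names, jar path, genome database name and input file are hypothetical, and -csvStats is treated as a plain switch as described in this thread (other snpEff versions may handle it differently). Each run would also need its own working directory or summary file name so that the second run does not overwrite the first; that part is left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: plans one snpEff invocation per requested summary format.
// Jar path, genome database name and input file are hypothetical.
class SnpEffSummaryRuns {

    // Returns zero, one or two command lines depending on which summaries are wanted.
    static List<List<String>> planRuns(boolean wantCsv, boolean wantHtml,
                                       String snpEffJar, String genomeDb, String inputVcf) {
        List<List<String>> runs = new ArrayList<>();
        // Without -csvStats the HTML summary is produced.
        if (wantHtml) runs.add(command(snpEffJar, genomeDb, inputVcf, false));
        // With -csvStats the CSV summary is produced instead of the HTML one.
        if (wantCsv) runs.add(command(snpEffJar, genomeDb, inputVcf, true));
        return runs;
    }

    private static List<String> command(String jar, String db, String vcf, boolean csv) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jar);
        if (csv) cmd.add("-csvStats");
        cmd.add(db);
        cmd.add(vcf);
        return cmd;
    }

    public static void main(String[] args) {
        // Print the planned command lines, one per line.
        for (List<String> run : planRuns(true, true, "snpEff.jar", "GRCh37.75", "calls.vcf")) {
            System.out.println(String.join(" ", run));
        }
    }
}
```

Choosing a single format, as suggested above, simply corresponds to setting one of the two flags to false.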

Sorry for the delay in replying, I'm away at the moment. CSV is the best if we have to choose. Ideally SnpEff should put the missing field (that's in the HTML) into the CSV, maybe we can email the authors about that.

CSV for now is good though.

Phil

On 9 Jan 2015, at 16:55, Johan Dahlberg notifications@github.com wrote:

@parlundin and @ewels - I've now written (I think) most of the code necessary to add snpEff into the WGS pipeline. However, I need feedback from you guys on whether you want the CSV or the HTML file before I can proceed. I hope to be able to test this next week and have a release out the week after that (but considering that the next few weeks will be extremely busy I'm not 100% sure I will make that deadline).

Ok. Got it. 👍