Add snpEff

ewels opened this issue 10 years ago · comments

Phil Ewels commented 10 years ago

It would be great if we could get GATK to run snpEff by adding the -o gatk option (see the snpEff documentation).

This is fairly urgent as the report scripts look for the snpEff output. Thanks!

Johan Dahlberg commented 9 years ago

Yes.

Johan Dahlberg commented 9 years ago

Done!

Johan Dahlberg · Answer 1 · Fri Nov 28 2014 18:34:32 GMT+0800 (China Standard Time)

Yes - this was already on the todo-list, but I'll get on to all of this stuff asap.

Phil Ewels · Answer 2 · Fri Nov 28 2014 18:37:59 GMT+0800 (China Standard Time)

Great, thanks! For now we're just running this manually. If you have any suggestions for where we should run it or what we should call stuff, let us know :)

Johan Dahlberg · Answer 3 · Wed Jan 07 2015 20:39:42 GMT+0800 (China Standard Time)

I've started to look into this now @ewels. A question: do you think that adding the --snpEffFile <snp_eff_output_file> argument to the VariantAnnotator would be enough, as compared to running SnpEff in stand-alone mode? The docs for this are available here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php

The main drawback seems to be that this only reports "the most biologically significant effect listed". Is this enough? Or do you think it's better to report as much as possible?
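For readers following along, the two-step approach being asked about can be sketched roughly as below: SnpEff is first run on its own with -o gatk, and its output is then passed to VariantAnnotator through --snpEffFile. This is only an illustrative sketch, not the pipeline's actual code. The jar names, the genome database name, and all file names are hypothetical placeholders, and the flags assume a GATK 3-style command line.

```python
import shlex


def snpeff_gatk_command(genome_db, input_vcf, snpeff_jar="snpEff.jar"):
    """Build a SnpEff command that emits GATK-compatible output (-o gatk).

    SnpEff writes the annotated VCF to stdout, so the caller has to
    redirect stdout to a file when actually running this.
    """
    return ["java", "-jar", snpeff_jar, "eff", "-o", "gatk", genome_db, input_vcf]


def variant_annotator_command(reference, input_vcf, snpeff_vcf, output_vcf,
                              gatk_jar="GenomeAnalysisTK.jar"):
    """Build a GATK 3-style VariantAnnotator command that reads the SnpEff output."""
    return [
        "java", "-jar", gatk_jar,
        "-T", "VariantAnnotator",
        "-R", reference,
        "-A", "SnpEff",                # request the SnpEff annotation
        "--variant", input_vcf,
        "--snpEffFile", snpeff_vcf,    # file produced by the SnpEff step
        "-o", output_vcf,
    ]


if __name__ == "__main__":
    # All names below are placeholders chosen for illustration only.
    step1 = snpeff_gatk_command("GRCh37.75", "calls.vcf")
    step2 = variant_annotator_command("ref.fasta", "calls.vcf",
                                      "calls.snpeff.vcf", "calls.annotated.vcf")
    for cmd in (step1, step2):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Either way SnpEff has to be run separately first; the GATK step only copies one effect per variant from that output into the VCF.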

Phil Ewels · Answer 4 · Wed Jan 07 2015 21:24:01 GMT+0800 (China Standard Time)

Hmm, have I misunderstood how this works? I was thinking that we could add a command line argument and this would trigger GATK to run SnpEff, but re-reading the docs now it sounds more like it's about getting GATK to annotate VCF files with SnpEff annotation? So you still have to run SnpEff on its own anyway?

Anyway, in general I think it's best to report as much as possible for this; the only advantage of the parameter thing was to cut down on workload. How much work is it to add SnpEff to the pipeline?

Johan Dahlberg · Answer 5 · Wed Jan 07 2015 21:49:49 GMT+0800 (China Standard Time)

If I understand correctly you generate a resource file with SnpEff for GATK to use. So you have to run it once - and then you use that file.

However, it shouldn't be too much work adding SnpEff to the pipeline itself. I'm just doing my research at the moment. 😄

Phil Ewels · Answer 6 · Wed Jan 07 2015 21:51:45 GMT+0800 (China Standard Time)

Yeah - so if we have to run SnpEff anyway, then I guess we can just leave the GATK step alone?

Phil Ewels · Answer 7 · Wed Jan 07 2015 22:33:28 GMT+0800 (China Standard Time)

Even better! Thanks..

Johan Dahlberg · Answer 8 · Thu Jan 08 2015 20:54:27 GMT+0800 (China Standard Time)

Have you been using the snpEff_summary.html to extract info for the reports or do you want me to run with the -csvStats option?

parlundin · Answer 9 · Thu Jan 08 2015 22:43:27 GMT+0800 (China Standard Time)

I have been running with the -csvStats option. And yes, we are using snpEff_summary.html to extract info.

Johan Dahlberg · Answer 10 · Thu Jan 08 2015 23:11:21 GMT+0800 (China Standard Time)

Ok - there is one problem with that. If I run with the -csvStats option there is no snpEff_summary.html generated. So unfortunately it does not seem possible to get both. What would you guys prefer?

parlundin · Answer 11 · Thu Jan 08 2015 23:13:21 GMT+0800 (China Standard Time)

I see my answer was a bit unclear. When running snpEff, run without -csvStats so that the HTML file is generated. Sorry for the confusion. I have also tried to generate the PDF file, but it seems it's not possible to generate the PDF and HTML files in the same run.

Johan Dahlberg · Answer 12 · Fri Jan 09 2015 00:01:50 GMT+0800 (China Standard Time)

Cool! Then I get it. 😄

Phil Ewels · Answer 13 · Fri Jan 09 2015 06:44:04 GMT+0800 (China Standard Time)

We actually scrape both currently as there's one bit of info in the HTML which isn't in the CSV. Very annoying :/

Johan Dahlberg · Answer 14 · Fri Jan 09 2015 16:37:11 GMT+0800 (China Standard Time)

I'm not sure I follow. Am I to understand that you use both files? If so, which one is the most critical for you to get? It doesn't feel very neat to run the program twice with different arguments to get both, but if that's the only way around it (short of patching snpEff), I guess we might have to do that. What's your preference, @ewels and @parlundin?
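As a rough illustration of what running the program twice would look like, the sketch below builds one SnpEff command for the HTML summary and one for the CSV summary. It assumes that -csvStats is a plain on/off flag, as it is used in this thread, and that -stats sets the summary file name. The arity of -csvStats may differ between SnpEff releases, so check the help output of the installed version. The jar name, database name, and file names are hypothetical.

```python
def snpeff_summary_commands(genome_db, input_vcf, snpeff_jar="snpEff.jar"):
    """Return (html_run, csv_run): two SnpEff invocations, one per summary format.

    Assumes -csvStats is a boolean flag and -stats takes the summary file name;
    verify both against the help text of your SnpEff version.
    """
    base = ["java", "-jar", snpeff_jar, "eff"]
    # First run: default HTML summary.
    html_run = base + ["-stats", "snpEff_summary.html", genome_db, input_vcf]
    # Second run: CSV summary instead of HTML, written under a different name.
    csv_run = base + ["-csvStats", "-stats", "snpEff_summary.csv", genome_db, input_vcf]
    return html_run, csv_run


if __name__ == "__main__":
    # Placeholder database and input names, for illustration only.
    for cmd in snpeff_summary_commands("GRCh37.75", "calls.vcf"):
        print(" ".join(cmd))
```

Both runs also write the annotated VCF to stdout, so the output of one of them can be discarded.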

parlundin · Answer 15 · Fri Jan 09 2015 17:00:59 GMT+0800 (China Standard Time)

@ewels knows this best, I thought that it was enough with the html for that. But when I look back at my last snpEff run I have both the csv file and the html. So my bad. At least snpEff is fairly fast to run.

Johan Dahlberg · Answer 16 · Fri Jan 09 2015 17:02:59 GMT+0800 (China Standard Time)

I'm experimenting with different options now to see if I can get it to generate both files at the same time.

Johan Dahlberg · Answer 17 · Fri Jan 09 2015 17:29:17 GMT+0800 (China Standard Time)

Well I've now looked at the snpEff code and concluded that it's not possible to get both summaries at the same time:

        if (createSummary && (summaryFile != null)) {
            // Creates a summary output file
            if (verbose) Timer.showStdErr("Creating summary file: " + summaryFile);
            if (createCsvSummary) ok &= summary(SUMMARY_CSV_TEMPLATE, summaryFile, true);
            else ok &= summary(SUMMARY_TEMPLATE, summaryFile, false);

            // Creates genes output file
            if (verbose) Timer.showStdErr("Creating genes file: " + summaryGenesFile);
            ok &= summary(SUMMARY_GENES_TEMPLATE, summaryGenesFile, true);
        }

So my preference would be that you guys settle on either the csv or the html and tell me which you prefer, and then I will start adding snpEff to the WGS pipeline.

Johan Dahlberg · Answer 18 · Fri Jan 09 2015 23:55:22 GMT+0800 (China Standard Time)

@parlundin and @ewels - I've now written (I think) most of the code necessary to add snpEff into the WGS pipeline. However, I need to get feedback from you guys about whether you want the csv or the html file before I can proceed. I hope to be able to test this next week and have a release out the week after that (but considering that the next few weeks will be extremely busy I'm not 100% sure I will make that deadline).

Phil Ewels · Answer 19 · Sat Jan 10 2015 00:20:00 GMT+0800 (China Standard Time)

Sorry for the delay in replying, I'm away at the moment. CSV is the best if we have to choose. Ideally SnpEff should put the missing field (that's in the HTML) into the CSV, maybe we can email the authors about that.

CSV for now is good though.

Phil

Johan Dahlberg · Answer 20 · Mon Jan 12 2015 17:26:02 GMT+0800 (China Standard Time)

Ok. Got it. 👍