uab-cgds-worthey / DITTO

Variant Deleteriousness prediction tool using AI

VCF annotation using VEP - [merged]

ManavalanG opened this issue · comments

Merges vep_annotation -> master

Annotates variants in VCF using Variant Effect Predictor (VEP).

The following may need review at some level. Note that this list is not exhaustive; other parts may need review as well.

  • Snakemake pipeline
  • Source code directory structure. It may be a good idea to spend some time on this so the repo with several moving parts will remain manageable.
  • Documentation. I kept it simple; let me know if it needs more info.
  • Datasets used and their versions

added 1 commit

  • cc59824e - updates clinvar path

added 1 commit

  • b18b4edf - adds annotated test vcf

We currently obtain PolyPhen and SIFT scores via dbNSFP. However, VEP can annotate them natively, without extra work, via the options --sift and --polyphen. The VEP developers recommend this as well, since dbNSFP includes only the non-synonymous variants.

Should we switch?

In GitLab by @wilkb777 on Jan 27, 2021, 15:35

Yes, that's totally fine and I think we should use those options. The originating source for that info doesn't have to be dbNSFP; it just happens to be the most convenient way to obtain that info in a bulk download.
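For reference, a minimal sketch of what enabling VEP's native predictions might look like. The --sift and --polyphen options are VEP's own; the file names and the use of an offline cache are assumptions for illustration, not the pipeline's actual settings.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names, for illustration only.
input_vcf="input.vcf.gz"
output_vcf="annotated.vcf"

# Build the command as an array so every argument stays a single word.
vep_cmd=(
    vep
    --input_file "$input_vcf"
    --output_file "$output_vcf"
    --vcf               # write the annotations in VCF format
    --cache --offline   # assumption: a local VEP cache is installed
    --sift b            # b = report both the prediction term and the score
    --polyphen b        # b = report both the prediction term and the score
)

# Print the command for inspection. Run it with "${vep_cmd[@]}" once vep is on PATH.
printf '%s ' "${vep_cmd[@]}"
printf '\n'
```

Both options also accept p (prediction term only) or s (score only) in place of b.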

In GitLab by @wilkb777 on Jan 28, 2021, 09:14

Commented on variant_annotation/configs/env/vep.yaml line 7

  1. Is this particular version of BCFTools needed? I've been using 1.10.2, which has a big bug fix (https://github.com/samtools/bcftools/releases/tag/1.10.2), and the 1.10 release has a bunch of bug fixes in it too. 1.11 is available now as well, but that mostly appears to be a feature enhancement release.
  2. Is this particular version of tabix needed? BCFTools comes with tabix because it uses tabix indexes for some of its processing commands. I'd recommend using the version of tabix packaged with it to avoid issues, unless a specific version is absolutely needed.

added 7 commits

  • 665db894 - fixes resources config bug
  • c06137e4 - uses VEP's sift and polyphen; changes threads to 8
  • 68723b88 - adds warning file
  • 5b5bec96 - breaks long strings to multi-lines
  • b737497d - bumps cluster partition; update test output
  • d8368088 - bgzips output vcf
  • 5a982790 - updates conda env

  1. bcftools v1.11 has a conflict with VEP; after noticing this, I just went back to my older setting, which was v1.9. I tested v1.10.2 just now and it works without conflict.
  2. Good catch. Checking it now, even VEP includes tabix, which I hadn't realized, as their documentation recommends installing tabix separately. The Bioconda recipe probably just chose to include it, I guess.

Changed them now.

Also made these changes recently:

  • Bumped annotation to use 8 instead of 4 threads (#1)
  • Made the output file bgzipped - as unzipped files are massive (~12GB) and they may be I/O expensive
  • Upgraded cluster partition to use
  • Style updates and minor improvements.

In GitLab by @wilkb777 on Jan 28, 2021, 12:29

Commented on variant_annotation/configs/env/vep.yaml line 7

Nice, I figured it was something like that but thought it was worth checking.

In GitLab by @wilkb777 on Jan 28, 2021, 12:29

resolved all threads

In GitLab by @wilkb777 on Jan 28, 2021, 17:05

Commented on variant_annotation/src/Snakefile line 136

            | bcftools view -Oz \

switch to using bcftools to generate bgzip output

In GitLab by @wilkb777 on Jan 28, 2021, 17:05

Commented on variant_annotation/src/Snakefile line 137

remove as bcftools can do compression directly
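A minimal sketch of the suggestion above: letting bcftools write bgzip-compressed output itself instead of piping through a separate compression step. The function name and file names are hypothetical; only the bcftools options (-Oz, -o, and index --tbi) are real.

```shell
#!/usr/bin/env bash
set -euo pipefail

# compress_vcf OUT: read an uncompressed VCF on stdin and write it
# bgzip-compressed to OUT, then build a tabix (.tbi) index next to it.
# "-Oz" selects compressed VCF output, so no separate bgzip call is needed.
compress_vcf() {
    local out_vcf="$1"
    bcftools view -Oz -o "$out_vcf" -
    bcftools index --tbi "$out_vcf"
}

# Hypothetical usage, with some_annotator standing in for the upstream step:
#   some_annotator input.vcf | compress_vcf annotated.vcf.gz

if ! command -v bcftools >/dev/null 2>&1; then
    echo "bcftools not found on PATH; compress_vcf is defined but was not run" >&2
fi
```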

In GitLab by @wilkb777 on Jan 28, 2021, 17:06

Commented on variant_annotation/src/Snakefile line 115

        bcftools view {input.calls} | \

removing accidental parentheses

In GitLab by @wilkb777 on Jan 28, 2021, 17:06

Commented on variant_annotation/src/Snakefile line 138

            > {output.calls}

removing accidental parentheses

In GitLab by @wilkb777 on Jan 28, 2021, 17:12

marked the checklist item Source code directory structure. It may be a good idea to spend some time on this so the repo with several moving parts will remain manageable. as completed

In GitLab by @wilkb777 on Jan 28, 2021, 17:12

marked the checklist item Datasets used and their versions as completed

In GitLab by @wilkb777 on Jan 28, 2021, 17:17

@ManavalanG I made #3 to remind us to consolidate the repo structure once major components are all merged.

In GitLab by @wilkb777 on Jan 28, 2021, 17:21

I think it'd be best to move datasets.yaml out of this repo and make it a user config file with a hardcoded path like ~/.ditto_datasets.yaml, and then lay out instructions on its format in the README. That way we won't be sharing any of our internal lab file structure when making the repo public.
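A minimal sketch of how a run script could look up such a per-user config, assuming the ~/.ditto_datasets.yaml path proposed above. The function name is hypothetical, and the demonstration uses a throwaway HOME so nothing real is touched.

```shell
#!/usr/bin/env bash
set -euo pipefail

# find_datasets_config: print the path of the per-user datasets config,
# or fail with a message on stderr if the file does not exist.
find_datasets_config() {
    local config="${HOME}/.ditto_datasets.yaml"
    if [[ ! -f "$config" ]]; then
        echo "ERROR: datasets config not found at ${config}; see the README for its format" >&2
        return 1
    fi
    printf '%s\n' "$config"
}

# Demonstration in a temporary HOME directory.
demo_home="$(mktemp -d)"
touch "${demo_home}/.ditto_datasets.yaml"
found="$(HOME="$demo_home" find_datasets_config)"
echo "using datasets config: ${found}"
```

Failing early with a pointer to the README keeps a missing config from surfacing later as an obscure pipeline error.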

In GitLab by @wilkb777 on Jan 28, 2021, 17:27

I'm torn on how much info about the custom datasets needs to be distributed. Specifically the custom formatting done for GERP and dbNSFP usage. I do not think we need to go crazy on that, but maybe just put a short description on how others could produce them. Thoughts @ManavalanG ?

In GitLab by @wilkb777 on Jan 28, 2021, 17:28

I guess it would be good to add version numbers of external datasets to the README with this change, so we know what version was used when and the expected file format.

In GitLab by @wilkb777 on Jan 28, 2021, 17:30

@ManavalanG I'm done with the initial review! I'll re-review when the Snakefile is updated to make the input VCF configurable, and then ping again when I change the run script to handle command-line specification of the input VCF and local vs slurm job execution.

added 2 commits

  • 458bd8dc - refactors vep to directly write bgzipped output file
  • 4d17bff0 - accepts input vcf via config

Switched to VEP writing output file directly.

This part is refactored now.

when the update to the Snakefile is made to make the input VCF configurable

This part is done now.

Yeah that sounds good to me.

  • dbNSFP formatting we adopted is quite similar to what VEP (plugin) folks suggested. So I think we can just point to that.
  • For GERP, we can just mention the command used for processing.

added 3 commits

  • e18867d2 - adds test output
  • e77ea1f0 - inputs datasets config via cli
  • 5531cfd2 - downgrades cluster partition for annotation

I agree with the first part and I like the idea. Now it needs to be supplied to snakemake via CLI.
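A minimal sketch of supplying the datasets config on the command line. The snakemake options --snakefile, --configfile, --config and --cores are real; the config key vcf and the file names are assumptions for illustration and may not match what the Snakefile expects.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs, for illustration only.
datasets_config="datasets.yaml"
input_vcf="sample.vcf.gz"

# Build the command as an array so paths with spaces stay intact.
snakemake_cmd=(
    snakemake
    --snakefile variant_annotation/src/Snakefile
    --configfile "$datasets_config"   # dataset paths live outside the repo
    --config vcf="$input_vcf"         # "vcf" is an assumed config key
    --cores 8
)

# Print the command for inspection. Run it with "${snakemake_cmd[@]}".
printf '%s ' "${snakemake_cmd[@]}"
printf '\n'
```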

For the second part involving the README, I'm conflicted. My worry is that it is easy to forget to update the README as and when changes are made. But I do see the value of storing it somewhere. Is there any other way we can track this?

added 1 commit

  • a5d130ce - removes unused code

How about using symlinked files instead? We can store relevant info this way but don't have to remember to update readme.

In GitLab by @wilkb777 on Jan 29, 2021, 08:13

I'm an idiot and don't know why I put "we" in the above comment. I was thinking of documenting the version along with the formatting (i.e. default file versus custom format) at a basic level, for making the repo public and making the work reproducible for others. I should've just combined this with the comment about custom datasets documentation.

In GitLab by @wilkb777 on Jan 29, 2021, 09:31

added 1 commit

  • 7f5c3d3c - making changes to add CLI specification of input info

In GitLab by @wilkb777 on Jan 29, 2021, 09:34

added 1 commit

  • a098b533 - fixing stupid mistake in CLI arg check

This is nice. We may also want to add another arg that can be used to pass custom args to the snakemake command, for example -n, --unlock, etc.
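One common way to do this in bash is to treat everything after a literal -- as arguments for the wrapped command. The sketch below is hypothetical and is not how run_pipeline.sh is written; the function and array names are made up for illustration.

```shell
#!/usr/bin/env bash
set -eo pipefail

wrapper_args=()      # arguments meant for the run script itself
snakemake_extra=()   # arguments forwarded to snakemake untouched

# split_args ARGS...: split the argument list at the first literal "--".
split_args() {
    wrapper_args=()
    snakemake_extra=()
    while (($# > 0)); do
        if [[ "$1" == "--" ]]; then
            shift
            snakemake_extra=("$@")
            return 0
        fi
        wrapper_args+=("$1")
        shift
    done
}

# Example: one argument for the wrapper, two passed through to snakemake.
split_args input.vcf.gz -- -n --unlock
echo "wrapper args:   ${wrapper_args[*]}"
echo "snakemake args: ${snakemake_extra[*]}"
```

The forwarded array can then be appended to the snakemake call as "${snakemake_extra[@]}", which keeps each forwarded argument as a separate word.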

I would recommend adding set -euo pipefail (or similar) to catch some unexpected errors upfront.
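A small demonstration of what these options catch. Each case runs in a child bash process so that the demonstration itself keeps going; the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# -e: exit when a command fails; -u: treat unset variables as errors;
# -o pipefail: a pipeline fails if any of its stages fails.
set -euo pipefail

# Without pipefail, only the last command of a pipeline decides its status.
if bash -c 'false | true'; then plain="succeeds"; else plain="fails"; fi

# With pipefail, a failure in any stage fails the whole pipeline.
if bash -c 'set -o pipefail; false | true'; then strict="succeeds"; else strict="fails"; fi

# With -u, expanding an unset variable is an error instead of an empty string.
if bash -c 'set -u; echo "$not_defined"' 2>/dev/null; then unset_var="succeeds"; else unset_var="fails"; fi

echo "plain pipeline: ${plain}"
echo "with pipefail:  ${strict}"
echo "unset with -u:  ${unset_var}"
```

In a pipeline script that chains tools with pipes, pipefail is the option that keeps a failing upstream stage from being masked by a succeeding downstream stage.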

In GitLab by @wilkb777 on Jan 29, 2021, 11:15

Commented on variant_annotation/src/run_pipeline.sh line 21

Let's make that a stretch goal. I get the benefits of it, but don't fully know the complications of implementing it in bash right now. For now we can do this by hand with Snakemake if really necessary, and that's good enough to get by. Does that work?

In GitLab by @wilkb777 on Jan 29, 2021, 11:16

added 1 commit

  • 2c93726a - minor updates based on recommendations, added some dataset info to readme

In GitLab by @wilkb777 on Jan 29, 2021, 11:16

Commented on variant_annotation/src/run_pipeline.sh line 7

good point, added it.

Sounds good.

As per discussion with @wilkb777, the dataset config in the README is meant to serve only as an example of the format required for this config, and not as documentation of which datasets (or their versions) were used by the pipeline.

@wilkb777 - Feel free to add more info if needed. Closing this now :)

In GitLab by @wilkb777 on Jan 29, 2021, 12:05

marked the checklist item Snakemake pipeline as completed

In GitLab by @wilkb777 on Jan 29, 2021, 12:07

I added a section to the README now, give it a look and let me know if you think it's good enough for this

 into a single file, bgzipped and indexed.

In GitLab by @wilkb777 on Jan 29, 2021, 12:07

ok, the CLI has been added now so I'll resolve the thread.

In GitLab by @wilkb777 on Jan 29, 2021, 12:08

Commented on variant_annotation/README.md line 47

changed this line in version 11 of the diff

In GitLab by @wilkb777 on Jan 29, 2021, 12:08

added 1 commit

  • a25bf724 - Apply 1 suggestion(s) to 1 file(s)

resolved all threads

In GitLab by @wilkb777 on Jan 29, 2021, 12:24

marked the checklist item Documentation. I kept it simple; let me know if it needs more info. as completed

In GitLab by @wilkb777 on Jan 29, 2021, 12:24

approved this merge request

In GitLab by @wilkb777 on Jan 29, 2021, 12:24

marked this merge request as ready

In GitLab by @wilkb777 on Jan 29, 2021, 12:27

added 1 commit

  • 82a84ce1 - untracking unused file

In GitLab by @wilkb777 on Jan 29, 2021, 12:38

added 37 commits

  • 96dd259 - refactors snakefile to standardize a bit
  • abde556 - adds cluster config
  • 0a3416b - adds snakemake slurm profile as git submodule
  • 77e32c5 - adds snakemake slurm profile as submodule
  • 4fc35d5 - removes pipeline specific logs
  • 9a0de28 - updates snakemake command and resources
  • 2696afd - fixes filepaths
  • ffabcee - fix to avoid conda env conflict
  • 3188ac6 - bumps up ref version to use
  • 2b16f73 - updates gnomad fields
  • 0613055 - adds test vcf, thanks to Brandon
  • b16b88e - updates clinvar fields
  • 70b8b06 - changes case for readability
  • f8a6c79 - annotations with dbnfsp
  • d74d4f5 - adds more dbnsfp fields
  • 0bd2f7a - changes to improve speed
  • c67392b - fiexes minor bug
  • 6e29733 - cleans up dir structure and doc
  • 274ba37 - updates doc
  • 4053914 - adds annotated test vcf
  • f107d64 - fixes resources config bug
  • 635a5d1 - uses VEP's sift and polyphen; changes threads to 8
  • f9714ad - adds warning file
  • ddbc558 - breaks long strings to multi-lines
  • dc2ad99 - bumps cluster partition; update test output
  • de29e7b - bgzips output vcf
  • fb0d890 - updates conda env
  • ffdd618 - refactors vep to directly write bgzipped output file
  • dc79fdd - accepts input vcf via config
  • ba9add1 - adds test output
  • 6fae0cc - inputs datasets config via cli
  • 5bd9ab7 - downgrades cluster partition for annotation
  • eb16285 - removes unused code
  • c420c81 - making changes to add CLI specification of input info
  • 4b354a7 - fixing stupid mistake in CLI arg check
  • 92113ce - minor updates based on recommendations, added some dataset info to readme
  • 78a485c - Apply 1 suggestion(s) to 1 file(s)

mentioned in commit 624418a