VCF annotation using VEP - [merged]

ManavalanG opened this issue 4 years ago · comments

Manavalan Gajapathy commented 4 years ago

Merges vep_annotation -> master

Annotates variants in VCF using Variant Effect Predictor (VEP).

The following items may need review. This list is not exhaustive, so other parts may need review as well.

Snakemake pipeline
Source code directory structure. It may be a good idea to spend some time on this so the repo with several moving parts will remain manageable.
Documentation. I kept it simple; let me know if it needs more info.
Datasets used and their versions

Manavalan Gajapathy commented 4 years ago

Removed.

Manavalan Gajapathy commented 4 years ago

LG2M

Manavalan Gajapathy · Answer 1 · Wed Jan 27 2021 00:31:15 GMT+0800 (China Standard Time)

added 1 commit

cc59824e - updates clinvar path

Compare with previous version

Manavalan Gajapathy · Answer 2 · Wed Jan 27 2021 02:36:52 GMT+0800 (China Standard Time)

added 1 commit

b18b4edf - adds annotated test vcf

Compare with previous version

Manavalan Gajapathy · Answer 3 · Wed Jan 27 2021 02:47:19 GMT+0800 (China Standard Time)

We currently obtain PolyPhen and SIFT scores via dbNSFP. However, VEP can annotate them natively, without extra work, via the options --sift and --polyphen. They recommend as much, since dbNSFP includes only the non-synonymous variants.

Should we switch?

Manavalan Gajapathy · Answer 4 · Thu Jan 28 2021 05:35:35 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 27, 2021, 15:35

Yes, that's totally fine and I think we should use those options. The originating source for that info doesn't have to be dbNSFP; it just happens to be the most convenient way to obtain that info in a bulk download.
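
For reference, a minimal sketch of what a VEP call with native SIFT and PolyPhen annotation might look like. The file names, the thread count, and the choice of `b` (prediction term plus score) are assumptions for illustration, not the pipeline's actual settings. The function only prints the command so it can be inspected without VEP installed.

```shell
# Sketch only: prints a VEP command line instead of running it.
# Input/output paths and the thread count are hypothetical placeholders.
vep_sift_polyphen_cmd() {
    input_vcf=$1
    output_vcf=$2
    threads=${3:-4}
    # --sift and --polyphen accept p (prediction), s (score), or b (both).
    printf '%s\n' "vep --input_file $input_vcf --output_file $output_vcf --vcf --cache --offline --fork $threads --sift b --polyphen b"
}

vep_sift_polyphen_cmd input.vcf.gz annotated.vcf 8
```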

Manavalan Gajapathy · Answer 5 · Thu Jan 28 2021 23:14:16 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 09:14

Commented on variant_annotation/configs/env/vep.yaml line 7

Is this particular version of BCFTools needed? I've been using 1.10.2, which has a big bug fix (https://github.com/samtools/bcftools/releases/tag/1.10.2), and the 1.10 release has a bunch of bug fixes in it too. 1.11 is available now as well, but that mostly appears to be a feature enhancement release.
Is this particular version of tabix needed? BCFTools comes with tabix because it uses tabix indexes for some of its processing commands. I'd recommend using the version of tabix packaged with it to avoid issues, unless a specific version is absolutely needed.

Manavalan Gajapathy · Answer 6 · Fri Jan 29 2021 00:10:37 GMT+0800 (China Standard Time)

changed this line in version 4 of the diff

Manavalan Gajapathy · Answer 7 · Fri Jan 29 2021 00:10:38 GMT+0800 (China Standard Time)

added 7 commits

665db894 - fixes resources config bug
c06137e4 - uses VEP's sift and polyphen; changes threads to 8
68723b88 - adds warning file
5b5bec96 - breaks long strings to multi-lines
b737497d - bumps cluster partition; update test output
d8368088 - bgzips output vcf
5a982790 - updates conda env

Compare with previous version

Manavalan Gajapathy · Answer 8 · Fri Jan 29 2021 00:11:33 GMT+0800 (China Standard Time)

bcftools v1.11 conflicts with VEP, and after noticing this I just went back to my older setting, which was v1.9. I tested v1.10.2 just now and it works without conflict.
Good catch. Checking it now, even VEP includes tabix, which I didn't realize, as they recommend installing tabix separately in their documentation. The Bioconda recipe probably just chose to include it, I guess.

Manavalan Gajapathy · Answer 9 · Fri Jan 29 2021 00:13:45 GMT+0800 (China Standard Time)

Changed them now.

Manavalan Gajapathy · Answer 10 · Fri Jan 29 2021 02:25:27 GMT+0800 (China Standard Time)

Also made these changes recently:

Bumped annotation to use 8 instead of 4 threads (#1)
Made the output file bgzipped - as unzipped files are massive (~12GB) and they may be I/O expensive
Upgraded cluster partition to use
Style updates and minor improvements.

Manavalan Gajapathy · Answer 11 · Fri Jan 29 2021 02:29:42 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 12:29

Commented on variant_annotation/configs/env/vep.yaml line 7

Nice, I figured it was something like that but thought it was worth checking.

Manavalan Gajapathy · Answer 12 · Fri Jan 29 2021 02:29:42 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 12:29

resolved all threads

Manavalan Gajapathy · Answer 13 · Fri Jan 29 2021 07:05:08 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:05

Commented on variant_annotation/src/Snakefile line 136

            | bcftools view -Oz \

switch to using bcftools to generate bgzip output

Manavalan Gajapathy · Answer 14 · Fri Jan 29 2021 07:05:49 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:05

Commented on variant_annotation/src/Snakefile line 137

remove as bcftools can do compression directly
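
A minimal sketch of the suggestion in this thread: letting bcftools write bgzip-compressed output itself rather than piping into a separate compression step. The function name and file paths are hypothetical, and the trailing `tabix` indexing step is an addition for illustration, not something the review comment asked for.

```shell
# Sketch: write bgzip-compressed VCF directly from bcftools, then index it.
# Paths are hypothetical; requires bcftools and tabix on PATH when run.
compress_and_index() {
    in_vcf=$1
    out_vcf_gz=$2
    # -Oz selects compressed VCF output, so no separate bgzip step is needed.
    bcftools view -Oz -o "$out_vcf_gz" "$in_vcf" &&
        tabix -p vcf "$out_vcf_gz"
}

# Only run the example when the tools and the input file are present.
if command -v bcftools >/dev/null 2>&1 &&
   command -v tabix >/dev/null 2>&1 &&
   [ -f input.vcf ]; then
    compress_and_index input.vcf output.vcf.gz
fi
```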

Manavalan Gajapathy · Answer 15 · Fri Jan 29 2021 07:06:24 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:06

Commented on variant_annotation/src/Snakefile line 115

        bcftools view {input.calls} | \

removing accidental parentheses

Manavalan Gajapathy · Answer 16 · Fri Jan 29 2021 07:06:46 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:06

Commented on variant_annotation/src/Snakefile line 138

            > {output.calls}

removing accidental parentheses

Manavalan Gajapathy · Answer 17 · Fri Jan 29 2021 07:12:27 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:12

marked the checklist item Source code directory structure. It may be a good idea to spend some time on this so the repo with several moving parts will remain manageable. as completed

Manavalan Gajapathy · Answer 18 · Fri Jan 29 2021 07:12:49 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:12

marked the checklist item Datasets used and their versions as completed

Manavalan Gajapathy · Answer 19 · Fri Jan 29 2021 07:17:39 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:17

@ManavalanG I made #3 to remind us to consolidate the repo structure once major components are all merged.

Manavalan Gajapathy · Answer 20 · Fri Jan 29 2021 07:21:13 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:21

I think it'd be best to move datasets.yaml out of this repo and make it a user config file with a hardcoded path like ~/.ditto_datasets.yaml, and then lay out instructions on its format in the README. That way we won't be sharing any of our internal lab file structure when making the repo public.
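
A rough sketch of how a run script could resolve such a per-user config file. The function name and error message are hypothetical, and handing the file to Snakemake through `--configfile` is an assumption about how it might be wired up, not a description of the actual run script.

```shell
# Sketch: resolve the datasets config, defaulting to a per-user file.
# The function name and the error message are hypothetical.
resolve_datasets_config() {
    config_path=${1:-"$HOME/.ditto_datasets.yaml"}
    if [ ! -f "$config_path" ]; then
        echo "Datasets config not found: $config_path" >&2
        return 1
    fi
    printf '%s\n' "$config_path"
}

# Example: show how the resolved file could be passed to Snakemake.
if datasets=$(resolve_datasets_config); then
    echo "snakemake --configfile $datasets"
fi
```

Failing early with a clear message when the file is missing keeps lab-specific paths out of the repository while still telling a new user what to create.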

Manavalan Gajapathy · Answer 21 · Fri Jan 29 2021 07:27:42 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:27

I'm torn on how much info about the custom datasets needs to be distributed. Specifically the custom formatting done for GERP and dbNSFP usage. I do not think we need to go crazy on that, but maybe just put a short description on how others could produce them. Thoughts @ManavalanG ?

Manavalan Gajapathy · Answer 22 · Fri Jan 29 2021 07:28:34 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:28

I guess adding version numbers of external datasets would be good to put in the README with this change so we know what version was used when and the expected file format.

Manavalan Gajapathy · Answer 23 · Fri Jan 29 2021 07:30:40 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 28, 2021, 17:30

@ManavalanG I'm done with the initial review! I'll re-review when the update to the Snakefile is made to make the input VCF configurable and then ping again when I change the run script to handle commandline specification of input VCF and local vs slurm job execution.

Manavalan Gajapathy · Answer 24 · Fri Jan 29 2021 10:35:55 GMT+0800 (China Standard Time)

changed this line in version 5 of the diff

Manavalan Gajapathy · Answer 25 · Fri Jan 29 2021 10:35:56 GMT+0800 (China Standard Time)

changed this line in version 5 of the diff

Manavalan Gajapathy · Answer 26 · Fri Jan 29 2021 10:35:56 GMT+0800 (China Standard Time)

changed this line in version 5 of the diff

Manavalan Gajapathy · Answer 27 · Fri Jan 29 2021 10:35:56 GMT+0800 (China Standard Time)

changed this line in version 5 of the diff

Manavalan Gajapathy · Answer 28 · Fri Jan 29 2021 10:35:56 GMT+0800 (China Standard Time)

added 2 commits

458bd8dc - refactors vep to directly write bgzipped output file
4d17bff0 - accepts input vcf via config

Compare with previous version

Manavalan Gajapathy · Answer 29 · Fri Jan 29 2021 10:36:51 GMT+0800 (China Standard Time)

Switched to VEP writing output file directly.

Manavalan Gajapathy · Answer 30 · Fri Jan 29 2021 10:36:58 GMT+0800 (China Standard Time)

Switched to VEP writing output file directly.
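
As a sketch of what "VEP writing the output file directly" could look like: VEP's `--compress_output bgzip` option makes it emit a bgzipped file on its own. The file names below are hypothetical, and the remaining options the pipeline uses are omitted. The function only prints the command.

```shell
# Sketch only: prints the command rather than running VEP.
# File names are hypothetical placeholders.
vep_bgzip_cmd() {
    input_vcf=$1
    output_vcf_gz=$2
    # --compress_output bgzip makes VEP write a bgzipped file itself,
    # so no downstream bcftools or bgzip step is needed.
    printf '%s\n' "vep --input_file $input_vcf --output_file $output_vcf_gz --vcf --compress_output bgzip"
}

vep_bgzip_cmd input.vcf.gz annotated.vcf.gz
```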

Manavalan Gajapathy · Answer 31 · Fri Jan 29 2021 10:37:17 GMT+0800 (China Standard Time)

This part is refactored now.

Manavalan Gajapathy · Answer 32 · Fri Jan 29 2021 10:40:30 GMT+0800 (China Standard Time)

"when the update to the Snakefile is made to make the input VCF configurable"

This part is done now.

Manavalan Gajapathy · Answer 33 · Fri Jan 29 2021 10:44:31 GMT+0800 (China Standard Time)

Yeah that sounds good to me.

dbNSFP formatting we adopted is quite similar to what VEP (plugin) folks suggested. So I think we can just point to that.
For GERP, we can just mention the command used for processing.

Manavalan Gajapathy · Answer 34 · Fri Jan 29 2021 10:58:15 GMT+0800 (China Standard Time)

added 3 commits

e18867d2 - adds test output
e77ea1f0 - inputs datasets config via cli
5531cfd2 - downgrades cluster partition for annotation

Compare with previous version

Manavalan Gajapathy · Answer 35 · Fri Jan 29 2021 11:01:00 GMT+0800 (China Standard Time)

I agree with the first part and I like the idea. Now it needs to be supplied to snakemake via CLI.

For the second part involving README, I'm conflicted. My worry is that it is easy to forget to update README as and when changes are made. But I do see the value of storing it somewhere. Any other way we can track this?

Manavalan Gajapathy · Answer 36 · Fri Jan 29 2021 11:09:50 GMT+0800 (China Standard Time)

added 1 commit

a5d130ce - removes unused code

Compare with previous version

Manavalan Gajapathy · Answer 37 · Fri Jan 29 2021 21:27:43 GMT+0800 (China Standard Time)

How about using symlinked files instead? We can store relevant info this way but don't have to remember to update readme.

Manavalan Gajapathy · Answer 38 · Fri Jan 29 2021 22:13:02 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 08:13

I'm an idiot and don't know why I put "we" in the above comment. I was thinking of documenting the version along with the formatting (i.e. default file versus custom format) at a basic level, for making the repo public and making the work reproducible for others. I should've just combined this with the comment about custom datasets documentation.

Manavalan Gajapathy · Answer 39 · Fri Jan 29 2021 23:31:06 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 09:31

added 1 commit

7f5c3d3c - making changes to add CLI specification of input info

Compare with previous version

Manavalan Gajapathy · Answer 40 · Fri Jan 29 2021 23:34:03 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 09:34

added 1 commit

a098b533 - fixing stupid mistake in CLI arg check

Compare with previous version

Manavalan Gajapathy · Answer 41 · Sat Jan 30 2021 00:16:36 GMT+0800 (China Standard Time)

This is nice. We may also want to add another arg that can be used to pass custom args to the snakemake command. For example, -n, --unlock, etc.
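
One common way to do this in a shell wrapper is to treat everything after a `--` separator as passthrough arguments. The sketch below is illustrative only: the `-v` option name and the `vcf` config key are hypothetical, and it prints the command instead of executing it.

```shell
# Sketch: everything after "--" is forwarded untouched to snakemake.
# The -v option name and the "vcf" config key are hypothetical.
build_snakemake_cmd() {
    input_vcf=""
    while [ $# -gt 0 ]; do
        case $1 in
            -v)
                if [ $# -lt 2 ]; then
                    echo "Option -v needs a value" >&2
                    return 1
                fi
                input_vcf=$2
                shift 2
                ;;
            --)
                shift
                break
                ;;
            *)
                echo "Unknown option: $1" >&2
                return 1
                ;;
        esac
    done
    # Only the passthrough arguments (e.g. -n or --unlock) remain in "$@".
    cmd="snakemake --config vcf=$input_vcf"
    if [ $# -gt 0 ]; then
        cmd="$cmd $*"
    fi
    printf '%s\n' "$cmd"
}

build_snakemake_cmd -v sample.vcf.gz -- -n
```

A real script would run snakemake with `"$@"` rather than printing a string, which preserves arguments that contain spaces.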

Manavalan Gajapathy · Answer 42 · Sat Jan 30 2021 00:26:21 GMT+0800 (China Standard Time)

I would recommend adding set -euo pipefail (or similar) to catch some unexpected errors upfront.
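
To illustrate what `pipefail` changes, here is a small sketch comparing the exit status of a pipeline whose first stage fails, with and without the option. Each case runs in its own bash process so the options do not leak into the calling shell. The function names are hypothetical.

```shell
# In a bash script, the recommended header would be:
#   set -euo pipefail
#     -e           exit as soon as a command fails
#     -u           treat unset variables as errors
#     -o pipefail  a pipeline fails if any stage fails, not just the last

# Without pipefail, the pipeline status is that of the last command (true).
status_default() {
    bash -c 'false | true; echo $?'
}

# With pipefail, the failure of the first stage (false) is reported.
status_pipefail() {
    bash -c 'set -o pipefail; false | true; echo $?'
}

echo "default:  $(status_default)"
echo "pipefail: $(status_pipefail)"
```

This matters for commands such as `bcftools view ... | vep ...`, where a failure early in the pipe would otherwise go unnoticed.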

Manavalan Gajapathy · Answer 43 · Sat Jan 30 2021 01:15:47 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 11:15

Commented on variant_annotation/src/run_pipeline.sh line 21

Let's make that a stretch goal. I get the benefits of it, but don't fully know the complications of implementing in bash right now. For now we can do this by hand with Snakemake if really necessary and that's good enough to get by for now. Does that work?

Manavalan Gajapathy · Answer 44 · Sat Jan 30 2021 01:16:26 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 11:16

added 1 commit

2c93726a - minor updates based on recommendations, added some dataset info to readme

Compare with previous version

Manavalan Gajapathy · Answer 45 · Sat Jan 30 2021 01:16:44 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 11:16

Commented on variant_annotation/src/run_pipeline.sh line 7

good point, added it.

Manavalan Gajapathy · Answer 46 · Sat Jan 30 2021 01:21:12 GMT+0800 (China Standard Time)

Sounds good.

Manavalan Gajapathy · Answer 47 · Sat Jan 30 2021 02:05:50 GMT+0800 (China Standard Time)

As per discussion with @wilkb777, the presence of the dataset config in the README is meant to serve only as an example of the format required for this config, and not as documentation of which datasets (or their versions) were used by the pipeline.

@wilkb777 - Feel free to add more info if needed. Closing this now :)

Manavalan Gajapathy · Answer 48 · Sat Jan 30 2021 02:05:54 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:05

marked the checklist item Snakemake pipeline as completed

Manavalan Gajapathy · Answer 49 · Sat Jan 30 2021 02:07:10 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:07

I added a section to the README now, give it a look and let me know if you think it's good enough for this

Manavalan Gajapathy · Answer 50 · Sat Jan 30 2021 02:07:41 GMT+0800 (China Standard Time)

 into a single file, bgzipped and indexed.

Manavalan Gajapathy · Answer 51 · Sat Jan 30 2021 02:07:51 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:07

ok, the CLI has been added now so I'll resolve the thread.

Manavalan Gajapathy · Answer 52 · Sat Jan 30 2021 02:08:05 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:08

Commented on variant_annotation/README.md line 47

changed this line in version 11 of the diff

Manavalan Gajapathy · Answer 53 · Sat Jan 30 2021 02:08:05 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:08

added 1 commit

a25bf724 - Apply 1 suggestion(s) to 1 file(s)

Compare with previous version

Manavalan Gajapathy · Answer 54 · Sat Jan 30 2021 02:11:52 GMT+0800 (China Standard Time)

resolved all threads

Manavalan Gajapathy · Answer 55 · Sat Jan 30 2021 02:24:21 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:24

marked the checklist item Documentation. I kept it simple; let me know if it needs more info. as completed

Manavalan Gajapathy · Answer 56 · Sat Jan 30 2021 02:24:28 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:24

approved this merge request

Manavalan Gajapathy · Answer 57 · Sat Jan 30 2021 02:24:29 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:24

marked this merge request as ready

Manavalan Gajapathy · Answer 58 · Sat Jan 30 2021 02:27:40 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:27

added 1 commit

82a84ce1 - untracking unused file

Compare with previous version

Manavalan Gajapathy · Answer 59 · Sat Jan 30 2021 02:38:46 GMT+0800 (China Standard Time)

In GitLab by @wilkb777 on Jan 29, 2021, 12:38

added 37 commits

96dd259 - refactors snakefile to standardize a bit
abde556 - adds cluster config
0a3416b - adds snakemake slurm profile as git submodule
77e32c5 - adds snakemake slurm profile as submodule
4fc35d5 - removes pipeline specific logs
9a0de28 - updates snakemake command and resources
2696afd - fixes filepaths
ffabcee - fix to avoid conda env conflict
3188ac6 - bumps up ref version to use
2b16f73 - updates gnomad fields
0613055 - adds test vcf, thanks to Brandon
b16b88e - updates clinvar fields
70b8b06 - changes case for readability
f8a6c79 - annotations with dbnfsp
d74d4f5 - adds more dbnsfp fields
0bd2f7a - changes to improve speed
c67392b - fiexes minor bug
6e29733 - cleans up dir structure and doc
274ba37 - updates doc
4053914 - adds annotated test vcf
f107d64 - fixes resources config bug
635a5d1 - uses VEP's sift and polyphen; changes threads to 8
f9714ad - adds warning file
ddbc558 - breaks long strings to multi-lines
dc2ad99 - bumps cluster partition; update test output
de29e7b - bgzips output vcf
fb0d890 - updates conda env
ffdd618 - refactors vep to directly write bgzipped output file
dc79fdd - accepts input vcf via config
ba9add1 - adds test output
6fae0cc - inputs datasets config via cli
5bd9ab7 - downgrades cluster partition for annotation
eb16285 - removes unused code
c420c81 - making changes to add CLI specification of input info
4b354a7 - fixing stupid mistake in CLI arg check
92113ce - minor updates based on recommendations, added some dataset info to readme
78a485c - Apply 1 suggestion(s) to 1 file(s)

Compare with previous version

Manavalan Gajapathy · Answer 60 · Sat Jan 30 2021 02:56:41 GMT+0800 (China Standard Time)

mentioned in commit 624418a