VCF annotation using VEP - [merged]
ManavalanG opened this issue · comments
Merges vep_annotation -> master
Annotates variants in VCF using Variant Effect Predictor (VEP).
The following may need review at some level. Note that this list is not exhaustive; other parts may need review as well.
- Snakemake pipeline
- Source code directory structure. It may be a good idea to spend some time on this so the repo with several moving parts will remain manageable.
- Documentation. I kept it simple; let me know if it needs more info.
- Datasets used and their versions
We currently obtain PolyPhen and SIFT scores via dbNSFP. However, VEP can annotate them natively, without extra work, via the options --sift and --polyphen, and they recommend as much as well, since dbNSFP includes only the non-synonymous variants.
Should we switch?
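For reference, the switch would mostly come down to two extra flags on the VEP call. A minimal sketch, assuming a cache-based offline run: the helper name `vep_args` and the file names are hypothetical, and the value `b` asks VEP for both the prediction term and the score. The helper only prints the arguments so they can be inspected without running VEP.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the pipeline): prints the VEP arguments,
# one per line, for a run that takes SIFT and PolyPhen from VEP itself.
vep_args() {
  local in_vcf="$1"
  local out_vcf="$2"
  printf '%s\n' \
    --input_file "$in_vcf" \
    --output_file "$out_vcf" \
    --format vcf \
    --vcf \
    --cache --offline \
    --sift b \
    --polyphen b
}

# Usage (assumes vep is on PATH and a cache is installed):
#   mapfile -t args < <(vep_args input.vcf.gz annotated.vcf)
#   vep "${args[@]}"
```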
In GitLab by @wilkb777 on Jan 27, 2021, 15:35
yes, that's totally fine and I think we should use those options. The originating source for that info doesn't have to be dbNSFP, it just happens to be the most convenient way to obtain that info in a bulk download.
In GitLab by @wilkb777 on Jan 28, 2021, 09:14
Commented on variant_annotation/configs/env/vep.yaml line 7
- Is this particular version of BCFTools needed? I've been using 1.10.2, which has a big bug fix (https://github.com/samtools/bcftools/releases/tag/1.10.2), and the 1.10 release has a bunch of bug fixes in it too. 1.11 is available now as well, but that mostly appears to be a feature enhancement release.
- Is this particular version of tabix needed? BCFTools comes with tabix because it uses tabix indexes for some of its processing commands. I'd recommend using the accordingly packaged version of tabix to avoid issues unless absolutely needed.
changed this line in version 4 of the diff
added 7 commits
- 665db894 - fixes resources config bug
- c06137e4 - uses VEP's sift and polyphen; changes threads to 8
- 68723b88 - adds warning file
- 5b5bec96 - breaks long strings to multi-lines
- b737497d - bumps cluster partition; update test output
- d8368088 - bgzips output vcf
- 5a982790 - updates conda env
- bcftools v1.11 has conflict with VEP and after noticing this, I just went back to my older setting which was v1.9. I tested v1.10.2 just now and it works without conflict.
- Good catch. Checking it now, even VEP includes tabix, which I didn't realize as they recommend installing tabix separately in their documentation. Bioconda recipe probably just chose to include them I guess.
Changed them now.
Also made these changes recently:
- Bumped annotation to use 8 instead of 4 threads (#1)
- Made the output file bgzipped - as unzipped files are massive (~12GB) and they may be I/O expensive
- Upgraded cluster partition to use
- Style updates and minor improvements.
In GitLab by @wilkb777 on Jan 28, 2021, 12:29
Commented on variant_annotation/configs/env/vep.yaml line 7
Nice, I figured it was something like that but thought it was worth checking.
In GitLab by @wilkb777 on Jan 28, 2021, 12:29
resolved all threads
In GitLab by @wilkb777 on Jan 28, 2021, 17:05
Commented on variant_annotation/src/Snakefile line 136
| bcftools view -Oz \
switch to using bcftools to generate bgzip output
In GitLab by @wilkb777 on Jan 28, 2021, 17:05
Commented on variant_annotation/src/Snakefile line 137
remove as bcftools can do compression directly
In GitLab by @wilkb777 on Jan 28, 2021, 17:06
Commented on variant_annotation/src/Snakefile line 115
bcftools view {input.calls} | \
removing accidental parentheses
In GitLab by @wilkb777 on Jan 28, 2021, 17:06
Commented on variant_annotation/src/Snakefile line 138
> {output.calls}
removing accidental parentheses
In GitLab by @wilkb777 on Jan 28, 2021, 17:12
marked the checklist item Source code directory structure. It may be a good idea to spend some time on this so the repo with several moving parts will remain manageable. as completed
In GitLab by @wilkb777 on Jan 28, 2021, 17:12
marked the checklist item Datasets used and their versions as completed
In GitLab by @wilkb777 on Jan 28, 2021, 17:17
@ManavalanG I made #3 to remind us to consolidate the repo structure once major components are all merged.
In GitLab by @wilkb777 on Jan 28, 2021, 17:21
I think it'd be best to move datasets.yaml out of this repo and make it a user config file with a hardcoded path like ~/.ditto_datasets.yaml and then lay out instructions on its format in the README. That way we won't be sharing any of our internal lab file structure when making the repo public.
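One way the run script could pick up such a user-level config is sketched below. This is only an illustration: the function name `resolve_datasets_config` and the optional override argument are assumptions, and only the default path ~/.ditto_datasets.yaml comes from the suggestion above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve the datasets config path.
# Uses the first argument if given, otherwise falls back to the
# per-user default ~/.ditto_datasets.yaml.
# Prints the path on success; prints an error and returns 1 if the
# file does not exist.
resolve_datasets_config() {
  local path="${1:-$HOME/.ditto_datasets.yaml}"
  if [ ! -f "$path" ]; then
    echo "datasets config not found: $path" >&2
    return 1
  fi
  printf '%s\n' "$path"
}

# Usage:
#   datasets_config="$(resolve_datasets_config)" || exit 1
```

Failing early on a missing file keeps the error close to its cause instead of surfacing later inside a Snakemake rule.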
In GitLab by @wilkb777 on Jan 28, 2021, 17:27
I'm torn on how much info about the custom datasets needs to be distributed. Specifically the custom formatting done for GERP and dbNSFP usage. I do not think we need to go crazy on that, but maybe just put a short description on how others could produce them. Thoughts @ManavalanG ?
In GitLab by @wilkb777 on Jan 28, 2021, 17:28
I guess adding version numbers of external datasets would be good to put in the README with this change so we know what version was used when and the expected file format.
In GitLab by @wilkb777 on Jan 28, 2021, 17:30
@ManavalanG I'm done with the initial review! I'll re-review when the update to the Snakefile is made to make the input VCF configurable and then ping again when I change the run script to handle commandline specification of input VCF and local vs slurm job execution.
changed this line in version 5 of the diff
changed this line in version 5 of the diff
changed this line in version 5 of the diff
changed this line in version 5 of the diff
added 2 commits
- 458bd8dc - refactors vep to directly write bgzipped output file
- 4d17bff0 - accepts input vcf via config
Switched to VEP writing output file directly.
Switched to VEP writing output file directly.
Removed.
This part is refactored now.
when the update to the Snakefile is made to make the input VCF configurable
This part is done now.
Yeah that sounds good to me.
- dbNSFP formatting we adopted is quite similar to what VEP (plugin) folks suggested. So I think we can just point to that.
- For GERP, we can just mention the command used for processing.
added 3 commits
- e18867d2 - adds test output
- e77ea1f0 - inputs datasets config via cli
- 5531cfd2 - downgrades cluster partition for annotation
I agree with the first part and I like the idea. Now it needs to be supplied to snakemake via CLI.
For the second part involving README, I'm conflicted. My worry is that it is easy to forget to update README as and when changes are made. But I do see the value of storing it somewhere. Any other way we can track this?
How about using symlinked files instead? We can store relevant info this way but don't have to remember to update readme.
In GitLab by @wilkb777 on Jan 29, 2021, 08:13
I'm an idiot and don't know why I put "we" in the above comment. I was thinking documenting the version with the formatting (i.e. just use default file versus custom format) on a basic level for making the repo public and making the work reproducible for others. I should've just combined this with the comment about custom datasets documentation.
In GitLab by @wilkb777 on Jan 29, 2021, 09:31
added 1 commit
- 7f5c3d3c - making changes to add CLI specification of input info
In GitLab by @wilkb777 on Jan 29, 2021, 09:34
added 1 commit
- a098b533 - fixing stupid mistake in CLI arg check
This is nice. We may also want to add another arg that can be used to pass custom args to the snakemake command. For example, -n, --unlock, etc.
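A rough sketch of how such pass-through could work in bash, assuming everything after a literal `--` is forwarded to snakemake untouched. The function name `build_snakemake_cmd` and the config key `vcf` are hypothetical; the function only prints the command, one word per line, rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the first argument is the input VCF; anything after
# an optional "--" separator is appended to the snakemake command unchanged.
build_snakemake_cmd() {
  local vcf="$1"
  shift
  # Drop the separator if the caller supplied one.
  if [ "${1:-}" = "--" ]; then
    shift
  fi
  printf '%s\n' snakemake --use-conda --config "vcf=$vcf" "$@"
}

# Usage (dry run plus unlock, forwarded as-is):
#   mapfile -t cmd < <(build_snakemake_cmd sample.vcf.gz -- -n --unlock)
#   "${cmd[@]}"
```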
I would recommend adding set -euo pipefail (or similar) to catch some unexpected errors upfront.
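To illustrate what the setting buys us, here is a small self-contained comparison. The function names `lenient` and `strict` are made up for the demo; each runs a tiny script in a child bash so the calling shell's options are left alone.

```shell
#!/usr/bin/env bash
# Default behaviour: a pipeline's status is that of its last command (cat),
# so the failure of `false` goes unnoticed and the script carries on.
lenient() {
  bash -c 'false | cat; echo "continued"'
}

# Strict mode: pipefail makes the whole pipeline fail, and -e stops the
# script at that point, so "continued" is never printed.
strict() {
  bash -c 'set -euo pipefail; false | cat; echo "continued"'
}

# Usage:
#   lenient   # prints "continued"
#   strict    # prints nothing and returns a non-zero status
```

The -u part additionally turns references to unset variables into errors, which is useful for catching a missing CLI argument early.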
In GitLab by @wilkb777 on Jan 29, 2021, 11:15
Commented on variant_annotation/src/run_pipeline.sh line 21
Let's make that a stretch goal. I get the benefits of it, but don't fully know the complications of implementing in bash right now. For now we can do this by hand with Snakemake if really necessary and that's good enough to get by for now. Does that work?
In GitLab by @wilkb777 on Jan 29, 2021, 11:16
added 1 commit
- 2c93726a - minor updates based on recommendations, added some dataset info to readme
In GitLab by @wilkb777 on Jan 29, 2021, 11:16
Commented on variant_annotation/src/run_pipeline.sh line 7
good point, added it.
Sounds good.
As per discussion with @wilkb777, presence of dataset config in README is meant to serve only as an example of format required for this config and not as documentation for which datasets (or its versions) were used by the pipeline.
@wilkb777 - Feel free to add more info if needed. Closing this now :)
In GitLab by @wilkb777 on Jan 29, 2021, 12:05
marked the checklist item Snakemake pipeline as completed
In GitLab by @wilkb777 on Jan 29, 2021, 12:07
I added a section to the README now, give it a look and let me know if you think it's good enough for this
into a single file, bgzipped and indexed.
In GitLab by @wilkb777 on Jan 29, 2021, 12:07
ok, the CLI has been added now so I'll resolve the thread.
In GitLab by @wilkb777 on Jan 29, 2021, 12:08
Commented on variant_annotation/README.md line 47
changed this line in version 11 of the diff
In GitLab by @wilkb777 on Jan 29, 2021, 12:08
added 1 commit
- a25bf724 - Apply 1 suggestion(s) to 1 file(s)
LG2M
resolved all threads
In GitLab by @wilkb777 on Jan 29, 2021, 12:24
marked the checklist item Documentation. I kept it simple; let me know if it needs more info. as completed
In GitLab by @wilkb777 on Jan 29, 2021, 12:24
approved this merge request
In GitLab by @wilkb777 on Jan 29, 2021, 12:24
marked this merge request as ready
In GitLab by @wilkb777 on Jan 29, 2021, 12:38
added 37 commits
- 96dd259 - refactors snakefile to standardize a bit
- abde556 - adds cluster config
- 0a3416b - adds snakemake slurm profile as git submodule
- 77e32c5 - adds snakemake slurm profile as submodule
- 4fc35d5 - removes pipeline specific logs
- 9a0de28 - updates snakemake command and resources
- 2696afd - fixes filepaths
- ffabcee - fix to avoid conda env conflict
- 3188ac6 - bumps up ref version to use
- 2b16f73 - updates gnomad fields
- 0613055 - adds test vcf, thanks to Brandon
- b16b88e - updates clinvar fields
- 70b8b06 - changes case for readability
- f8a6c79 - annotations with dbnfsp
- d74d4f5 - adds more dbnsfp fields
- 0bd2f7a - changes to improve speed
- c67392b - fiexes minor bug
- 6e29733 - cleans up dir structure and doc
- 274ba37 - updates doc
- 4053914 - adds annotated test vcf
- f107d64 - fixes resources config bug
- 635a5d1 - uses VEP's sift and polyphen; changes threads to 8
- f9714ad - adds warning file
- ddbc558 - breaks long strings to multi-lines
- dc2ad99 - bumps cluster partition; update test output
- de29e7b - bgzips output vcf
- fb0d890 - updates conda env
- ffdd618 - refactors vep to directly write bgzipped output file
- dc79fdd - accepts input vcf via config
- ba9add1 - adds test output
- 6fae0cc - inputs datasets config via cli
- 5bd9ab7 - downgrades cluster partition for annotation
- eb16285 - removes unused code
- c420c81 - making changes to add CLI specification of input info
- 4b354a7 - fixing stupid mistake in CLI arg check
- 92113ce - minor updates based on recommendations, added some dataset info to readme
- 78a485c - Apply 1 suggestion(s) to 1 file(s)
mentioned in commit 624418a