quinlan-lab / ccrhtml

A small repo for storing the code for making the files and html for CCRs.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Question on the bed files

Bin-Guan opened this issue · comments

Hi, I used the following variants in KCNQ2 to test the CCR bed file for annotation, and found that the bed regions do not make sense. The tested variants are the three overlapping a nucleotide with CCR percentile 0 according to the browser (https://s3.us-east-2.amazonaws.com/ccrs/ccr.html). I expect the first line 20-62076114-C-A to be 0 CCR percentile, since it's synonymous. I suspect the bed file coordinates are off by 1? Thanks.
20 62076114 KCNQ2:p.A196A C A . . ccr_pct=99.7099;fathmm_xf_coding=0.008712
20 62076115 KCNQ2:p.A196V G A . . ccr_pct=0;ClinPred_Score=0.99;fathmm_xf_coding=0.9454
20 62076116 KCNQ2:p.A196S C A . . ccr_pct=67.1065;ClinPred_Score=0.9841;fathmm_xf_coding=0.9445

The CCR bed file looks like this:
20 62076011 62076114 99.709937837 KCNQ2 62073863-62073884,62076011-62076114 ....
20 62076114 62076115 0.000000000 KCNQ2 62076114-62076115 ...
20 62076115 62076130 67.106472851 KCNQ2 62076115-62076130 ...

Thanks for your reply. My question is whether the coordinates in the bed file is off by 1. In the examples I gave above, if there is a nucleotide to be ccr_pct = 0 among the three nucleotides chr20:62076113-62076115, it has to be 62076114 because this is the 3rd nucleotide in the codon 196. Thus, I think the bed file should be like this:
20 62076113 62076114 0.000000000 KCNQ2 62076114-62076115 ...

Even you flatten all KCNQ2 transcripts, coordinate 114's ccr_pct should still be 0, which would also match GERP+ as shown in your CCR browser https://s3.us-east-2.amazonaws.com/ccrs/ccr.html. The GERP++ score at 114 is negative, and GERP++ score at 115 is high. If you scroll to other regions, I bet you'll also find that ccr_pcts are off 1 nt comparing to GERP++.

I am not sure I see what you are trying to say anymore. What makes you say it would be 0? The CCR is correct. It goes from 012-114 as seen here below:
image
CCR at 115 corresponds with a passing gnomAD variant (you can check that on the browser) which is in 1-based VCF format which gives it a 0:
image
image
The GERP file is base-by-base with each single base having a value. Not sure what relevance GERP has though...as it has nothing to do with CCR or its calculation. They do not agree in all bases, as GERP is interspecies conservation-based and CCRs are based on human-specific variant constraint.

If 115 is 0, then 114 cannot be 99.7. I've been trying to remind you the same thing using different reasoning: a nucleotide position at the 3rd position in a codon, GERP score, or another predictor such as ClinPred. I have no doubt that the ccr matches constraint regions when you look from 1000 ft away. But I'm certain that the boundary is off 1 nt. The boundary has to be in agreement with basic biology concept and with other tools.

The basic biological concept of a protein-altering variant within the coding sequence of several KCNQ2 transcripts is shown below...
image These are the only things that define a CCR boundary, the beginning or end of a gene's coding sequence from GRCh37, or a protein-altering variant in gnomAD version 2.0.1. CCR ends where that variant begins, at the BED file coordinate of 114-115. It is not off by 1 nt, that is where the 0 pct CCR and variant is, 114-115 with the variant being a simple missense variant that you can check here https://www.ncbi.nlm.nih.gov/snp/rs118192199. The next CCR starts right after this variant at 115 (BED coordinate) ending at 130, because the next variant is at 130-131 in BED format, 131 in VCF format.
This is why I keep trying to explain that important distinction between 0-based half-open BED vs 1-based VCF format. This is not 1000ft away, these are precise, single nucleotide boundaries.

Bin, I think you may be getting confused by the different ways of representing positions. As Jim mentioned BED is "0-based half-open." For further reading see:

https://www.biostars.org/p/7041/
http://genome.ucsc.edu/FAQ/FAQformat.html#format1

(I think this is how python, for example, indexes a vector or list)

VCF is 1-based:

http://samtools.github.io/hts-specs/VCFv4.2.pdf (see 1.4.1 "POS").

(This how R indexes a vector)

If you interconvert between BED and VCF you need to transpose the start position by one.

I actually studied this cheat sheet on 0- or 1- before asking the question: https://www.biostars.org/p/84686/. So I don't think the question is 0-based or 1-based.
Here is a probable explanation: CCR only looks at missense variant and totally ignores synonymous variant, thus CCR considers position 114 (1-based) ccr_pct to be 99.7. Thus, position 114 is "constrained" based on gnomAD missense dataset, although gnomAD has a synonymous variant at this position and it has a low GERP++. Position 114 is not constrained in light of being not conserved during evolution and the synonymous change in gnomAD. This would happen to almost all positions when transitioning between the 3rd and the 2nd nt in a codon. Thus, I saw the 1-shift in the ccr browser. A better constrained matrix probably should consider both missense and synonymous changes, and only regions void of synonymous changes is bona fide constrained regions. In practice, the shift probably will not matter because very few people are interested in synonymous changes.
Thanks Jim and David for the discussions.

The discussion is not a problem. If there's nothing else, feel free to close the issue, as this is a repository for the browser, not the model's design.