How are CDR3 scores calculated?
ejohnson643 opened this issue
I have a quick question about the definitions of the various "scores" indicated in the different file types. In particular, in the _cdr3.out file, there is a column called CDR3_score and another called CDR3_germline_similarity. I was checking how these are related and I found that the CDR3_score is highly discretized:
>> cdr3['CDR3_score'].value_counts()
CDR3_score
0.00 87094
1.00 28042
0.83 15706
0.67 14229
0.01 7305
0.50 6924
0.33 297
0.17 33
Name: count, dtype: int64
It only appears to take on values of 0, 0.01, 1/6, 1/3, 1/2, 2/3, 5/6, 1. Is this expected?
On the other hand, I'm assuming that CDR3_germline_similarity is the Hamming distance to the reference or something?
Thanks for your help!
-
For the scores: 0 means a partial CDR3; 0.01 means the CDR3 contains imputed sequence (based on the germline TCR sequence); 0.5 is a rescued CDR3 sequence, for example one with a very short anchor on the V or J gene; the other numbers are x/6, where x is the number of satisfied motifs around the CDR3 (YYC at the 5' end, F/WGxG at the 3' end).
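For anyone reading the column programmatically, here is a minimal sketch of how the rounded values could be mapped back to the categories described above. The helper name `describe_cdr3_score` and the label strings are hypothetical, not part of the tool; the sketch assumes scores are rounded to two decimals, as in the value counts in the question. Note that 3/6 would also round to 0.5, and the reply does not say how to tell that case apart from a rescued CDR3, so the sketch simply labels 0.5 as rescued.

```python
import math


def describe_cdr3_score(score: float) -> str:
    """Map a two-decimal CDR3_score to the category given in the reply.

    Hypothetical helper for illustration; the labels are not tool output.
    """
    tol = 1e-9
    if math.isclose(score, 0.0, abs_tol=tol):
        return "partial CDR3"
    if math.isclose(score, 0.01, abs_tol=tol):
        return "contains imputed (germline-based) sequence"
    if math.isclose(score, 0.5, abs_tol=tol):
        # Assumption: 0.5 is treated as "rescued", following the reply;
        # 3/6 would be numerically identical after rounding.
        return "rescued CDR3"
    # Remaining values are x/6, printed with two decimals (e.g. 0.17, 0.83).
    x = round(score * 6)
    if 1 <= x <= 6 and abs(score - x / 6) <= 0.005 + tol:
        return f"{x}/6 motifs satisfied"
    return "unrecognized"


if __name__ == "__main__":
    for value in (0.0, 0.01, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0):
        print(value, "->", describe_cdr3_score(value))
```

With the DataFrame from the question, something like cdr3['CDR3_score'].map(describe_cdr3_score).value_counts() should then give counts per category instead of per raw value.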
-
The germline similarity is the fraction of matches, computed from the edit distance, between the CDR3 and the overlapping germline V and J gene sequences within the CDR3 region.
Thank you very much!