How are CDR3 scores calculated?

Question

ejohnson643 opened this issue 10 months ago

I have a quick question about the definitions of the various "scores" indicated in the different file types. In particular, in the _cdr3.out file, there is a column called CDR3_score and another called CDR3_germline_similarity. I was checking how these are related and I found that the CDR3_score is highly discretized:

>> cdr3['CDR3_score'].value_counts()
CDR3_score
0.00    87094
1.00    28042
0.83    15706
0.67    14229
0.01     7305
0.50     6924
0.33      297
0.17       33
Name: count, dtype: int64

It only appears to take on values of 0, 0.01, 1/6, 1/3, 1/2, 2/3, 5/6, 1. Is this expected?

On the other hand, I'm assuming that CDR3_germline_similarity is the Hamming distance to the reference or something?

Thanks for your help!

Li Song · Answer 1 · Wed Oct 04 2023 04:08:14 GMT+0800 (China Standard Time)

For the scores, 0 means a partial CDR3; 0.01 means the CDR3 contains imputed sequence (based on the germline TCR sequence); 0.5 means a rescued CDR3 sequence, for example one recovered despite a very short anchor on the V or J gene. The other values are x/6, where x is the number of satisfied motifs around the CDR3 (YYC at the 5' end, F/WGxG at the 3' end).
The germline similarity is the fraction of matches in the edit-distance alignment between the CDR3 region and the overlapping germline V and J gene sequences.
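
As a rough illustration of the score meanings described above, here is a minimal Python sketch that maps a CDR3_score value to a label. The function name `describe_cdr3_score`, the label strings, and the tolerance are assumptions made for this example and are not part of the tool. It also assumes the scores are printed with two decimals (so 1/6 appears as 0.17) and, following the answer, treats 0.5 as the "rescued" case rather than as 3/6.

```python
import math


def describe_cdr3_score(score, tol=0.006):
    """Hypothetical helper: label a CDR3_score per the explanation above.

    tol absorbs two-decimal rounding (e.g. 0.17 vs 1/6 = 0.1667).
    """
    if math.isclose(score, 0.0, abs_tol=1e-9):
        return "partial CDR3"
    if math.isclose(score, 0.01, abs_tol=1e-9):
        return "contains imputed sequence"
    if math.isclose(score, 0.5, abs_tol=1e-9):
        # Per the answer, 0.5 marks a rescued CDR3 (assumed not to mean 3/6).
        return "rescued CDR3"
    x = round(score * 6)  # candidate number of satisfied motifs
    if 1 <= x <= 6 and math.isclose(score, x / 6, abs_tol=tol):
        return f"{x}/6 motifs satisfied"
    return "unrecognized"


if __name__ == "__main__":
    # Score values taken from the value_counts() output in the question.
    for s in [0.00, 1.00, 0.83, 0.67, 0.01, 0.50, 0.33, 0.17]:
        print(f"{s:.2f} -> {describe_cdr3_score(s)}")
```

With the DataFrame from the question, something like `cdr3['CDR3_score'].map(describe_cdr3_score).value_counts()` should give the same counts grouped by label.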

Eric Johnson · Answer 2 · Wed Oct 04 2023 23:00:42 GMT+0800 (China Standard Time)

Thank you very much!