liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

How are CDR3 scores calculated?

ejohnson643 opened this issue

I have a quick question about the definitions of the various "scores" indicated in the different file types. In particular, in the _cdr3.out file, there is a column called CDR3_score and another called CDR3_germline_similarity. I was checking how these are related and I found that the CDR3_score is highly discretized:

>> cdr3['CDR3_score'].value_counts()
CDR3_score
0.00    87094
1.00    28042
0.83    15706
0.67    14229
0.01     7305
0.50     6924
0.33      297
0.17       33
Name: count, dtype: int64

It only appears to take on values of 0, 0.01, 1/6, 1/3, 1/2, 2/3, 5/6, 1. Is this expected?

On the other hand, I'm assuming that CDR3_germline_similarity is the Hamming distance to the reference or something?

Thanks for your help!

  1. For the scores: 0 means a partial CDR3; 0.01 means the CDR3 contains imputed sequence (based on the germline TCR sequence); 0.5 means a rescued CDR3 sequence, for example because of a very short anchor on the V, J genes; the other values are x/6, where x is the number of satisfied motifs around the CDR3 (YYC at the 5' end, F/WGxG at the 3' end).

  2. The germline similarity is the fraction of matches, based on the edit distance, between the CDR3 and the overlapping germline V and J gene sequences within the CDR3 region.
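
For anyone who wants to label the values from the `value_counts()` output above, here is a minimal sketch of the score categories described in point 1. It is not TRUST4 code: the function name `describe_cdr3_score` and the tolerances are made up for illustration, and it assumes the scores are read as they appear in `_cdr3.out`, rounded to two decimals. Note that 0.5 is numerically the same as 3/6; the sketch follows the reply and labels it as "rescued", since the reply does not say how the two cases would be told apart.

```python
def describe_cdr3_score(score):
    """Label a CDR3_score value as printed in _cdr3.out (two decimals).

    Hypothetical helper for illustration only; the categories come from
    the maintainer's reply, the tolerances are assumptions.
    """
    if score <= 0.005:
        return "partial CDR3"
    if abs(score - 0.01) < 0.005:
        return "contains imputed sequence"
    if abs(score - 0.5) < 0.005:
        # Numerically equal to 3/6; labeled as "rescued" per the reply.
        return "rescued CDR3"
    # Remaining values are x/6, where x counts the satisfied motifs.
    x = round(score * 6)
    if 1 <= x <= 6 and abs(score - x / 6) < 0.006:
        return f"{x}/6 motifs satisfied"
    return "unrecognized score"


if __name__ == "__main__":
    # The distinct values reported in the question.
    for s in [0.00, 1.00, 0.83, 0.67, 0.01, 0.50, 0.33, 0.17]:
        print(f"{s:.2f} -> {describe_cdr3_score(s)}")
```

With pandas, something like `cdr3['CDR3_score'].map(describe_cdr3_score).value_counts()` should give the same counts grouped by category.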

Thank you very much!