How is `cid_full_length` assigned?

Question


ejohnson643 opened this issue 10 months ago · comments

I've been trying to understand how the _report.tsv is generated from the _barcode_report.tsv and _cdr3.out files in the trust-simplerep.pl script. I believe I have correctly determined that the CDR3s counted in the _report.tsv are aggregated across barcodes using the V-D-J-C-CDRnt annotations as a unique key. This is all fine, but I wanted to check whether there are CDR3s in the _report.tsv that share the same V-D-J-C and CDR amino acid annotations, to see how much the CDR3s in the file differ from one another. As an example:

So we can see that there are 5 different CDR3s with identical V, J, and CDRaa annotations that differ in their exact nucleotide sequences. However, when I check whether these CDR3s are "full length," only the first one is, even though their CDR3 nucleotide sequences are nearly identical:

Could someone provide some insight into how exactly the "full length" determination is made, and how it can come out differently for these entries of the _report.tsv? Thank you!
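
For anyone who wants to repeat the check described above, here is a rough Python sketch of it. It groups report rows by their V-D-J-C and CDR3 amino acid annotations and keeps the groups that contain more than one distinct nucleotide sequence, along with each sequence's full-length flag. The column names (`V`, `D`, `J`, `C`, `CDR3aa`, `CDR3nt`, `cid_full_length`) are my assumption about the _report.tsv header, and the rows are made-up toy values, not real output.

```python
from collections import defaultdict


def group_by_aa_clonotype(rows):
    """Group rows by (V, D, J, C, CDR3aa) and keep only the groups that
    contain more than one distinct CDR3 nucleotide sequence.

    `rows` is a list of dicts keyed by column name. The column names are
    assumed, not taken from the TRUST4 documentation.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["V"], row["D"], row["J"], row["C"], row["CDR3aa"])
        groups[key].append((row["CDR3nt"], row["cid_full_length"]))

    # Keep groups whose members differ at the nucleotide level.
    return {
        key: members
        for key, members in groups.items()
        if len({nt for nt, _ in members}) > 1
    }


# Toy rows with placeholder gene names (hypothetical, for illustration only).
rows = [
    {"V": "geneV_a", "D": ".", "J": "geneJ_a", "C": "geneC_a",
     "CDR3aa": "CAF", "CDR3nt": "TGTGCCTTT", "cid_full_length": "1"},
    {"V": "geneV_a", "D": ".", "J": "geneJ_a", "C": "geneC_a",
     "CDR3aa": "CAF", "CDR3nt": "TGTGCATTT", "cid_full_length": "0"},
    {"V": "geneV_b", "D": ".", "J": "geneJ_a", "C": "geneC_a",
     "CDR3aa": "CAF", "CDR3nt": "TGTGCCTTT", "cid_full_length": "1"},
]

for key, members in group_by_aa_clonotype(rows).items():
    print(key)
    for nt, full_length in members:
        print("   ", nt, "full_length =", full_length)
```

To run this on a real file, the rows could be read with `csv.DictReader(handle, delimiter="\t")`, adjusting the keys to whatever the header actually contains.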

Li Song commented 10 months ago

Exactly.

Li Song · Answer 1 · Fri Oct 06 2023 02:03:16 GMT+0800 (China Standard Time)

Full length means that the underlying contig contains the full-length receptor variable domain, from the 5' end of the V gene to the 3' end of the J gene. It is a stricter criterion than having a complete CDR3.

Eric Johnson · Answer 2 · Fri Oct 06 2023 02:22:40 GMT+0800 (China Standard Time)

Thanks for your help!

To make sure I understand: cid_full_length is a property of the contig, not necessarily of the CDR3.

So the idea behind including this information in the _report.tsv is that it lets the user know whether the CDR3 was generated from a contig that contains the ends of the V and J genes?