How is `cid_full_length` assigned?
ejohnson643 opened this issue
So I've been trying to understand how the `_report.tsv` is generated from the `_barcode_report.tsv` and the `_cdr3.out` files in the `trust-simplerep.pl` script, and I believe that I have correctly determined that the CDR3s counted in the `_report.tsv` are aggregated across barcodes using the V-D-J-C-CDRnt annotations as a unique key. This is all fine, but I was checking whether there were CDR3s in the `_report.tsv` that have the same V-D-J-C and CDR amino acid annotations, to see how much the CDR3s in the file differ from one another. As an example:
So we can see that there are 5 different CDR3s with identical V, J, and CDRaa, but which differ in the exact nucleotide sequences. However, when I look at whether these CDR3s are "full length," only the first one is, despite them all having nearly identical CDR3 nucleotide sequences:
Could someone provide some insight into how exactly the "full length" determination is made and how it could show up differently in these elements of the `_report.tsv`? Thank you!
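For anyone who wants to reproduce this kind of check, here is a minimal Python sketch. It assumes the `_report.tsv` header uses the column names `CDR3nt`, `CDR3aa`, `V`, `D`, `J`, `C`, and `cid_full_length`; please verify that against your own file. The rows in `SAMPLE` are made up purely for illustration, and the helper names `group_by_aa` and `mixed_full_length` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; the column names are an assumption
# about the _report.tsv header and should be checked against a real file.
SAMPLE = (
    "#count\tfrequency\tCDR3nt\tCDR3aa\tV\tD\tJ\tC\tcid\tcid_full_length\n"
    "12\t0.5\tTGTGCCAGCAGCTTAGGCTTC\tCASSLGF\tTRBV7-9*01\t.\tTRBJ2-1*01\tTRBC2\tcontig_a\t1\n"
    "3\t0.125\tTGCGCCAGCAGCTTGGGCTTT\tCASSLGF\tTRBV7-9*01\t.\tTRBJ2-1*01\tTRBC2\tcontig_b\t0\n"
    "9\t0.375\tTGTGCCAGCAGCCCAGGCTTC\tCASSPGF\tTRBV7-9*01\t.\tTRBJ2-1*01\tTRBC2\tcontig_c\t1\n"
)


def group_by_aa(handle):
    """Group rows by (V, D, J, C, CDR3aa); keep each row's CDR3nt and flag."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["V"], row["D"], row["J"], row["C"], row["CDR3aa"])
        groups[key].append((row["CDR3nt"], row["cid_full_length"]))
    return groups


def mixed_full_length(groups):
    """Return only the groups whose rows disagree on cid_full_length."""
    return {
        key: rows
        for key, rows in groups.items()
        if len({flag for _, flag in rows}) > 1
    }


if __name__ == "__main__":
    groups = group_by_aa(io.StringIO(SAMPLE))
    for key, rows in mixed_full_length(groups).items():
        print(key)
        for cdr3nt, flag in rows:
            print(f"  {cdr3nt}  cid_full_length={flag}")
```

To run it on a real report, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("sample_report.tsv", newline="")`, where the file name is a placeholder.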
"Full length" means that the underlying contig contains the full-length receptor variable domain, from the 5' end of the V gene to the 3' end of the J gene. This is a stricter criterion than having a complete CDR3.
Thanks for your help!
To make sure I understand: `cid_full_length` is a property of the contig, not necessarily of the CDR3.
The idea behind including this information in the `_report.tsv` is that it lets the user know whether the CDR3 was generated from a contig that contains the ends of the V and J genes?
Exactly.
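To illustrate the distinction discussed above, here is a small Python sketch of the stated definition. It is not TRUST4's implementation: the `GeneHit` record, its fields, and the `is_full_length` function are hypothetical, and the gene lengths are made-up numbers. It only shows why two contigs with nearly identical CDR3s can get different flags: the flag depends on how far the contig extends into the V and J genes, not on the CDR3 itself.

```python
from dataclasses import dataclass


@dataclass
class GeneHit:
    """Hypothetical record: the part of a reference gene that a contig covers."""
    gene_length: int    # length of the reference gene in nucleotides
    covered_start: int  # first covered reference position (0-based)
    covered_end: int    # last covered reference position (0-based, inclusive)


def is_full_length(v_hit: GeneHit, j_hit: GeneHit) -> bool:
    """True if the contig spans from the 5' end of V to the 3' end of J."""
    reaches_v_5prime = v_hit.covered_start == 0
    reaches_j_3prime = j_hit.covered_end == j_hit.gene_length - 1
    return reaches_v_5prime and reaches_j_3prime


# Both contigs cover the 3' end of V and all of J, so both could carry a
# complete CDR3, but only the first reaches the 5' end of the V gene.
v_complete = GeneHit(gene_length=290, covered_start=0, covered_end=289)
v_truncated = GeneHit(gene_length=290, covered_start=35, covered_end=289)
j_complete = GeneHit(gene_length=50, covered_start=0, covered_end=49)

print(is_full_length(v_complete, j_complete))   # True
print(is_full_length(v_truncated, j_complete))  # False
```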