liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

almost same CDR3 nucleotides from different assemblies

yuyuleung opened this issue · comments

Hello,

I have attempted to use TRUST4 to assemble my SE RNAseq data and have learned the basics of how it works. However, I am still confused about the final report.

In the .report file, I noticed that the top three CDR3 sequences have almost the same nucleotide sequence, with only a few differences in the last few 'T's. What confuses me is that these sequences were assembled separately, and the top-ranked assembly has an out-of-frame CDR3 amino acid sequence. I think it would make more sense to assemble these reads/contigs during the assembly instead of separating them. Have I misunderstood?

Can you please clarify my doubts? Thank you so much!

Best wishes,
Yuyu
test_report.txt

Thank you for sharing the issue. Since this is an insertion/deletion difference, TRUST4 will separate them into different contigs.

Hello, thank you for your message. I realigned the top three assembled contigs to the reference using MiXCR. Its results indicate that the other two contigs (assemble 00 and 16) can be translated with specific insertions or deletions (giving a CDR3aa different from that of assemble 80). I am unsure how to handle these two different translation methods, as I plan to use the CDR3aa sequence in my subsequent study. Does TRUST4 take any biological considerations into account for these insertions/deletions?

I would greatly value any official advice you can offer me on this matter.

Best wishes,
Yuyu

assemble00 assemble16

The inserted/deleted nucleotide puts the sequence out of frame, which makes this a non-functional VDJ recombination. Even though a translation can be assigned to part of the sequence, there will still be an untranslatable position in it. For example, you can see the "_" in the translated sequences for assemble 0 and 16. In practice, you can ignore the non-functional CDR3s, i.e., those that are out of frame or whose amino acid sequence contains a stop codon.

If you are concerned about the out_of_frame assembly getting the highest abundance (which is indeed strange), you can share the "toassemble.fq" files with me. I can look into the issue. Thanks.
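The filtering rule described above (ignore CDR3s that are out of frame or contain a stop codon) can be sketched in Python. This is a minimal illustration and not TRUST4's own code: the function names `translate` and `is_functional_cdr3` are hypothetical, the check works directly on a CDR3 nucleotide string, and it assumes the standard genetic code.

```python
# Standard genetic code, laid out over the base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    """Translate complete codons from the start; trailing bases are dropped.

    Codons with non-ACGT characters are rendered as '?'.
    """
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "?") for i in range(0, usable, 3))


def is_functional_cdr3(cdr3_nt):
    """Return True if the CDR3 is in frame and has no stop codon.

    Hypothetical helper: 'functional' here only means the two criteria
    named in the thread, nothing more.
    """
    if not cdr3_nt or len(cdr3_nt) % 3 != 0:
        return False  # out of frame
    return "*" not in translate(cdr3_nt)


if __name__ == "__main__":
    for seq in ("TGTGCCAGC", "TGTGCCAGCT", "TGTTAGAGC"):
        print(seq, is_functional_cdr3(seq))
```

Applied to the CDR3 nucleotide column of a report, a check like this keeps only the in-frame, stop-free entries for downstream use.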

Hello, thank you for your explanation. I am indeed confused about the high abundance of the first non-functional assembled consensus. I would be grateful if you could assist me in investigating this issue. I have attached the corresponding "toassemble.fq" file for your reference. Your patience is greatly appreciated.

V350112705_L03_toassemble.zip

Best wishes,
Yuyu

Thank you for sharing the file. I tested TRUST4 and observed similar results. I checked the reverse complement of "CCAGAAGCTGGTTTTTGGCCAGGGG", which spans the junction between assemble0's CDR3 and the J gene sequence, and 26277 reads contain this pattern. In contrast, only 761 reads contain the sequence "CCAGAAGCTGGTTTTGGCCAGGGG" corresponding to assemble8. Therefore, based on your data, the out_of_frame assemble0 is more likely to be real.

Just to confirm, which species is your data from? Is it mouse? It is strange that the assembled TRAV has only 97.83% identity to the germline V gene. Furthermore, why is only TRA detected in this data? Is it SMART-seq single-cell data? Thank you.
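The read-support check described above can be reproduced with a short script. The following is a sketch under stated assumptions, not the maintainer's actual procedure: it assumes an uncompressed FASTQ with exactly four lines per record, counts a read once if it contains the motif on either strand, and uses the hypothetical function names `reverse_complement` and `count_reads_with_motif`.

```python
import io

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T, N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_reads_with_motif(fastq_handle, motif):
    """Count reads whose sequence contains the motif or its reverse complement.

    Assumes a plain 4-line-per-record FASTQ; the sequence is the second
    line of each record.
    """
    motif = motif.upper()
    rc = reverse_complement(motif)
    count = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:
            seq = line.strip().upper()
            if motif in seq or rc in seq:
                count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory FASTQ standing in for a real "toassemble.fq" file.
    five_t = "CCAGAAGCTGGTTTTTGGCCAGGGG"
    four_t = "CCAGAAGCTGGTTTTGGCCAGGGG"
    demo = (
        "@r1\n" + five_t + "\n+\n" + "I" * len(five_t) + "\n"
        "@r2\n" + reverse_complement(five_t) + "\n+\n" + "I" * len(five_t) + "\n"
        "@r3\n" + four_t + "\n+\n" + "I" * len(four_t) + "\n"
    )
    for motif in (five_t, four_t):
        print(motif, count_reads_with_motif(io.StringIO(demo), motif))
```

Because the two motifs differ only in the length of the T run, the flanking bases on both sides are what keep one from matching reads that carry the other.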

Thank you for the prompt response. I now comprehend the unusual output.

To address your inquiry, the data I am working with is mouse scRNA data. By using a TRA-specific primer, most of the amplified reads correspond to TRA genes. Consequently, nearly all TRA genes were detected in the analysis.

However, I am still uncertain about your statement regarding the assembled TRAV gene having only 97.83% identity to the germline V gene. Since I am relatively new to the field of Immunology, I am curious to know what level of similarity is expected.

Once again, thank you for your invaluable assistance!

Best wishes,
Yuyu

In my experience, the identity is usually above 99% for TCR chains. But if your mouse is from a certain strain, I guess 97.83% is reasonable, and it's not bad.

One more thing you can check: does your data have UMI information? The high count for the out_of_frame assembly could come from a PCR artifact, and UMI-based counts can alleviate this bias.

For other cells, do you usually observe two types of TRA CDR3s? It could be that the kit did not capture the other functional TRA from this cell.

Thank you so much for your invaluable suggestion! Since my data does not contain UMI information, I have decided to remove duplicate reads before the assembly process. After this step, I will examine the assembly result once more.

Thanks a lot!

Best wishes.
Yuyu
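The thread does not say how the duplicate reads were removed. One simple possibility, shown here purely as an assumption, is to keep the first read for each exact sequence. The function name `dedup_fastq_records` is hypothetical, and the input is assumed to be a plain FASTQ with four lines per record. Note that exact-sequence deduplication also collapses genuinely independent molecules with identical sequence and does not merge PCR duplicates that differ by a sequencing error, so it is only a rough substitute for UMI-based counting.

```python
def dedup_fastq_records(lines):
    """Yield FASTQ records (header, sequence, separator, quality) with unique sequences.

    The first record seen for each exact sequence is kept; later records
    with the same sequence are dropped.
    """
    seen = set()
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            seq = record[1].upper()
            if seq not in seen:
                seen.add(seq)
                yield tuple(record)
            record = []


if __name__ == "__main__":
    demo = [
        "@r1\n", "ACGT\n", "+\n", "IIII\n",
        "@r2\n", "ACGT\n", "+\n", "IIII\n",
        "@r3\n", "TTTT\n", "+\n", "IIII\n",
    ]
    for rec in dedup_fastq_records(demo):
        print("\n".join(rec))
```

For a real file, the same generator can be fed an open file handle and its output written to a new FASTQ before running the assembly.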