Understanding of the output files

yuyuleung opened this issue 5 months ago · comments

Dear Dr. Li,

First, I would like to thank you for providing such an efficient immune assembly tool. I have a few questions about the output files generated by TRUST4, so I have opened a new issue:

Based on the order of the output files, I understand that "to_assemble.fq" is generated first, followed by "assemble_reads.fq." However, I am unsure about the differences between these two files. Which file contains the reads ready for assembly, specifically those aligned to reference genes and retaining the unmapped reads with CDR3 motifs (as I have understood the logic of trust4)?
While examining the read counts in the statistics, I noticed that the "read fragment count" in "cdr3.out" is reported in decimal form. I am curious about the reason for this, especially considering that the similar statistic in ".report" is reported as a whole number. Additionally, I would like to confirm my understanding of this statistic. Does this number represent the count of reads used to assemble the full sequence of the corresponding consensus (e.g., "assemble0"), or does it only refer to the coverage of the CDR3 region?
Regarding the statistics in "anno.fa," I would like to understand how the third number, "average_coverage," is calculated. Is it the count of reads used to assemble the corresponding consensus divided by the consensus length?

Thank you in advance for your clarification and assistance.

Sincerely,
Yuyu

YuyuLiang · Answer 1 · Thu Feb 22 2024 17:57:59 GMT+0800 (China Standard Time)

By the way, these output files I have mentioned were generated in the bulk mode. Thanks again.

Li Song · Answer 2 · Thu Feb 22 2024 23:42:40 GMT+0800 (China Standard Time)

to_assemble is a very rough prediction of which reads could be candidates from the VDJ region, so it contains many non-VDJ reads. The assembled reads are the reads actually used in the assembly. They include both CDR3 reads and reads that may be fully contained in the V gene or C gene region.
Some reads can be ambiguously assigned to multiple CDR3s, so TRUST4 applies the EM algorithm to better estimate the CDR3 abundances, which is why the count can be fractional. It is the number of reads supporting this CDR3, not necessarily the number of reads used to assemble the full sequence.
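To illustrate why an EM-based estimate produces decimal read counts, here is a minimal generic sketch in Python. It is not TRUST4's actual code: the function name `em_abundance`, the input layout, and the toy reads are assumptions made only for illustration. Each ambiguous read is split among its compatible CDR3s in proportion to their current abundance estimates.

```python
def em_abundance(read_candidates, n_iter=50):
    """Estimate expected read counts per CDR3 with a simple EM loop.

    read_candidates: list of lists; each inner list holds the CDR3 ids
    (hypothetical labels) that one read is compatible with.
    """
    cdr3s = sorted({c for cands in read_candidates for c in cands})
    abundance = {c: 1.0 for c in cdr3s}  # uniform starting point
    for _ in range(n_iter):
        counts = {c: 0.0 for c in cdr3s}
        for cands in read_candidates:
            if not cands:
                continue  # a read with no compatible CDR3 contributes nothing
            total = sum(abundance[c] for c in cands)
            for c in cands:
                # E-step: fraction of this read credited to CDR3 c
                counts[c] += abundance[c] / total
        abundance = counts  # M-step: expected counts become the new estimate
    return abundance


# Toy data: 3 reads unique to A, 1 unique to B, 2 compatible with both.
reads = [["A"]] * 3 + [["B"]] + [["A", "B"]] * 2
result = em_abundance(reads)
print({c: round(v, 3) for c, v in result.items()})  # {'A': 4.5, 'B': 1.5}
```

The two ambiguous reads are shared 3:1 between A and B, so the supporting read counts come out as decimals even though six whole reads went in.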
The average_coverage is (sum_{read} read_length)/500, where the sum runs over the reads used to assemble the contig. It divides by 500 instead of the actual contig length to reduce the coverage overestimation bias for short contigs.
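As a worked example of the formula above, here is a small Python sketch. The helper name `average_coverage` and the read lengths are made up for illustration; only the division by the constant 500 comes from the explanation above.

```python
def average_coverage(read_lengths, denominator=500):
    """Sum of the assembled reads' lengths divided by a fixed constant."""
    return sum(read_lengths) / denominator


# Hypothetical contig assembled from 40 reads of 100 bp each.
read_lengths = [100] * 40
print(average_coverage(read_lengths))  # 8.0

# For comparison, dividing by a short contig's own length (say 250 bp)
# would double the reported value for the same reads.
print(sum(read_lengths) / 250)  # 16.0
```

With the fixed denominator, a short contig does not get a higher coverage value merely because it is short.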

Hope this helps.

YuyuLiang · Answer 3 · Fri Feb 23 2024 09:28:49 GMT+0800 (China Standard Time)

Dear Prof. Li,

Thank you so much for your detailed explanation. It is very helpful to me :).

Best wishes,
Yuyu