liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

Different outputs of 10x scData-TCR between MiXCR and TRUST4

yuyuleung opened this issue · comments

Dear Dr. Li,

Firstly, I would like to express my gratitude for creating such excellent tools.

I have been using TRUST4 to analyze several of my spatial TCR data sets (TRA and TRB are enriched separately). However, I have observed that the assembly result for TRB differs from that of TRA. For several data sets, despite having more VDJ reads for TRB, there are consistently fewer spots with a complete VDJ for TRB than for TRA. To check that this discrepancy is not specific to my data, I ran the same analysis on a 10x demo dataset. Interestingly, there were significantly fewer cells with complete-VDJ TRB than with complete-VDJ TRA, and the overall number of cells with complete-VDJ TCR was also much lower than in the results reported by 10x, analyzed by MiXCR, for the same dataset. This prompts me to ask about potential differences in the assembly process between TRA and TRB.

Additionally, I am curious to understand the definition of a complete VDJ according to TRUST4. In the case of 10x, the definition of VJ-spanning is described as the contig annotation extending from the 5' end of the V gene to the 3' end of the J gene. Are these definitions distinct from each other?

I am also not sure whether I have analyzed the data correctly, because that depends on how I understand the TRUST4 output, or whether there are details I should be aware of.

Thank you so much for your attention.
Best wishes,
Yuyu

Below are the 10x data I used and a summary of my analysis results:

Link of data: https://www.10xgenomics.com/cn/datasets/t-cells-from-bal-bc-mice-1-k-cells-multi-v-2-2-standard-5-0-0
10x report of such data: https://cf.10xgenomics.com/samples/cell-vdj/5.0.0/sc5p_v2_mm_balbc_T_1k_multi_5gex_t/sc5p_v2_mm_balbc_T_1k_multi_5gex_t_web_summary.html
TRUST4 (partial) result of such data:
trsut4_res.zip
Bash script used to run TRUST4:
run_trust4.txt
The version of TRUST4: v0.1.1
I have also summarized my comparison between them:
compare_res.zip

Thank you for the testing. I'll look into this issue. Meanwhile, with the version v1.1.0, I think you can run TRUST4 without "--repseq" option. Could you please test whether removing this option would yield better results? You can rerun TRUST4 with the option "--stage 1" to save some running time (you may need to rename some of your previous result files).

Thank you so much for such a quick reply. I will rerun it as you recommended and give you feedback as soon as possible. :)

Dear Dr. Li,

I have run TRUST4 on the same 10x dataset again, this time without the "--repseq" parameter. The results appear to be improved, but they still do not match the results presented in the 10x report. Additionally, the TRB assembly still seems to be worse than the TRA assembly.

Furthermore, I did not skip the read-extraction step this time. I therefore compared the mapped reads, including VDJ-mapped reads and CDR3-motif reads, from the two runs: one with "--repseq" and one without it. Interestingly, more candidate reads (mapped reads) are identified when "--repseq" is not used.

Could you kindly explain the purpose of the "--repseq" option? I would like to summarize the candidate reads using TRUST4 directly, to accurately quantify how well my TCR enrichment works. I am curious to know how I should set the options correctly to achieve this.

It would be great if you could take some time to look into this issue.

The new result of TRUST4:
trust4_res_new.zip
The new comparison I have summarized:
compare_res.zip

The "--repseq" option invokes aggressive trimming of the portion of a read that does not align well to the V, J, C genes during the assembly stage, which may lose some real read signal. This option is designed for bulk TCR/BCR-seq, for computational efficiency. It should not affect your extracted "toassemble" files, but the "assembled_reads" will definitely be affected.

I will look into the issue you found in the next few days. Thank you for sharing the results!

I noticed one possible reason for the different results from MiXCR and TRUST4 in your table. MiXCR's count is based on the productive count, where the underlying assembly might not be a complete VDJ (from 5' V to 3' J). TRUST4's count is based on completeVDJ, which is a much more stringent criterion. This also explains the strong impact of the "--repseq" option that you see, as the option may trim too much and lose the first few and last few base pairs of the completeVDJ sequence.

Regarding TRB and TRA, it seems there are more cells with a TRB CDR3 that can be assembled. Nevertheless, I think TRB should also have more completeVDJ than TRA. If possible, could you share the raw reads from a few cells (barcodes) in your data set?

Dear Dr. Li,

Thanks for your attention to this issue.

  1. I appreciate the possible reason you mentioned, but I am still confused. The definition of "cells with productive contig of TCR" seems to be more stringent than "complete VDJ" alone. It requires that a cell-associated barcode has at least one contig that spans the 5' end of the V region to the 3' end of the J region, has a start codon in the expected part of the V sequence, has an in-frame CDR3, and has no stop codons in the aligned V-J region. Although there may be different interpretations, I think the numbers I have summarized refer at least to cells with a complete VDJ. However, going by the report given by 10x, the MiXCR and TRUST4 results still differ.

  2. I am currently rerunning my own dataset without --repseq. Once the new results are available, I will share them with you as soon as possible.
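
The productive-contig criteria quoted in point 1 can be sketched as a single check. This is only an illustrative sketch, not Cell Ranger's or TRUST4's code: the function name, its arguments, and the simplification that the start codon sits exactly at the first base of the V annotation are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_productive(contig, v_start, j_end, cdr3_start, cdr3_end, spans_v_to_j):
    """Hypothetical check of the criteria quoted above.

    Coordinates are 0-based, end-exclusive offsets into `contig`.
    `spans_v_to_j` says whether the contig covers 5' V to 3' J.
    """
    # 1. The contig must span the 5' end of V to the 3' end of J.
    if not spans_v_to_j:
        return False
    region = contig[v_start:j_end].upper()
    # 2. Start codon in the expected part of V (assumed: first codon).
    if not region.startswith("ATG"):
        return False
    # 3. In-frame CDR3: it starts on a codon boundary relative to the
    #    start codon and its length is a multiple of three.
    if (cdr3_start - v_start) % 3 != 0 or (cdr3_end - cdr3_start) % 3 != 0:
        return False
    # 4. No stop codon in the aligned V-J region.
    usable = len(region) - len(region) % 3
    codons = [region[i:i + 3] for i in range(0, usable, 3)]
    return not any(codon in STOP_CODONS for codon in codons)
```

Under this reading, a contig can fail the productive test while still being a complete VDJ (for example because of a stop codon), but not the other way round.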

Thanks again for your attention and patience.

Best wishes,
Yuyu

The 10x Genomics definition of "productive" is based on completeVDJ sequences. Other tools may have a different definition. Could you please also share MiXCR's output? Thank you.

I just noticed you already shared the data source. I'll try to recreate the issue, so you don't need to extract reads from some cells.

Dear Dr. Li,

MiXCR's output can be downloaded from this link: https://www.10xgenomics.com/cn/datasets/t-cells-from-bal-bc-mice-1-k-cells-multi-v-2-2-standard-5-0-0.

Thanks again :)

Oh, I don't think that's MiXCR's output; it should come from "cellranger vdj". In that case, productive does indeed imply completeVDJ.

I think I roughly know which part may cause this issue. This might take some time to fix. I'll try to do it in the next few days.

That would be great! Thank you so much for your effort! I am looking forward to hearing from you. :)

I think I've improved the results, and I can now get more TRB completeVDJ than TRA. Both numbers have also improved from before. The new code may have other impacts, so I need some time to do more tests. In the meantime, could you please check out the code on the "barcode_kmercount" branch on GitHub and give it a try? Thank you!

Thank you Dr. Li. Yes, of course, I can test it :)

However, I am afraid I cannot easily install TRUST4 directly from the .zip file, because the zlib required by TRUST4 is missing from my HPC account and I do not seem to have permission to install it. So far I have only ever managed to update TRUST4 through conda :(. But I will try again this time. Do you have any other idea for how I could install TRUST4 in this case?

Thank you so much again for your effort.

Yuyu

I think you can try "make" in the conda environment where you installed TRUST4.

I have tried many times in my conda environment, but it always shows the same error: ReadFiles.hpp:7:10: fatal error: zlib.h: No such file or directory.

Last time, I installed it successfully by commenting out several lines in the Makefile that refer to bamextractor, because only samtools depends on zlib. But since v1.1.0 that no longer works.

Could you please try adding "-I$conda_path/envs/trust4/include/" (you need to replace conda_path with the one on your system) to the "LINKPATH=xxxx" line in the Makefile?

Thank you for your tips. I have finally figured out the installation of zlib now :). However, there is still an error related to the bam-extractor. I have skipped it for now so that I can test the assembly directly. Thanks :)

Just in case you need it in the future. For bam-extractor, you need to modify the Makefile of the samtools library in the folder.

Dear Dr. Li,

I have tested the new update for 10x data, and I am pleased to report that the number of cells with complete VDJ for TRA and TRB has increased. In particular, the number of cells with complete VDJ for TRB has significantly risen, with a total of 809 cells. This is a great improvement.

I am curious to know which specific aspect you changed to achieve this. I am also concerned about whether this change affects the confidence level of the assembly.

Btw, I am also testing this update on my own data :). When the results come out, I will give you feedback on them as well.

Thank you so much!

Best wishes,
Yuyu

> I am curious to know which specific aspect you changed to achieve this. I am also concerned about whether this change affects the confidence level of the assembly.

The main change is in the k-mer counting step. TRUST4 first sorts the reads based on their k-mer count, to prioritize reads that are likely from more abundant transcripts or more trustworthy sources. In a single-cell setting, this means that each cell may prioritize reads from clonally expanded cells. I guess there is some leaked mRNA during the assembly, so the reads from a cell can be contaminated by other cells. In this case, it may prioritize the wrong reads and cause a fragmented assembly. The change I made is to do another round of barcode-wise k-mer counting and to prioritize reads with a high barcode-wise k-mer count. If there is a tie, I prioritize the reads with the higher global k-mer count.
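
As a rough illustration of the ordering described above (not the actual TRUST4 code), the following sketch sorts reads by a barcode-wise k-mer score and breaks ties with a global k-mer score. The function names, the choice of k, and the use of the mean k-mer count as a read's score are assumptions made for the example.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def prioritize_reads(reads, k=4):
    """Sort (barcode, sequence) reads, highest priority first.

    Primary key: mean count of the read's k-mers within its own barcode.
    Tie-breaker: mean count of the read's k-mers across all barcodes.
    """
    global_counts = Counter()
    barcode_counts = defaultdict(Counter)
    for barcode, seq in reads:
        for kmer in kmers(seq, k):
            global_counts[kmer] += 1
            barcode_counts[barcode][kmer] += 1

    def score(read):
        barcode, seq = read
        read_kmers = kmers(seq, k)
        if not read_kmers:  # read shorter than k
            return (0.0, 0.0)
        local = sum(barcode_counts[barcode][km] for km in read_kmers)
        overall = sum(global_counts[km] for km in read_kmers)
        return (local / len(read_kmers), overall / len(read_kmers))

    return sorted(reads, key=score, reverse=True)

# Barcode "A" holds two copies of its own read plus one read leaked
# from the expanded clone in barcode "B".
reads = [
    ("A", "ACGTACGT"), ("A", "ACGTACGT"), ("A", "TTTTGGGG"),
    ("B", "TTTTGGGG"), ("B", "TTTTGGGG"), ("B", "TTTTGGGG"),
]
ordered = prioritize_reads(reads)
```

In this toy input the leaked read has the highest global k-mer count, so a purely global ordering would place it ahead of barcode A's own reads. With the barcode-wise key it ends up last.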

> Btw, I am also testing this update on my own data :). When the results come out, I will give you feedback on them as well.

Thank you for the testing! I'm looking forward to the results. I'm also testing some other data sets to see what the impact is on regular 5' scRNA-seq data.

Dear Dr. Li,

Thank you so much for your explanation. I think the new update should be more reasonable for spatial data, where there may be no dominant clonally expanded spot? Would the assembly be better performed separately for each spot?

I have also tested my data, and they show the same trend: there are more spots with a complete VDJ for TRA/TRB. :) I will also compare the assembled results further.

Additionally, I wonder whether there is a reasonable way to evaluate the confidence of the assembled clonotype of a spot. Previously, I just used the CDR3 abundance or the average coverage of each contig to evaluate it. Do you have any suggestions?

Thanks again for your effort.

Best wishes,
Yuyu

I think average coverage is a good criterion for completeVDJ assemblies. Since you are more interested in TCR data, the V and J gene identity could also be useful, as you would not expect to see very low germline alignment similarity. A more complex way is probably to check the _final.out file, which provides base-wise read support information.
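
A minimal sketch of combining these two criteria is shown below. It assumes the per-contig values have already been parsed into dictionaries. The field names and the thresholds are hypothetical placeholders, not TRUST4 output columns or recommended cut-offs.

```python
def is_confident(contig, min_coverage=5.0, min_identity=95.0):
    """Keep a contig only if coverage and germline identity are high.

    `contig` is a dict with hypothetical keys: "avg_coverage",
    "v_identity" and "j_identity" (identities in percent).
    """
    return (
        contig["avg_coverage"] >= min_coverage
        and contig["v_identity"] >= min_identity
        and contig["j_identity"] >= min_identity
    )

def confident_contigs(contigs, **thresholds):
    """Filter a list of contig dicts with is_confident."""
    return [c for c in contigs if is_confident(c, **thresholds)]

# Made-up example records for three spots.
example = [
    {"spot": "s1", "avg_coverage": 12.0, "v_identity": 99.1, "j_identity": 100.0},
    {"spot": "s2", "avg_coverage": 2.5, "v_identity": 99.0, "j_identity": 100.0},
    {"spot": "s3", "avg_coverage": 20.0, "v_identity": 88.0, "j_identity": 97.0},
]
kept = confident_contigs(example)
```

The thresholds would need to be tuned per data set, for example by looking at the distribution of coverage across spots before choosing a cut-off.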

Dear Dr. Li,

Thanks for the tips. I will move to these two criteria then. :)

Yuyu