liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

the rule of barcode correction

yuyuleung opened this issue · comments

Hello,

I am using TRUST4 on my spatial RNA-seq data, which has 20 bp barcodes. However, I am not clear on how TRUST4 handles single-base errors within these 20 bp barcodes. In other words, when two barcodes that stand for the same spatial spot differ by a single base, will TRUST4 treat them as the same barcode?

Thank you so much!

Best wishes,
Yuyu

The correction is to match with a barcode whitelist. It corrects one mismatch, and if there is a tie, TRUST4 will use the whitelist barcode frequency observed in the dataset from those perfect barcodes to break ties.

If there is no whitelist given, TRUST4 will not conduct error correction.
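For readers who want to see the rule spelled out, here is a minimal sketch of one-mismatch correction against a whitelist, with ties broken by how often each whitelist barcode was seen as a perfect match. This is not TRUST4's actual code: the function name `correct_barcode`, the toy 4 bp barcodes, and the alphabetical fallback for candidates with equal counts are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"

def correct_barcode(barcode, whitelist, perfect_counts):
    """Return the whitelist barcode for `barcode`, or None if none is within one mismatch."""
    if barcode in whitelist:
        return barcode
    # Collect every whitelist barcode exactly one substitution away.
    candidates = set()
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                neighbor = barcode[:i] + base + barcode[i + 1:]
                if neighbor in whitelist:
                    candidates.add(neighbor)
    if not candidates:
        return None
    # Tie-break by the frequency of perfect matches; sorting first makes
    # equal-count ties deterministic (an assumption of this sketch).
    return max(sorted(candidates), key=lambda c: perfect_counts[c])

whitelist = {"AAAA", "AAAT", "CCCC"}
observed = ["AAAA", "AAAA", "AAAT", "AAAG", "CCCG", "GGGG"]
# Frequencies come only from barcodes that already match the whitelist.
perfect_counts = Counter(b for b in observed if b in whitelist)
corrected = [correct_barcode(b, whitelist, perfect_counts) for b in observed]
```

In this toy input, AAAG is one mismatch away from both AAAA and AAAT, and AAAA wins because it was observed twice as a perfect barcode.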

Thank you so much for your answer.

When I provide the barcodeTranslate file, in which two different barcodes stand for the same spot, will TRUST4 also treat them as the same spot?

Thanks again.
Yuyu

Do you mean that in the barcode translate table, two barcodes, say A and B, will be translated into the same barcode C? In that case, TRUST4 will just output barcode C.

Thank you so much for your answer.
I think I have confused you, sorry.
For example, the following is my barcode translate table:

A ATCGATCGATCG <-
A TTCGTTCGATCG <-
B ATCGTTTTCCCC
......

In other words, two barcodes (marked by arrows), which may differ by two bases, stand for the same spot A. Will TRUST4 also output A?

Thank you so much!
Yuyu

Yes, both barcodes will be translated into A, and A will be in the final output.
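The many-to-one translation discussed above can be pictured as a simple lookup. The sketch below assumes the two-column layout shown in the example table (label first, barcode sequence second); `load_translate_table` is a hypothetical helper for illustration, not part of TRUST4.

```python
def load_translate_table(lines):
    """Map each raw barcode sequence to its label (label first, sequence second)."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        label, sequence = fields[0], fields[1]
        table[sequence] = label
    return table

example = [
    "A ATCGATCGATCG",
    "A TTCGTTCGATCG",
    "B ATCGTTTTCCCC",
]
table = load_translate_table(example)
# Both sequences listed for spot A resolve to the same label.
labels = [table[seq] for seq in ("ATCGATCGATCG", "TTCGTTCGATCG", "ATCGTTTTCCCC")]
```

Because the lookup is keyed on the raw sequence, it does not matter how many bases the two sequences for spot A differ by.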

Just want to confirm: do the two barcodes ATCGATCGATCG and TTCGTTCGATCG still correspond to two cells in your data, and does it just mean they are from the same spatial spot A?

Thank you so much for your answer.
According to my data, the two barcodes should be the same spatial spot A, which can also be regarded as a cell.
Therefore, can I just provide the barcode translate table instead of a barcode whitelist, so that similar barcodes will be translated to the same spot and the correction step can be skipped?

Thanks again, and I wish you a Happy New Year in advance!
Yuyu

Yes, the correction step will be skipped and the output will report the representative TCR or BCR for spot A.
Just curious, why do you have two barcodes for one spot?

Thanks a lot.

Because my original single spots are small, and I want to merge several small spots into a bigger one (similar to a cell), so that it runs faster.

Yuyu

Hello,

I have attempted to utilize TRUST4 for analyzing my spatial-mRNA data. However, I am concerned that I might be misunderstanding the outputs.

  1. *_report.tsv: I have noticed that the first column in this file records the number of barcodes for the CDR3 rather than the read count. However, the cid column contains only a single "cid/barcode". Could you please clarify what this "cid" represents?

  2. An additional number following the translated barcode: In my barcodeTranslate list, I have translated several cid sequences into a new "cid" label, which follows a coordinate format (e.g., x_y). However, I have come across translated cids that have an additional number appended to them, like x_y_num. I am unsure what this number means.

  3. The number of reads used for barcode assembly: Is there an output file where I can find the read count used for assembling a CDR3 for each barcode?

At present, these are the three points where I am seeking clarification. I greatly appreciate your assistance in addressing these questions.

Best regards,
Yuyu

Just now, I have found a warning during annotation:

Use of uninitialized value $tmpGermline[43] in string ne at ./TRUST4/trust-airr.pl line 368, line 423598. (There are many rows of the same warning.)

I am not sure where this warning comes from.

By the way, I would like to ask whether there is any way to efficiently assemble around 500,000 barcodes, each with about 10,000 reads.

Thank you so much for your patience.

Best wishes,
Yuyu

1,2. cid is the contig id/name. It contains the barcode information if the data is single-cell. Since multiple contigs can be assembled for a cell, the last number in the ID distinguishes the contigs from the same cell. The IDs are useful for linking the CDR3s to the underlying sequences in the annot.fa file.
3. I think you can look at the barcode_report or barcode_airr file, which lists the clonotype information for each barcode and includes the number of reads supporting the CDR3s.
4. This could be related to a format error in one of the files. Could you please download the newest version of TRUST4 and give it a try?
5. Are these 10K reads from VDJ-targeted amplification?
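To illustrate the contig id structure described in points 1 and 2, the sketch below splits an id such as x_y_num into the barcode part and the trailing contig number, then groups contigs by barcode. It assumes the trailing number is always separated by the last underscore, as in the x_y_num example; `split_cid` and the sample ids are hypothetical.

```python
from collections import defaultdict

def split_cid(cid):
    """Split a contig id such as '12_34_0' into (barcode, contig_index)."""
    barcode, _, index = cid.rpartition("_")
    if not barcode or not index.isdigit():
        raise ValueError(f"unexpected contig id: {cid!r}")
    return barcode, int(index)

# Group contig indices by the barcode (spot label) they belong to.
contigs_per_barcode = defaultdict(list)
for cid in ["12_34_0", "12_34_1", "7_9_0"]:
    barcode, index = split_cid(cid)
    contigs_per_barcode[barcode].append(index)
```

Splitting on the last underscore rather than the first keeps coordinate-style labels like 12_34 intact.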

Thank you very much for your detailed answer.

Regarding the fifth point: yes, they are all reads from VDJ-targeted amplification. However, I have not aligned them. Do you recommend aligning these reads first? Is STAR a suitable tool for this, or which alignment tool would you suggest for VDJ data?

Thank you so much again!

Best wishes,
Yuyu

I don't think you need to do alignment first. Since these are enriched VDJ reads, you may try the --repseq option to speed up the process.

Thank you so much for your help. I will try them again with this option.

Best wishes,
Yuyu

Hello,

I recently downloaded the latest version of TRUST4 via conda. However, I ran into an issue when attempting to assemble with the parameter --readFormat bc:0:15. The error message was: Unknown parameter --readFormat.

I was wondering if you could kindly assist me in troubleshooting this problem?

Thank you so much!

Best wishes,
Yuyu

Could you please show me your full command?

Yes. The following is my full command:
run-trust4 \
  -u ./test_assemble_1_read_2.fq.gz \
  -t 8 \
  --barcode ./test_assemble_1_read_1.fq.gz \
  --readFormat bc:0:15 \
  --barcodeTranslate ./test_assemble_1_barcodeTranslate.tsv \
  -f ./TRUST4/mouse/GRCm38_bcrtcr.fa \
  --ref ./TRUST4/mouse/mouse_IMGT+C.fa \
  --repseq \
  -o ./test_assemble_1_res

*The format of test_assemble_1_barcodeTranslate.tsv is: "x_y" \t "15bp_barcode"
*The newest version of TRUST4 was installed with the command "conda install -c bioconda trust4"
*When I just downloaded the .zip and unzipped it, the command worked fine, apart from the warning I mentioned.

Moreover, I want to ask about the output again.

  1. As you explained, the count reported in report.tsv (the 1st column) is the number of barcodes/cells containing the corresponding CDR3. However, why is only one cid listed in the 'cid' column? Is it the cid with the highest score?
  2. Is the 'consensus_count' in barcode_airr.tsv the read count used to assemble the corresponding CDR3 of a cell/barcode?

Thank you so much again!

Best wishes,
Yuyu

Could you please run "./run-trust4" without any parameters? It will show the version number. Sometimes you need to specify the version in conda to get the newest one.

For other questions:

  1. That cid is the representative contig. It should be the contig with the most reads covering the corresponding CDR3 region.
  2. Yes.

Here is the output of "run-trust4":
[screenshot]

I am sorry, I have just checked the version of TRUST4, and it is 1.0.5.1. I will try again to install the latest one.

Thank you so much again for your patience!

Best wishes,
Yuyu

This is indeed an older version, 1.0.5. You can try something like conda install -c bioconda trust4=1.0.13 to install the latest version. Otherwise, you can also download the GitHub version. TRUST4 does not have many dependencies, so it is straightforward to compile from source.

Okay! Thank you so much for your tips! :)
Yuyu

Hello, I apologize for the trouble, but I have another question regarding the results of multi-barcode assembly.

I have attempted to assemble the sequences from one specific barcode on its own. However, I have noticed that the final read count used for assembling the same V gene, J gene, and CDR3 (clonotype) differs from the corresponding barcode's "consensus_count" recorded in barcode_airr.tsv, even though the final assembled clonotype is the same.

The following is my result:
Single-barcode assembly (top-1 assembly with 2009 reads):
[screenshot]

Multiple-barcode assembly (the corresponding consensus_count is 827 reads):
[screenshots]

Can you please assist me in explaining these differences? I would like to know which read count would be more suitable for determining the reliability of the final assembly.

Thank you so much again for your assistance.

Best wishes,
Yuyu

Does the "single-barcode assemble" run mean that you extracted all the reads from that barcode and ran them in bulk RNA-seq mode? The assembly procedure differs slightly between single-cell mode and bulk mode, for example in the contig overlap criteria. Different underlying contigs will result in different abundance estimates.

The Single-Barcode assembly was performed in bulk mode.

Could you please explain how the assembly procedure differs between single-cell mode and bulk mode (maybe with a simple example)? I am particularly using the read count used for assembly to determine the reliability of the final assembly. For instance, if there are 100 reads associated with a specific barcode, but only 1 read is used to determine the CDR3, I would consider this resulting CDR3 as potentially inaccurate due to the possibility of bias.
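The reliability check described here could be sketched as a simple filter over barcode_airr.tsv that flags barcodes whose CDR3 is supported by very few reads. The `consensus_count` column is the one named in this thread; the `cell_id` column name, the `min_reads` cutoff, and the helper `low_support_rows` are assumptions for illustration and should be checked against the actual file header.

```python
import csv
import io

def low_support_rows(tsv_text, min_reads):
    """Return the cell ids whose consensus_count falls below min_reads."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = []
    for row in reader:
        # Parse via float so that integer and decimal counts both work.
        if float(row["consensus_count"]) < min_reads:
            flagged.append(row["cell_id"])
    return flagged

# Toy table with only the two columns this sketch needs.
example = "cell_id\tconsensus_count\n10_20\t827\n10_21\t1\n"
flagged = low_support_rows(example, min_reads=5)
```

A ratio against the total reads per barcode, as in the 1-out-of-100 example above, would need a per-barcode read total from elsewhere, which this sketch does not attempt.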

Thank you so much.

Best wishes,
Yuyu

There could be many subtle differences. One is that the read sorting will be different. The whole data set has a different k-mer abundance distribution, so the read sorting and the downstream assembly will differ. Furthermore, in single-cell mode, we use a more lenient threshold when determining whether a read overlaps with a contig. In bulk mode, the threshold is at least 21 bp (also depending on the read length); in single-cell mode, it can be as low as 13 bp.

Thank you so much for your explanation.

I am now unsure which mode is suitable for spatial sequencing data. Should I still use single-cell mode, or assemble each spot as bulk (although I have more than 5k spots)? I think the final top-1 assembly for each barcode should be the same, while the number of reads used for assembly is calculated differently.

However, I am confused as to why I am observing a lower number of reads used to assemble the same clonotype for the specific barcode in single-cell mode than in the bulk mode, even when the threshold (used to determine whether a read overlaps with a contig) is set low?

Best wishes,
Yuyu

Does one spot contain a single cell or multiple cells in your data? Ideally, it shall be in single-cell mode.

I think it is mainly affected by the read order, which may put some other contigs first to be assembled, which will drag away some alignments. Do you see this abundance discrepancy in many other spots as well?

Yes, each spot should be considered as a single cell.

I have recently made a comparison for a particular barcode. However, I plan to examine more barcodes to gain a deeper understanding of their differences.

Once again, thank you sincerely for providing such a detailed explanation.

Best regards,
Yuyu

Dear Professor Li,

Thank you once again for your detailed explanation.

First, I would like to explain an extra step I have taken for my spatial RNA-seq data. Before running the TRUST4 assembly in single-cell mode, I split this large data set (.fq) into 5 parts according to the barcodes, to speed up the assembly step. I then ran the TRUST4 assembly in single-cell mode on each of the 5 parts separately. Finally, I merged the assembly results.

However, I have several questions regarding the results obtained from TRUST4 in single-cell mode.

  1. I noticed that the number of assembled reads (merged from the 5 assembled_reads.fa files) was significantly lower than the alignment results obtained from bowtie2/STAR. According to your explanation in another issue, these assembled reads should include both CDR3 reads and reads that may be fully contained within the V gene or C gene regions. I initially expected the number of assembled reads extracted by TRUST4 to be higher than the number identified by general alignment tools, since TRUST4 can potentially "save" more reads containing the CDR3 motif that general alignment tools typically report as "unmapped".

  2. I performed two separate TRUST4 assemblies on the same data set but using different spot definitions (different sizes of spot). Upon comparing the total number of assembled reads between these two assemblies, I noticed slight variations. This leads me to wonder whether the step of extracting candidate reads also takes the barcode into consideration.

Thank you so much for your assistance.

Best wishes,
Yuyu

  1. How do you count the alignment results from bowtie2/STAR? Since TRUST4 only assembles the first few hundred bases of the C genes, many reads in the later part of the C gene will not be included in the assembly. Therefore, depending on your kit, for example if it has some 3' bias, it's possible that the number of reads from the V, CDR3, and J regions is much smaller than the number of reads from the C genes.
  2. I'm not sure I understand the spot definition here. The candidate reads extraction step does not take the barcode into account.

Dear Prof. Li,

Thank you so much for your response.

  1. I have just merged 5 assembled_reads.fa files and counted the reads. Next, I will verify if the majority of my reads are mapped to the C region.

  2. Regarding the spot definition: for a large spot, I assigned more barcode sequences to one cell id in barcodeTranslate.tsv, whereas for smaller spots, fewer barcode sequences were assigned to one cell id. With this difference, there is a slight discrepancy in the number of assembled reads (e.g. 4,148,542 and 4,150,315). Although the discrepancy is minimal, I would like to make sure that I have run TRUST4 correctly.

Thank you once again for your patience and assistance.

Best regards,
Yuyu

I guess some of the barcodes fail to translate and are marked as "missing_barcode" in the toassemble_bc.fa file. These reads will be filtered out in the assembly by default.
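A quick way to check for untranslated barcodes is to count the "missing_barcode" entries in toassemble_bc.fa. The sketch below assumes the file is FASTA-like, with a header line starting with ">" followed by a line holding the translated barcode; that layout and the helper name `count_missing_barcodes` are assumptions, so please verify against your own file.

```python
def count_missing_barcodes(lines):
    """Return (missing, total) over the non-header lines of a FASTA-like barcode file."""
    total = 0
    missing = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and read-id headers
        total += 1
        if line == "missing_barcode":
            missing += 1
    return missing, total

# Toy input; for a real file, pass an open file handle instead of a list.
example = [">read1", "10_20", ">read2", "missing_barcode", ">read3", "10_21"]
missing, total = count_missing_barcodes(example)
```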

Thanks for the quick answer and tips.
I have just checked them. There is no "missing_barcode" in any of the toassemble_bc.fa files. I think that is reasonable, because I have defined all given barcode sequences in barcodeTranslate.tsv. :)

Dear Prof. Li,

I am sorry to bother you again about the alignment step. I noticed that some of the reads were identified as VDJC genes by bowtie2 but not by TRUST4. Upon further investigation, I found that most of these reads are not in the C regions but in the V regions, and their mapping scores were relatively high. I have a few examples of these reads, and I was wondering if you could help me understand them better. The species I'm working with is human, and I used the hg38_bcrtcr.fa reference file provided in the TRUST4 package.

read1 (mapped to TRAV9-2 by bowtie2 but not recognized by TRUST4)
CTTAGTATCTCTGATACCCTTACTGCTTGGAAGAACCCGTGGAAATTCAGTGACCCAGATGGAAGGGCCAGTGACTCTCTCAGAAGAGGC
TRAV9-2
CACTGTGATTTCTTCATGTTAAGGATCAAGACCATTATTTGGGTAACACACTAAAGATGAACTATTCTCCAGGCTTAGTATCTCTGATACTCTTACTGCTTGGAAGAACCCGTGGAGATTCAGTGACCCAGATGGAAGGGCCAGTGACTCTCTCAGAAGAGGCCTTCCTGACTATAAACTGCACGTACACAGCCACAGGATACCCTTCCCTTTTCTGGTATGTCCAATATCCTGGAGAAGGTCTACAGCTCCTCCTGAAAGCCACGAAGGCTGATGACAAGGGAAGCAACAAAGGTTTTGAAGCCACATACCGTAAAGAAACCACTTCTTTCCACTTGGAGAAAGGCTCAGTTCAAGTGTCAGACTCAGCGGTGTACTTCTGTGCTCTGAGTGA

read2 (mapped to TRAV13-1 by bowtie2 but not recognized by TRUST4)
GGAGGGAGACAGCGCTGTTATCAAGTGTACTTATTCAGACAGTGCCTCAAACTACTTCCCTTGGTATAAGCAAGAACTTGGAAAAGGACC
TRAV13-1
GATCTTAATTGGGAAGAACAAGGATGACATCCATTCGAGCTGTATTTATATTCCTGTGGCTGCAGCTGGACTTGGTGAATGGAGAGAATGTGGAGCAGCATCCTTCAACCCTGAGTGTCCAGGAGGGAGACAGCGCTGTTATCAAGTGTACTTATTCAGACAGTGCCTCAAACTACTTCCCTTGGTATAAGCAAGAACTTGGAAAAGGACCTCAGCTTATTATAGACATTCGTTCAAATGTGGGCGAAAAGAAAGACCAACGAATTGCTGTTACATTGAACAAGACAGCCAAACATTTCTCCCTGCACATCACAGAGACCCAACCTGAAGACTCGGCTGTCTACTTCTGTGCAGCAAGTA

I have observed that bowtie2 identifies about twice as many mapped reads as TRUST4 does. This discrepancy concerns me, and I would appreciate your input on this matter. Thank you!

Best wishes,
Yuyu

Thank you for scrutinizing this step. If you directly give the two reads as input to TRUST4, they will be recognized. How did you search for the reads? Is your data paired-end? TRUST4 merges read pairs, so if you are searching by read sequence, you may not be able to find it directly. Another possibility is the read quality. There is some read quality trimming inside TRUST4. This can either make the read too short to be used in the assembly stage, or affect your read search (if you are searching by read sequence).

Dear Prof. Li,

Thank you so much for the answer.

  1. How did you search the reads?
    I performed alignment using bowtie2 and the complete TRUST4 pipeline (alignment + assembly + annotation) on the same .fq file. Then, I summarized the mapped reads (fqid) identified by bowtie2 and the candidate reads (fqid) identified by TRUST4 (in the assembled_reads.fa). Finally, I found the reads identified as VDJC genes by bowtie2 but not by TRUST4 in the difference set between these two sets of fqid.

  2. Is your data paired-end?
    No, my data is single-end RNAseq, where read1 only contains the barcode sequence.

  3. Another possibility is the read quality.
    I have not examined the read quality yet. Thank you for the suggestion. I will consider the read quality in the next step. By the way, could you please explain more about the quality trimming, such as the threshold for trimming reads based on quality? Then I can check whether these reads were trimmed by TRUST4.
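The comparison described in the first answer (read ids mapped by bowtie2 minus read ids present in assembled_reads.fa) can be sketched as a set difference. The SAM handling relies only on the standard QNAME and FLAG columns (bit 0x4 marks an unmapped read); the assumption that the read id is the first whitespace-delimited token of each FASTA header, the helper names, and the toy records are illustrative only.

```python
def fasta_read_ids(lines):
    """Collect read ids from FASTA headers (first token after '>')."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids

def sam_mapped_ids(lines):
    """Collect QNAMEs of SAM records whose FLAG lacks the 0x4 (unmapped) bit."""
    ids = set()
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        if not int(fields[1]) & 0x4:
            ids.add(fields[0])
    return ids

sam = [
    "@HD\tVN:1.6",
    "r1\t0\tTRAV9-2\t10\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tTRAV13-1\t5\t40\t4M\t*\t0\t0\tACGT\tIIII",
]
fasta = [">r1 0 10 99", "ACGT"]
only_bowtie2 = sam_mapped_ids(sam) - fasta_read_ids(fasta)
```

Comparing by read id rather than by sequence sidesteps the trimming and pair-merging caveats raised earlier in the thread.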

Thanks again for your assistance and patience.

Best wishes,
Yuyu

I just checked the running command you posted before, I think the command should be:

run-trust4 \
  -u ./test_assemble_1_read_2.fq.gz \
  -t 8 \
  --barcode ./test_assemble_1_read_1.fq.gz \
  --readFormat bc:0:15,r1:16:-1 \
  --barcodeTranslate ./test_assemble_1_barcodeTranslate.tsv \
  -f ./TRUST4/mouse/GRCm38_bcrtcr.fa \
  --ref ./TRUST4/mouse/mouse_IMGT+C.fa \
  --repseq \
  -o ./test_assemble_1_res

You should also add r1:16:-1 to readFormat; otherwise the read itself will include the barcode sequence. Although the earlier command was for mouse data, I guess you used the same command this time?
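To make the effect of a spec like bc:0:15,r1:16:-1 concrete, here is a small sketch that slices one sequence into a barcode part and a read part. It assumes 0-based coordinates with an inclusive end and -1 meaning "to the end of the read", which is what the adjacent 0:15 and 16:-1 ranges suggest; this is an illustration, not TRUST4's parser, and the helper names are hypothetical.

```python
def parse_read_format(spec):
    """Parse 'bc:0:15,r1:16:-1' into {'bc': (0, 15), 'r1': (16, -1)}."""
    fields = {}
    for part in spec.split(","):
        name, start, end = part.split(":")
        fields[name] = (int(start), int(end))
    return fields

def extract(sequence, start, end):
    """Slice with an inclusive end; end == -1 means 'to the end of the read'."""
    return sequence[start:] if end == -1 else sequence[start:end + 1]

fmt = parse_read_format("bc:0:15,r1:16:-1")
read = "A" * 16 + "CCCCGGGG"  # toy read: 16 barcode bases, then the insert
barcode = extract(read, *fmt["bc"])
insert = extract(read, *fmt["r1"])
```

Under this reading, bc:0:15 covers 16 positions, so it may be worth confirming the exact range for a 15 bp barcode against the TRUST4 documentation.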

Oh, yes, I have always used this command.

To be clear, my data look like this: read1 is 15 bp (barcode only), and read2 is 90 bp (TCR only).

Therefore, although my read2 does not include the barcode sequence, do I still have to add r1:16:-1 to make that explicit?

Thank you so much! This is definitely important to me!

Yuyu

I see. I thought both of them were from read1. You don't need the r1:16:-1 then.

I just remembered you used the "repseq" option. There might be a bug in processing TCR region in this mode. Let me look into this issue. Thank you for providing the example!

Thank you for your continued support and feedback too! I will also look into the read quality! :)
I would like to provide my example fq file (1000 reads) so that you can test the process.
(PS: What I have done to the reads is trim the adapters and deduplicate identical reads within a spot.)
test_fq.txt
Best wishes,
Yuyu

I think I've found the issue and pushed an update to the main branch. Could you pull the github repo, recompile trust4 and give it a try?

Thank you once again for your support.
No problem. I'll let you know as soon as possible if it gets better.

Dear Prof. Li,

By the way, is it possible to run only the first step, namely read extraction?

Thanks

If you have the log/output on screen from the run-trust4, you can find the command for each step. You can get the command of running "fastq-extractor" there. The updated code does not affect the read extraction stage though.

Dear Dr. Li,

I have tested the updated code. Initially, there were approximately 4M assembled reads, but now there are around 9M reads, slightly exceeding the recognition of bowtie2. Based on these results, I believe the outcome is reasonable.

Thanks a lot for your support throughout this process.

Thank you once again!

Best regards,
Yuyu

By the way, is it possible to make this update available through conda? When I attempted to install TRUST4 on Linux from the zip file, the zlib package was missing, so samtools could not be built properly. As a workaround, I had to comment out the steps related to bamextractor (namely the steps involving samtools) in the Makefile in order to proceed with the other steps.

Could you please consider incorporating this update into Conda as it would ensure a smoother installation process and avoid such complications?

Thank you for your attention to this matter.

Yuyu

How is the assembly speed after relaxing the filtering? I recall that I removed those reads because in TCR assembly we only need to know their V and J assignments, and there was no need to infer the full-length sequence.

Updating the conda version requires releasing a new version of TRUST4. I want to make sure there are no other urgent issues before drafting a new release.

Dear Prof. Li,

I have reviewed the time cost and it seems acceptable. Originally, it took about 22 minutes to complete the entire process for approximately 9M reads, while the updated version now takes about 30 minutes. I hope these data can help you.

For my project, I aim to assemble the full-length VDJ for TCR. Therefore, it would greatly benefit me if TRUST4 could retain all essential reads mapped to VDJ regions. It would be even better if TRUST4 could provide users with an overview of their raw input data, specifically the number of reads mapped to VDJ regions or containing CDR3 motifs. This feature would greatly help us assess the reliability of the final assembly result.

Thank you once again for your support.

Best wishes,
Yuyu

Thank you for the testing. It's not bad and makes sense to have the full-length TCR assembly.

"It would be even better if TRUST4 could provide users with an overview of their raw input data, specifically the number of reads mapped to VDJ regions or containing CDR3 motifs."

What's the difference between this and the abundance of the CDR3?

Dear Prof. Li,

Although I have not yet looked into the abundance of the CDR3, there were more spots in which a CDR3 was successfully assembled.
This week, I will continue looking into the abundance of the CDR3 and will keep you updated.

Best wishes,
Yuyu

Dear Prof. Li,

I have double checked the reads used to assemble again with the newest TRUST4 (the new update to the main branch). However, I have noticed that there are still certain reads, potentially mapping to TCR genes (TRAV and TRAJ), which were not included in the assembly. I can provide you with these reads in fq format, along with the mapping results obtained from STAR in sam format.

Additionally, I would like to mention that these reads were sequenced from human samples with TRA enrichment only. Prior to assembly, I performed several quality control steps, including the use of fastp to filter out low-quality reads (Q30>80%) and cutadapt to trim adapter sequences. As a result, the lengths of these reads vary.

Regarding the reasons why these reads were not used, I have a few hypotheses:

  1. Could it be due to a discrepancy between the threshold of quality filtering I applied and that used by TRUST4?
  2. Were the reads excluded because they were too short in length (most of them were shorter than 50bp)?
  3. Is it possible that the mapping results differ from those obtained by TRUST4? I noticed that a majority of these reads were mapped to the reference with soft-clip (S) or skip (N) operations in the CIGAR string.

I kindly request your assistance in investigating these matters further.

star_mapped2tra_read_2.fq.gz
mapped2tra.sam.zip

Best wishes,
Yuyu

Sorry for the delayed reply. You are right that the majority of the alignments have long introns "xxxN" or soft clips in the CIGAR field, so I think these reads will be aligned poorly to the TRA genes as a whole read and will be filtered by TRUST4. For the remaining reads, I checked some of them manually, and it seems they overlap a lot with the UTR regions, which are not part of the reference sequences. These reads will be filtered during the assembly stage if there is no valid contig for them to anchor on, i.e. the V gene is lowly expressed/used.
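For anyone wanting to screen a SAM file for such alignments, the sketch below summarizes a CIGAR string and flags records with a skipped region (N) or heavy soft clipping (S). The CIGAR operators follow the SAM specification; the 20% soft-clip cutoff and the helper names are arbitrary assumptions for illustration and do not reflect TRUST4's own filtering rules.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_totals(cigar):
    """Sum the lengths of each CIGAR operation, e.g. '5S85M' -> {'S': 5, 'M': 85}."""
    totals = {}
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals

def looks_partial(cigar, max_clip_fraction=0.2):
    """Flag alignments with any skipped region or a large soft-clipped fraction."""
    totals = cigar_totals(cigar)
    # Operations that consume read bases: M, I, S, = and X.
    query_length = sum(totals.get(op, 0) for op in "MIS=X")
    if query_length == 0:
        return False  # e.g. '*' for an unmapped read
    clipped = totals.get("S", 0)
    return totals.get("N", 0) > 0 or clipped / query_length > max_clip_fraction

flags = {c: looks_partial(c) for c in ["90M", "30M5000N60M", "40S50M", "5S85M", "*"]}
```

Reads flagged this way are the ones that align only in part to a TRA gene, which matches the category described in the reply above.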

The read length and the quality filtering should be fine.

Thank you sincerely for your answer. Then I think TRUST4 works truly fine. I am grateful once again for this useful tool and your support.