liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

How are Consensus IDs Created?

ejohnson643 opened this issue · comments

Hi! Thanks for making and maintaining this program!

I have a situation where I had to split my single-cell fastqs in half in order to run TRUST4 until I can get access to more computing resources, and I'm trying to wrap my head around merging them back together now. It seems like the consensus IDs are unique to each CDR3 (up to the _XXX number at the end); is this in fact the case? The idea is that I could recreate some of the output files by merging on the consensus ID. I'd appreciate any information about how these IDs are assigned and whether this makes sense. Thanks!

In the single-cell setting, the alphabetic part of the consensus ID is the cell ID, and the _XXX suffix is the index of the consensus within that cell. For example, the consensus sequences for the IGH and IGK chains from the same cell could be cell_0 and cell_1.
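As a small illustration of the naming scheme described above, the sketch below splits a consensus ID at its last underscore into the cell ID and the per-cell index. The ID used here is made up for the example, and the variable names are hypothetical.

```shell
# Hypothetical consensus ID: a cell ID followed by "_<index>".
consensus_id="AAACCTGAGCGATAGC_1"

# Everything before the last underscore is taken as the cell ID.
cell_id="${consensus_id%_*}"

# Everything after the last underscore is taken as the consensus index in that cell.
consensus_index="${consensus_id##*_}"

echo "cell_id=$cell_id index=$consensus_index"
```

Because the index restarts in every cell, the full ID (cell ID plus index) is what identifies a contig, not the index alone.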

I think a more convenient approach might be to split the reads by barcode, so that you can simply concatenate the reports at the end.
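If the reads are split by barcode so that no cell spans two runs, concatenating the per-run reports could look roughly like the sketch below. It assumes each report is a TSV with exactly one header line; `merge_reports` and the file names in the usage comment are hypothetical, not part of TRUST4.

```shell
# merge_reports OUT IN1 [IN2 ...]
# Copy the header line from the first input once, then append
# the data rows (everything after line 1) of every input.
merge_reports() {
    out="$1"
    shift
    head -n 1 "$1" > "$out"
    for f in "$@"; do
        tail -n +2 "$f" >> "$out"
    done
}

# Usage (hypothetical file names):
# merge_reports merged_barcode_report.tsv part1_barcode_report.tsv part2_barcode_report.tsv
```

This only stitches rows together. Any column that was normalized within a single run would not be recomputed for the merged file.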

Thank you for a very quick response!

I see, so then in the _report.tsv, the cid corresponds to the barcode with the most reads of that CDR3?

It seems like the _barcode_airr.tsv will be useful for me here. Can you confirm that the consensus-count column in that output is the number of reads (UMIs?) of that consensus contig in that barcode?

Thanks so much for your help!

It depends on the command you used to run TRUST4. If you provide the --UMI option, it will be the UMI count; otherwise it will be the raw read fragment count.

This is what I had run, but maybe what I'm understanding now after looking at other issues is that I may have needed ... -1 *R1* -2 *R2* --barcode *R1* --UMI --barcode *R1* --readFormat bc:0:15,um:16:-1 ...

!run-trust4 \
-f ../References/hg38_bcrtcr.fa \
--ref ../References/human_IMGT+C.fa \
-1 ../gdTCR/fastqs/*_R1_*.gz \
-2 ../gdTCR/fastqs/*_R2_*.gz \
--barcode ../gdTCR/fastqs/*_R1_*.gz \
--readFormat bc:0:15,um:16:-1,r1:0:-1 \
--barcodeWhitelist ./737K-august-2016.txt \
-t 4 \
-o ../gdTCR/output/

I appreciate your insight here, it's hard to parse these arguments from the documentation.
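For readers trying to parse the same argument, here is a rough illustration of how a spec such as bc:0:15,um:16:-1 would carve up a 28 bp read 1. It assumes the coordinates are 0-based and inclusive and that -1 means "to the end of the read", which is an assumption worth checking against the TRUST4 README. The read sequence is made up.

```shell
# Made-up 28 bp read 1: a 16 bp cell barcode followed by a 12 bp UMI.
read1="AAACCTGAGCGATAGCTTTGGGCCCAAA"

# bc:0:15 -> bases 0..15 (0-based, inclusive), i.e. characters 1-16 for cut.
barcode=$(printf '%s\n' "$read1" | cut -c1-16)

# um:16:-1 -> base 16 through the end of the read, i.e. characters 17 onward.
umi=$(printf '%s\n' "$read1" | cut -c17-)

echo "barcode=$barcode umi=$umi"
```

Under this reading, r1:0:-1 applied to the same file would treat the whole barcode-plus-UMI sequence as read sequence.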

It seems read 1 contains the barcode and UMI, so I'm wondering whether your data is actually single-end. If so, the parameters can be:

run-trust4 \
-f ../References/hg38_bcrtcr.fa \
--ref ../References/human_IMGT+C.fa \
-u ../gdTCR/fastqs/*_R2_*.gz \
--UMI ../gdTCR/fastqs/*_R1_*.gz \
--barcode ../gdTCR/fastqs/*_R1_*.gz \
--readFormat bc:0:15,um:16:-1,r1:0:-1 \
--barcodeWhitelist ./737K-august-2016.txt \
-t 4 \
-o ../gdTCR/output/

I guess the large memory cost of the previous run could be because the barcode and UMI sequences were used as read sequence, which could create a lot of divergent paths in the assembly and cause the memory to blow up.

I was wondering if I had done something wrong with the setup! I'll give it a try and close this issue if everything seems ok.

Thank you so much for your help, I really appreciate it.

If the read length in the R1 file is 28 bp, then your data is single-end for sure. Otherwise, you may need to use --readFormat to specify the actual sequence range in read 1.
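One quick way to check the R1 read length is to tabulate sequence lengths over the first few thousand records. The sketch below assumes standard four-line FASTQ records; `fastq_length_counts` is a hypothetical helper name, and the path in the usage comment is the one from the commands in this thread.

```shell
# Read a FASTQ stream on stdin and print "length count" pairs,
# one per distinct sequence length, sorted by length.
fastq_length_counts() {
    # In four-line FASTQ records, the sequence is line 2 of each record.
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' | sort -n
}

# Usage: inspect the first 10,000 records of the R1 files.
# gzip -dc ../gdTCR/fastqs/*_R1_*.gz | head -n 40000 | fastq_length_counts
```

If essentially every read comes back as 28, R1 holds only the barcode and UMI, and the data can be treated as single-end.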

You were entirely correct, and this was not exactly what I was expecting from the data based on the protocol I was looking at. Thank you very much for your help.

For everyone else's reference this call worked for me (although it did still take a long time and consume a lot of memory, and it's basically the same as what @mourisl indicated above):

run-trust4 \
-f ../References/hg38_bcrtcr.fa \
--ref ../References/human_IMGT+C.fa \
-u ../gdTCR/fastqs/*_R2_*.gz \
--UMI ../gdTCR/fastqs/*_R1_*.gz \
--barcode ../gdTCR/fastqs/*_R1_*.gz \
--readFormat bc:0:15,um:16:-1,r1:0:-1 \
--barcodeWhitelist ./737K-august-2016.txt \
-t 8 \
-o ../gdTCR/output/