liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

[Question] How to run using 10X data?

jolespin opened this issue

I'm trying to follow the tutorial you have here: https://github.com/liulab-dfci/TRUST4#10x-genomics-data-and-barcode-based-single-cell-data

However, I'm not sure how to adapt my data.

My R1 reads are 28 bp long and look like this:

@A00588:95:H2H5KDRX3:1:1101:1163:1000 1:N:0:GTAACATGCG+AGGTAACACT
AGCTATCTACTTCTGGTACAACCCACTN
+
FFFF,FFFFF:FFFFFFF:FF:FFFFF#

My R2 reads are 90 bp long and look like this:

@A00588:95:H2H5KDRX3:1:1101:1163:1000 2:N:0:GTAACATGCG+AGGTAACACT
ATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACTCTTGACATCCATGGAATCTTGTAGAGATACGGGAGTGCCTTCGGGACC
+
:,FFFFFF:::F:,F::F:FF:FFFF:FF:FF:F:F:::FFFFFFFF,FFF,,FF,,FFF:FF:FFF:F,FFFF,FFFFFFFFFFF,,F

My usual TRUST4 command for paired end data is the following:

run-trust4 -1 ${R1} -2 ${R2} -o ${OUT_DIR}/trust4 -f References/hg38_bcrtcr.fa --ref References/human_IMGT+C.fa

How can I update to use 10X data?

Following the tutorial, how can I tell which part of the reads above is the barcode, so that I can adapt this usage? -1 read1.fq.gz -2 read2.fq.gz --barcode read1.fq.gz --readFormat bc:0:15,r1:16:-1

Refs: https://www.10xgenomics.com/support/single-cell-gene-expression/documentation/steps/sequencing/sequencing-requirements-for-single-cell-3

I think read 1 holds the barcode and UMI: the first 16 bp are the barcode, followed by a 10 bp UMI. So your command can be
"-u read2.fq.gz --barcode read1.fq.gz --readFormat bc:0:15".

Thanks for getting back to me so quickly!

Just to be clear, the barcode is being parsed from the actual sequence (not the read id):

# Not this
>>> "A00588:95:H2H5KDRX3"[:16]
'A00588:95:H2H5KD'

So this is the barcode? (Length=16)

>>> "AGCTATCTACTTCTGGTACAACCCACTN"[:16]
'AGCTATCTACTTCTGG'

and this is the UMI? (Length=12)

>>> "AGCTATCTACTTCTGGTACAACCCACTN"[16:]
'TACAACCCACTN'

A few follow-up questions:

  • You mentioned there was a 16 bp barcode and 10 bp UMI but the reads are 28 bp long. Is this standard or do I need to make further adjustments?
  • When providing the -u and --barcode arguments, do you still need to provide the -1 and -2 arguments?

Oh, sorry for my typo, the UMI is 12bp in your data. The barcode and UMI will be parsed from the actual sequence.
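
To make the coordinates concrete, here is a minimal Python sketch of how a spec like bc:0:15 maps onto the R1 sequence from this thread. It assumes the ranges are 0-based and inclusive, with -1 meaning "to the end of the read", which is how they are used above. The helper names are made up for illustration, and the `um` label is only a name chosen for this sketch; check the TRUST4 help for the fields it actually accepts.

```python
def parse_read_format(spec):
    """Parse a spec such as 'bc:0:15,r1:16:-1' into {name: (start, end)}."""
    fields = {}
    for item in spec.split(","):
        name, start, end = item.split(":")
        fields[name] = (int(start), int(end))
    return fields


def extract(seq, start, end):
    """Slice seq with a 0-based inclusive range; end == -1 means 'to the end'."""
    return seq[start:] if end == -1 else seq[start:end + 1]


# R1 sequence quoted earlier in this thread (28 bp).
r1 = "AGCTATCTACTTCTGGTACAACCCACTN"

# 'um' is an illustrative label for the UMI range, not a confirmed TRUST4 field.
fmt = parse_read_format("bc:0:15,um:16:-1")
barcode = extract(r1, *fmt["bc"])
umi = extract(r1, *fmt["um"])

print(barcode, len(barcode))  # AGCTATCTACTTCTGG 16
print(umi, len(umi))          # TACAACCCACTN 12
```

So bc:0:15 covers exactly the 16 bp barcode, and the remaining 12 bp of the 28 bp read are the UMI.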

-u means the reads are single-end, and -1/-2 are for paired-end data. In your case the data are effectively single-end, and the sequence is in the R2 file.

Oh ok cool, so I can still use --readFormat bc:0:15?

Also, I ran this on a test and it ran to completion without any errors:

run-trust4 -o ${OUT_DIR}/trust4 -f References/hg38_bcrtcr.fa --ref References/human_IMGT+C.fa -u ${R2} --barcode ${R1} --readFormat bc:0:15 -1 ${R1} -2 ${R2}

Should I have run this instead without -1 ${R1} -2 ${R2}?

run-trust4 -o ${OUT_DIR}/trust4 -f References/hg38_bcrtcr.fa --ref References/human_IMGT+C.fa -u ${R2} --barcode ${R1} --readFormat bc:0:15 

You should run without -1, -2.
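
For anyone scripting this, a small sketch of assembling the recommended call from Python. It only uses the flags that already appear in this thread; the function name and the sample file names are hypothetical, and nothing here is an official wrapper.

```python
import shlex


def build_trust4_10x_command(r1, r2, out_prefix, coord_fa, ref_fa,
                             read_format="bc:0:15"):
    """Build the argument list for the 10x-style run discussed above.

    R2 carries the cDNA sequence and goes to -u; R1 carries the barcode and
    goes to --barcode. -1/-2 are deliberately left out.
    """
    return [
        "run-trust4",
        "-u", r2,
        "--barcode", r1,
        "--readFormat", read_format,
        "-f", coord_fa,
        "--ref", ref_fa,
        "-o", out_prefix,
    ]


# Hypothetical sample file names, reference paths as in the thread.
cmd = build_trust4_10x_command(
    "sample_R1.fastq.gz",
    "sample_R2.fastq.gz",
    "out/trust4",
    "References/hg38_bcrtcr.fa",
    "References/human_IMGT+C.fa",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell quoting issues when paths contain spaces.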

Thanks for your help. Last question and I will close the issue.

How did you know the barcode was the first 16 bp of read 1 and the UMI was the last 12? Is this a standard experimental design?

Yes, this is one of the standard layouts 10x uses. If you have a barcode whitelist, you can check the composition manually by looking up a few barcodes from your reads in it.
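
The whitelist check can also be scripted. Below is a rough Python sketch that measures what fraction of R1 reads have a whitelisted barcode at a given offset; if the assumed layout is right, the hit rate at offset 0 should be much higher than at a shifted offset. The function names are made up for illustration, and the one-entry whitelist is toy data so the example runs; it is not a claim about any real 10x whitelist.

```python
import io


def read_fastq_sequences(handle):
    """Yield the sequence line (2nd of every 4 lines) from a FASTQ handle."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def whitelist_hit_rate(seqs, whitelist, start=0, length=16):
    """Fraction of reads whose seq[start:start+length] is in the whitelist."""
    total = hits = 0
    for seq in seqs:
        total += 1
        if seq[start:start + length] in whitelist:
            hits += 1
    return hits / total if total else 0.0


# Toy input: the R1 read from this thread plus one invented read.
fastq = io.StringIO(
    "@read1\nAGCTATCTACTTCTGGTACAACCCACTN\n+\nFFFF,FFFFF:FFFFFFF:FF:FFFFF#\n"
    "@read2\nTTTTTTTTTTTTTTTTACGTACGTACGT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFF\n"
)
whitelist = {"AGCTATCTACTTCTGG"}  # toy whitelist, illustration only

seqs = list(read_fastq_sequences(fastq))
rate_at_0 = whitelist_hit_rate(seqs, whitelist, start=0)
rate_at_12 = whitelist_hit_rate(seqs, whitelist, start=12)
print(rate_at_0, rate_at_12)  # 0.5 0.0
```

On real data you would load the whitelist file into a set and read the first few thousand R1 records from the (possibly gzipped) FASTQ instead of the in-memory string.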