liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

Question about read aggregation by barcode

ejohnson643 opened this issue · comments

Hi!

I am attempting to run TRUST4 on a single-cell V(D)J data set and it is running rather slowly, so I was peeking under the hood to see whether there were any easy opportunities for parallelization. I noticed that even in the single-cell case, the main TRUST4 call assembles a full list of reads, roughly annotates them, and then sorts them. When we have barcodes, though, don't we want to group the reads by barcode before sorting and processing them? I have 250 million reads, so this is quite slow as a serial process, but if I were to split my inputs by barcode and run TRUST4 in parallel on each subset, I shouldn't have a problem, correct?

Basically, is there any sharing of information across reads before they are separated by barcode that I need to watch out for, or can I treat each barcode as a bulk sample and process it that way?

Thanks for any insight!

For what it's worth, here is my script in case I am doing something dumb with any of the arguments. I find that TRUST4 starts to hang in the phase where it is reading in and processing the reads.

#!/usr/bin/bash

# Sample identifiers used to build the merged FASTQ file names.
donor_id="CD001"
exp_id="control_5cancer"

gdTCR_dir="/mnt/data0/TCR_Discovery_Platform/SequencingData/gdTCR"
ref_dir="/mnt/data0/References"

# R2 carries the cDNA read; R1 carries the 16 bp cell barcode followed by the UMI.
run-trust4 \
    -u "$gdTCR_dir/fastqs/merged_${donor_id}_${exp_id}_R2.fastq.gz" \
    -f "$ref_dir/hg38_bcrtcr.fa" \
    --ref "$ref_dir/human_IMGT+C.fa" \
    -o "${donor_id}_${exp_id}" \
    --od "$gdTCR_dir/output/TRUST4/${exp_id}" \
    -t 24 \
    --barcode "$gdTCR_dir/fastqs/merged_${donor_id}_${exp_id}_R1.fastq.gz" \
    --barcodeWhitelist "$ref_dir/737K-august-2016.txt" \
    --UMI "$gdTCR_dir/fastqs/merged_${donor_id}_${exp_id}_R1.fastq.gz" \
    --readFormat bc:0:15,um:16:-1,r1:0:-1 \
    --outputReadAssignment \
    --stage 1

Yes, that's actually a feature I've been planning to work on for a while, but I need to find a large block of time to implement it.
For your implementation, I think the proper way is to run fastq-extract first; that will also generate error-corrected barcodes. Then you can split the reads into batches by barcode and run "run-trust4" with the "--noExtraction" option.
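
Roughly something like the sketch below, just to illustrate the idea. It is untested; the per-batch file names are placeholders, so check the actual output names that the extraction step produces for your version before wiring this up.

#!/usr/bin/bash

# Assumes you have already (not shown here):
#   1. run the extraction step once to produce candidate reads and error-corrected barcodes, and
#   2. split those reads into per-batch files such as batch_0_R2.fq / batch_0_bc.fa
#      (the splitting helper and these file names are placeholders, not part of TRUST4).

ref_dir="/mnt/data0/References"
n_batches=8

# Assemble each barcode batch independently; 8 batches x 3 threads roughly matches your 24 cores.
for i in $(seq 0 $((n_batches - 1))); do
    run-trust4 \
        --noExtraction \
        -u "batch_${i}_R2.fq" \
        --barcode "batch_${i}_bc.fa" \
        -f "$ref_dir/hg38_bcrtcr.fa" \
        --ref "$ref_dir/human_IMGT+C.fa" \
        -o "batch_${i}" \
        -t 3 &
done
wait   # block until every batch finishes, then merge the per-batch reports downstream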

Since your data is V(D)J-amplified, the "--repseq" option may also help with speed, without your having to hack the multithreading.
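
Concretely, assuming "--repseq" is a plain on/off switch in your version of run-trust4, it would just be one extra line on the command you posted, e.g.:

run-trust4 \
    -u "$gdTCR_dir/fastqs/merged_${donor_id}_${exp_id}_R2.fastq.gz" \
    -f "$ref_dir/hg38_bcrtcr.fa" \
    --ref "$ref_dir/human_IMGT+C.fa" \
    -o "${donor_id}_${exp_id}" \
    --od "$gdTCR_dir/output/TRUST4/${exp_id}" \
    -t 24 \
    --barcode "$gdTCR_dir/fastqs/merged_${donor_id}_${exp_id}_R1.fastq.gz" \
    --barcodeWhitelist "$ref_dir/737K-august-2016.txt" \
    --UMI "$gdTCR_dir/fastqs/merged_${donor_id}_${exp_id}_R1.fastq.gz" \
    --readFormat bc:0:15,um:16:-1,r1:0:-1 \
    --repseq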

Amazing! Thank you!

It seemed like the natural next step for this program, so I'll take a look and see whether I can contribute anything as well!

Thanks for the advice on the flags, I'll give them a try.