Question about read aggregation by barcode
ejohnson643 opened this issue · comments
Hi!
I am attempting to run TRUST4 on a single-cell V(D)J data set, and it is running rather slowly, so I was peeking under the hood to see if there were any easy opportunities for parallelization. I noticed that even in the single-cell case, the main call to TRUST4 assembles a full list of reads, roughly annotates them, and then sorts them. When we have barcodes, though, don't we want to group the reads by barcode first and then sort and process each group? With 250 million reads, this is quite slow as a serial process. If I were to split my inputs by barcode and run TRUST4 in parallel on each subset, I shouldn't have an issue, correct?
Basically, is there any sharing of information between reads before they are separated by barcodes that I need to watch for, or can I treat each barcode as a bulk sample and process them that way?
Thanks for any insight!
For what it's worth, here is my script, in case I am doing something dumb with any of the arguments. The run appears to hang in the phase where TRUST4 is reading in and processing the reads.
#!/usr/bin/bash
donor_id="CD001"
exp_id="control_5cancer"
gdTCR_dir="/mnt/data0/TCR_Discovery_Platform/SequencingData/gdTCR"
ref_dir="/mnt/data0/References"
run-trust4 \
-u $gdTCR_dir/fastqs/merged_${donor_id}_${exp_id}_R2.fastq.gz \
-f $ref_dir/hg38_bcrtcr.fa \
--ref $ref_dir/human_IMGT+C.fa \
-o ${donor_id}_${exp_id} \
--od $gdTCR_dir/output/TRUST4/${exp_id} \
-t 24 \
--barcode $gdTCR_dir/fastqs/merged_${donor_id}_${exp_id}_R1.fastq.gz \
--barcodeWhitelist $ref_dir/737K-august-2016.txt \
--UMI $gdTCR_dir/fastqs/merged_${donor_id}_${exp_id}_R1.fastq.gz \
--readFormat bc:0:15,um:16:-1,r1:0:-1 \
--outputReadAssignment \
--stage 1
Yes, that's actually a feature I've been planning to work on for a while, but I need to find a large block of time to implement it.
For your implementation, I think the proper way is to run fastq-extract first, which will also generate error-corrected barcodes. Then you can split the reads into batches by barcode and run "run-trust4" with the option "--noExtraction" on each batch.
Since your data is V(D)J-amplified, the "--repseq" option may also help with speed, without any hacking on the multithreading.
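For anyone following along, the split-and-rerun workflow described above might be sketched roughly like this. It is only an illustration: the toy FASTQ, the `BC=` read-name convention, the shard count, and the hash function are all assumptions, and the per-shard `run-trust4 --noExtraction` invocations are shown as comments because the exact arguments depend on your setup.

```shell
#!/usr/bin/env bash
# Sketch: shard a barcode-tagged FASTQ into N batches so each batch can be
# fed to run-trust4 --noExtraction independently. Assumes the corrected
# barcode appears in the read header as a BC= tag (illustrative only).
set -euo pipefail

NUM_SHARDS=4
INPUT=toy_bc.fastq

# Toy input (2 reads) so the sketch is self-contained.
printf '@r1_BC=AAAA\nACGT\n+\nIIII\n@r2_BC=CCCC\nTGCA\n+\nIIII\n' > "$INPUT"

mkdir -p shards
# Hash each barcode to pick a shard, so all reads from one cell land in the
# same batch; non-header lines follow their header's shard.
awk -v n="$NUM_SHARDS" '
  NR % 4 == 1 {
    bc = $0; sub(/^.*BC=/, "", bc)
    h = 0
    for (i = 1; i <= length(bc); i++)
      h = (h * 31 + index("ACGTN", substr(bc, i, 1))) % n
    out = sprintf("shards/batch_%d.fastq", h)
  }
  { print > out }
' "$INPUT"

# Each shard could then be processed in parallel, e.g. (arguments are
# placeholders, adapt to your reference files):
#   for f in shards/batch_*.fastq; do
#     run-trust4 --noExtraction -u "$f" -f hg38_bcrtcr.fa --ref human_IMGT+C.fa &
#   done
#   wait
```

Splitting on a hash of the barcode (rather than one file per barcode) keeps the number of shards, and thus the number of parallel TRUST4 jobs, bounded while still guaranteeing that no cell's reads are split across jobs.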
Amazing! Thank you!
This seemed like a natural next step for the program, so I'll take a look and see whether I can contribute anything as well!
Thanks for the advice on the flags, I'll give them a try.