How to use fastq-extractor to deal with paired-UMI sequencing data in trust4 suite

yqyuhao opened this issue 3 months ago · comments

yu hao · Answer 1 · Fri Mar 29 2024 13:49:18 GMT+0800 (China Standard Time)

Dear editor
How can I use fastq-extractor to process paired-UMI sequencing data in the TRUST4 suite? The read structure of my library is 3M3S+T for read 1 and 3M3S+T for read 2.

Li Song · Answer 2 · Fri Mar 29 2024 22:46:41 GMT+0800 (China Standard Time)

Sorry, I don't quite get your question. What is paired-UMI sequencing? What does 3M3S+T mean?

yu hao · Answer 3 · Sat Mar 30 2024 23:11:05 GMT+0800 (China Standard Time)

Yes. I used the KAPA Universal UMI Adapter to prepare the library. It is a universal UMI adapter for ligation-based library construction prior to sample barcoding in the KAPA HyperCap workflow and the KAPA HyperPETE workflows (Somatic Tissue DNA and Somatic Plasma cfDNA). We usually process the raw sequencing data with fgbio, where the read structure of each read is defined as 3M3S+T: extract the first 3 bases of the read as the UMI (3M); skip the next 3 bases (3S), which constitute a punctuation sequence that increases sequence diversity to ensure optimal sequencing performance; and keep the remaining sequence as the insert read (+T). The UMIs extracted from read 1 and read 2 are stored in the RX tag of the unmapped BAM file as UMI1-UMI2.
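
To make the 3M3S+T notation concrete, here is a minimal Python sketch of how such a read structure splits a read. It is not fgbio's implementation; the function names and the example sequences are made up for illustration, and only the M (UMI), S (skip) and T (template) segment kinds are handled.

```python
import re

def parse_read_structure(structure):
    """Split a read structure such as '3M3S+T' into (length, kind) segments.

    length is an int, or None for '+' (meaning "all remaining bases").
    """
    segments = re.findall(r"(\d+|\+)([A-Z])", structure)
    if "".join(n + k for n, k in segments) != structure:
        raise ValueError(f"cannot parse read structure: {structure!r}")
    return [(None if n == "+" else int(n), k) for n, k in segments]

def apply_read_structure(seq, structure):
    """Return (umi, template) for one read sequence."""
    umi, template, pos = "", "", 0
    for length, kind in parse_read_structure(structure):
        end = len(seq) if length is None else pos + length
        piece = seq[pos:end]
        if kind == "M":      # molecular barcode (UMI) bases
            umi += piece
        elif kind == "T":    # template (insert) bases
            template += piece
        # Any other kind (e.g. 'S' for skip) is discarded in this sketch.
        pos = end
    return umi, template

# Hypothetical read 1 and read 2 sequences, not real data.
umi1, insert1 = apply_read_structure("ACGTTTGGGCCCAAA", "3M3S+T")
umi2, insert2 = apply_read_structure("TTACCCGATTACA", "3M3S+T")
print(f"{umi1}-{umi2}", insert1, insert2)  # prints: ACG-TTA GGGCCCAAA GATTACA
```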

Li Song · Answer 4 · Sat Mar 30 2024 23:54:42 GMT+0800 (China Standard Time)

Do you mean the UMI is 6bp, where the first 3 bp are from the beginning of read1 and the other 3bp come from the beginning of read2?

yu hao · Answer 5 · Tue Apr 02 2024 09:36:12 GMT+0800 (China Standard Time)

Yes, I don't know how to use fastq-extractor to deal with this data.

Li Song · Answer 6 · Tue Apr 02 2024 10:05:51 GMT+0800 (China Standard Time)

The current implementation cannot handle barcode/UMI across two files. You may need to implement your own script to reformat the files. I will think about how to implement it with the current framework in the future.

yu hao · Answer 7 · Tue Apr 02 2024 11:54:33 GMT+0800 (China Standard Time)

Given the current situation, how can I reformat the files to meet the requirements of fastq-extractor?

Li Song · Answer 8 · Tue Apr 02 2024 11:57:34 GMT+0800 (China Standard Time)

You can extract and concatenate the 3 bp from each read to create another file, e.g. "XXX_barcode.fq". You should also remove those UMI sequences from the read sequences. With those files you can run TRUST4 through the wrapper "run-trust4" with extra options such as "--barcode XXX_barcode.fq --barcodeLevel molecule".
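
The reformatting described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested pipeline: it assumes uncompressed, 4-line-per-record FASTQ files with read 1 and read 2 in the same order, that the barcode file should hold one record per read pair in that same order, and that the 3 skipped bases after each UMI should be trimmed too (following the 3M3S+T structure). All function and file names are hypothetical.

```python
def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:          # end of file
            return
        yield lines[0], lines[1], lines[3]

def split_pair(seq1, qual1, seq2, qual2, umi_len=3, skip_len=3):
    """Return ((umi, umi_qual), (read1, qual1), (read2, qual2)) after trimming.

    The UMI is the first umi_len bases of read 1 followed by the first
    umi_len bases of read 2; umi_len + skip_len bases are cut from each read.
    """
    cut = umi_len + skip_len
    umi = seq1[:umi_len] + seq2[:umi_len]
    umi_qual = qual1[:umi_len] + qual2[:umi_len]
    return (umi, umi_qual), (seq1[cut:], qual1[cut:]), (seq2[cut:], qual2[cut:])

def reformat(r1_in, r2_in, r1_out, r2_out, barcode_out, umi_len=3, skip_len=3):
    """Write trimmed read files plus a FASTQ file holding the combined UMIs."""
    with open(r1_in) as f1, open(r2_in) as f2, \
         open(r1_out, "w") as o1, open(r2_out, "w") as o2, \
         open(barcode_out, "w") as ob:
        for (h1, s1, q1), (h2, s2, q2) in zip(read_fastq(f1), read_fastq(f2)):
            (umi, uq), (t1, tq1), (t2, tq2) = split_pair(
                s1, q1, s2, q2, umi_len, skip_len)
            o1.write(f"{h1}\n{t1}\n+\n{tq1}\n")
            o2.write(f"{h2}\n{t2}\n+\n{tq2}\n")
            ob.write(f"{h1}\n{umi}\n+\n{uq}\n")

# Example call with hypothetical file names:
# reformat("XXX_1.fq", "XXX_2.fq", "XXX_1.trimmed.fq", "XXX_2.trimmed.fq",
#          "XXX_barcode.fq")
```

Gzipped input would need `gzip.open(path, "rt")` in place of `open(path)`.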

yu hao · Answer 9 · Tue Apr 02 2024 12:14:30 GMT+0800 (China Standard Time)

Thank you for providing the information. In the scenario you described, using the UMI from read1 as the barcode sequence and the UMI from read2 as the UMI sequence, while considering only the UMI from read2 as a group during the assembly process, indeed neglects the role of the UMI from read1. This approach may not be applicable in all scenarios as UMIs (Unique Molecular Identifiers) are typically used to mark read pairs that originated from the same original molecule, facilitating their distinction in subsequent analyses.

In a standard UMI processing workflow, the UMIs from read1 and read2 (or additional read pairs, if applicable) should be consistent, allowing them to be used to group all reads originating from the same molecule. This grouping is crucial for accurate data processing in subsequent steps such as deduplication and error correction.

Ignoring the UMI from read1 and only considering the UMI from read2 can lead to the loss of important information, compromising the accuracy of data processing. For instance, if you attempt to deduplicate based on the UMI, only considering the UMI from read2 could mistakenly consider read pairs from the same molecule as distinct.

Therefore, when designing and implementing a UMI analysis workflow, it is essential to ensure that both the UMI from read1 and read2 (and additional read pairs, if applicable) are properly utilized and considered. This ensures data integrity and accuracy, leading to more reliable analysis results.

Li Song · Answer 10 · Tue Apr 02 2024 12:20:42 GMT+0800 (China Standard Time)

Sorry for the confusion. I mean you need to concatenate the UMI portion from the two reads into one, and then dump them into one fasta file. This file will essentially be the UMI file.

The --UMI option in TRUST4 is only for abundance estimation, for data like 10x Genomics. In your case, the UMI serves as a molecule-level barcode. Therefore, the appropriate approach is to treat this file as a barcode file and then use the option "--barcodeLevel molecule" to specify that it is a true UMI.
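
Put together, the invocation might look like the sketch below, which only assembles and prints the command line. The "--barcode" and "--barcodeLevel molecule" options are the ones named in this thread; "-f", "--ref", "-1" and "-2" are the usual run-trust4 arguments as I recall them from the TRUST4 README, so please check them against the run-trust4 help output. The helper name and all file names are placeholders.

```python
import shlex

def build_trust4_command(read1, read2, barcode_fq, coord_fa, ref_fa,
                         exe="run-trust4"):
    """Assemble a run-trust4 command that treats the UMI file as a
    molecule-level barcode. Returns the argument list; nothing is executed."""
    return [
        exe,
        "-f", coord_fa,               # V/D/J/C gene sequences with coordinates
        "--ref", ref_fa,              # IMGT reference for annotation
        "-1", read1,                  # trimmed read 1
        "-2", read2,                  # trimmed read 2
        "--barcode", barcode_fq,      # concatenated read1+read2 UMIs
        "--barcodeLevel", "molecule", # the barcode marks molecules, not cells
    ]

# Placeholder file names; substitute your own.
cmd = build_trust4_command("XXX_1.trimmed.fq", "XXX_2.trimmed.fq",
                           "XXX_barcode.fq", "hg38_bcrtcr.fa",
                           "human_IMGT+C.fa")
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```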

Just curious: with a 6 bp UMI, it is very easy to have UMI conflicts (two molecules sharing the same UMI). Is that a concern in your data?
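
To put a rough number on the conflict risk: a 6 bp UMI has at most 4^6 = 4096 distinct values. The sketch below applies the standard birthday-problem calculation, under the simplifying assumption that every 6-mer is equally likely, which an adapter with a fixed set of UMI sequences would not satisfy (the real UMI space would then be smaller). Function names are illustrative.

```python
import math

def umi_space(umi_len):
    """Number of distinct UMIs of the given length over A/C/G/T."""
    return 4 ** umi_len

def prob_any_collision(n_molecules, umi_len=6):
    """Birthday-problem probability that at least two molecules share a UMI."""
    k = umi_space(umi_len)
    if n_molecules > k:
        return 1.0  # more molecules than UMIs: a collision is guaranteed
    # log of P(all UMIs distinct) = sum of log(1 - i/k) for i = 0..n-1
    log_p_unique = sum(math.log1p(-i / k) for i in range(n_molecules))
    return 1.0 - math.exp(log_p_unique)

def expected_colliding_pairs(n_molecules, umi_len=6):
    """Expected number of molecule pairs that share a UMI."""
    return n_molecules * (n_molecules - 1) / 2 / umi_space(umi_len)

print(umi_space(6))  # prints: 4096
for n in (10, 100, 1000):
    print(n, round(prob_any_collision(n), 3), round(expected_colliding_pairs(n), 1))
```

Under this assumption, 100 molecules already give a collision probability of roughly 0.7, so groups defined by the UMI alone can mix reads from different molecules.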