analysis for multiple runs per sample

Question

analysis for multiple runs per sample

shishuo16 opened this issue 4 months ago · comments

Hi,

Thank you for this great tool! I have a question about how to merge multiple runs per sample. I have two fastq1 and fastq2 files per sample (four fastq files), I wonder in which step should I merge them together ? "java -jar picard.jar FastqToSam" or "Drop-seq_alignment.sh" ? And can you provide a usage example ?

Thanks in advance.

alecw · Answer 1 · Wed Apr 10 2024 22:41:09 GMT+0800 (China Standard Time)

Hi @shishuo16 ,

It depends. Are these multiple 10X reactions, or reads from the same 10X reaction? If they are from multiple 10X reactions, then you shouldn't merge them without some thought, because you might get the same cell barcode in different reactions and you don't want to combine those.

If it's all the same 10X reaction, then you have choices that are all fine, but depend on your computing situation.

You could simply concatenate (using cat) all your fastq1s, and do the same for all your fastq2s. Just make sure you cat them in the same order for both read1 and read2. Note that even if the fastqs are gzipped, you can use cat.
You could run the Drop-seq_alignment.sh workflow independently on each fastq pair, and then merge the aligned BAMs, using Picard MergeSamFiles or samtools merge.

The first approach is simpler. The second approach allows you to do a lot of processing in parallel, which depending on your compute environment might be faster.

Regards, Alec

Marcel Schilling · Answer 2 · Wed Apr 10 2024 23:11:47 GMT+0800 (China Standard Time)

The 2nd approach also has the advantage that when you add more samples later you don't have to re-run all the processing for the previous samples but just for the new data and merge the results.

Shuo Shi · Answer 3 · Thu Apr 11 2024 08:56:22 GMT+0800 (China Standard Time)

Got it, thank you!

Shuo Shi · Answer 4 · Fri Apr 12 2024 09:11:23 GMT+0800 (China Standard Time)

@alecw
Hi, I used your first suggestion (cat fastqs), however when I run Drop-seq_alignment.sh, it went to wrong
`
INFO 2024-04-11 13:11:54 TagBamWithReadSequenceExtended

********** NOTE: Picard's command line syntax is changing.

********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)

********** The command line looks like this in the new syntax:

********** TagBamWithReadSequenceExtended -SUMMARY /media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align//unaligned_tagged_Cellular.bam_summary.txt -BASE_RANGE 1-12 -BASE_QUALITY 10 -BARCODED_READ 1 -DISCARD_READ false -TAG_NAME XC -NUM_BASES_BELOW_QUALITY 1 -INPUT /media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/GSM4416535_unaligned_read_pairs.bam -OUTPUT /media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align/tmp//unaligned_tagged_Cell.bam

13:11:54 [main] WARN com.intel.gkl.NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (没有那个文件或目录)
13:11:54 [main] WARN com.intel.gkl.NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (没有那个文件或目录)
[Thu Apr 11 13:11:54 CST 2024] TagBamWithReadSequenceExtended INPUT=/media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/GSM4416535_unaligned_read_pairs.bam OUTPUT=/media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align/tmp/unaligned_tagged_Cell.bam SUMMARY=/media/dell/171fe25b-15b7-4827-8508-ab06a0a8b3f4/single_cell/GSE147082/broad_Drop-seq/GSM4416535/align/unaligned_tagged_Cellular.bam_summary.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=10 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC TAG_BARCODED_READ=false HARD_CLIP_BASES=false TAG_QUALITY=XQ VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Apr 11 13:11:54 CST 2024] Executing as dell@dell-PowerEdge-T550 on Linux 6.5.0-18-generic amd64; OpenJDK 64-Bit Server VM 11.0.13+7-b1751.21; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.5.3(e21ba24_1679595148)
13:11:54 [main] WARN com.intel.gkl.compression.IntelInflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
13:11:54 [main] WARN com.intel.gkl.compression.IntelDeflaterFactory - Intel Deflater not supported, using Java.util.zip.Deflater
ERROR 2024-04-11 13:11:54 TagBamWithReadSequenceExtended Reads not properly paired! R1: 1 R2: 1
`

I wonder is it because the two fastq1 have same reads name ?

head of the first fastq1 file:
@1
CTCCCNAGAGGAATCCTTCG
+
AAAA/#/EE6AEEE<E/E/E
@2
CGGGTNTGGCGCCGGGCAAG
+
AAA/A#EEEEEEEEEEEEEE
@3
TGCCGNACGCATCTTGGGTG

head of the second fastq1 file:
@1
NAAGGACCTATCGGGCGCTACTGGG
+
#FFFFFFFFFFFFFF:FFFFF::::
@2
NTTCGTACACTTGCGGCGGAGTGAC
+
#FFFFFFFFFFFFFFFFFFFFF:,,
@3
NTGCCTGTTCACCGCAAGATGGTTG

alecw · Answer 5 · Fri Apr 12 2024 09:29:54 GMT+0800 (China Standard Time)

Hi @shishuo16 ,
The integer read names are suspicious. These are not normal Illumina FASTQs. Where did the reads come from. Did pipseeker produce these?

Regards, Alec

Shuo Shi · Answer 6 · Fri Apr 12 2024 09:34:23 GMT+0800 (China Standard Time)

Hi @alecw
I use prefetch command download the sra file from NCBI (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA612966&o=acc_s%3Aa), And then use "fastq-dump.3 ./fastq/${line}/${line}.sra --split-files --origfmt --defline-qual '+' --gzip -O ./fastq/" command to transform them from sra to fastq, and I got the fastq files

alecw · Answer 7 · Fri Apr 12 2024 09:41:14 GMT+0800 (China Standard Time)

If your multiple runs have the same read names, you can't cat them like I suggested originally. If these are from the same 10X reaction, which I'm assuming they are since you chose to cat them, you'll need to do something to disambiguate the read names across pairs of fastqs.