kaist-ina / BWA-MEME

BWA-MEME: Faster BWA-MEM2 using learned-index

Home Page: https://ina.kaist.ac.kr/projects/bwa-meme/

Why are bwa and BWA-MEME results inconsistent?

yukaiquan opened this issue · comments

Dear developer:

bwa: Version 0.7.17-r1188
BWA-MEME: v1.0.5

flagstat of the bwa BAM:
338883556 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
448166 + 0 supplementary
0 + 0 duplicates
330144737 + 0 mapped (97.42% : N/A)
338435390 + 0 paired in sequencing
169217695 + 0 read1
169217695 + 0 read2
322879394 + 0 properly paired (95.40% : N/A)
329460362 + 0 with itself and mate mapped
236209 + 0 singletons (0.07% : N/A)
5641738 + 0 with mate mapped to a different chr
2394586 + 0 with mate mapped to a different chr (mapQ>=5)
flagstat of the BWA-MEME BAM:
338883548 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
448158 + 0 supplementary
0 + 0 duplicates
330144743 + 0 mapped (97.42% : N/A)
338435390 + 0 paired in sequencing
169217695 + 0 read1
169217695 + 0 read2
322879718 + 0 properly paired (95.40% : N/A)
329460388 + 0 with itself and mate mapped
236197 + 0 singletons (0.07% : N/A)
5641548 + 0 with mate mapped to a different chr
2394572 + 0 with mate mapped to a different chr (mapQ>=5)

Hi yukaiquan,

Thank you for trying out and reporting the issue.

There is some randomness in BWA, BWA-MEM2, and BWA-MEME because the chunk (batch) size changes with the number of threads.
e.g., chunk (batch) statistics such as the insert-size distribution are used for paired-end mapping, so different chunk boundaries can produce slightly different pairing decisions.

Have you tried comparing the output using a fixed chunk size?

  • You can set a fixed chunk size with the -K option:
# Perform alignment with BWA-MEME, add -7 option
bwa-meme mem -7 -Y -K 100000000 -t <num_threads> <input.fasta> <input_1.fastq> -o <output_meme.sam>

# Below runs alignment with BWA-MEM2, without -7 option
bwa-meme mem -Y -K 100000000 -t <num_threads> <input.fasta> <input_1.fastq> -o <output_mem2.sam>

# Compare output SAM files
diff <output_mem2.sam> <output_meme.sam>

# To diff large SAM files use https://github.com/unhammer/diff-large-files

Thanks!

Hi quito418:
Thank you very much for your patient explanation; the results are consistent after adding -K.

Can the index be loaded only once when comparing thousands of samples in batches? Reading the index takes a lot of time.

Thanks!

Glad to hear it worked :)

At the moment, we have not developed a way to load the index once and reuse it across runs.

Below are my suggestions that can be applied now:

  1. Rely on the Linux page cache (by default, a file you read or write is cached in RAM). If you run BWA-MEME jobs sequentially on the same Linux machine, subsequent reads of the index will be served from memory (roughly 3-5 GB/s of I/O throughput).
  2. Use a RAM disk, e.g., place the index files in /dev/shm (~40 GB is needed for the indexes required at runtime). This works much like the first method; see the sketch after this list.
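A minimal sketch of both options, assuming the index was built with the prefix ref.fasta; the file glob ref.fasta*, the /dev/shm/bwa-meme-index path, and the input/output names are placeholders to adjust for your setup:

# Option 1: pre-warm the Linux page cache so subsequent runs read the index from RAM
cat ref.fasta* > /dev/null

# Option 2: stage the reference and index files on a RAM disk (~40 GB free needed in /dev/shm)
mkdir -p /dev/shm/bwa-meme-index
cp ref.fasta* /dev/shm/bwa-meme-index/
bwa-meme mem -7 -Y -K 100000000 -t <num_threads> /dev/shm/bwa-meme-index/ref.fasta <input_1.fastq> -o <output.sam>

# Free the RAM once the batch is finished
rm -r /dev/shm/bwa-meme-index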

Thanks!
Best wishes!