Index loading

Question

Index loading

keiranmraine opened this issue 2 years ago · comments

I'm finding it takes ~25 minutes to load the various components of the indexes, without -7 this is only a couple of minutes.

The loading of the core reference files up to the following message runs at ~100% CPU:

* Reading reference genome..
* Binary seq file = /home/kr525/rds/hpc-work/data/ref/Homo_sapiens_assembly38.fasta.0123
* Reference genome size: 6434693834 bp
* Done reading reference genome !!

The section as follows runs at 5-15% CPU indicating disk-wait:

[M::memoryAllocLearned::LEARNED] Reading kmer SA index File to memory
[M::memoryAllocLearned::LEARNED] Reading ref2sa index File to memory

Is there anything obvious relating to the file reading that could account for this?

I expect unrelated, but I did notice that ref.suffixarray_uint64 can be compressed with gzip -1 for 50% reduction in size.
Decompression cost for L1 is negligible compared to the disk latency (and will be more cost effective for systems with IOPs accounting).

Young-mok Jung · Answer 1 · Wed Apr 27 2022 22:33:45 GMT+0800 (China Standard Time)

This is a good point. It is true it takes long to load all indexes with low disk I/O.

I assume you might be loading indexes from HDD disk, where the disk read speed seems around ~100MB/sec.

The problem is that file is just too big and it takes time to load it. When you use SSD (~500MB/s), it can be loaded within ~4 minutes.

We thought about some solutions to this problem before and below are the solutions we have until now.

1) Load only the pos_packed file (~29GB) and build possa_packed (78GB) and inverse suffix array (29GB) at startup.

We tested this option before and it turns out the speed is similar to loading from SSD. (building at startup time is fast)
It is in-memory operation and does not require extra memory space while doing it.
With your current disk read speed, it might still take ~8-9 minutes (read 29GB file + build other indexes).
~~But it is not enabled in the current codebase.~~ (Enabled as default index loading master branch and v1.0.4 release)

2) Rely on OS cache (OS caches files read or written, on free memory space)

This scenario works when the user has a large memory and rerun BWA-MEME sequentially.

3) Depending on the amount of (workload, disk type) use mode2 or mode1 which can be faster.

4) Store index files in place with higher disk I/O

~~If you want to try solution 1), I can provide it within a few(2-3) days.~~
This option is enabled as default index loading in master branch, since v1.0.4.

Keiran Raine · Answer 2 · Wed Apr 27 2022 22:39:23 GMT+0800 (China Standard Time)

I think (1) is the best option to allow cloud solutions to work where you have 2 factors against you:

IOP throttling, heavy blocks of read exhaust quota
Instances are generally destroyed on completion of a job

I'm currently running this on a university cluster, so I was expecting better disk performance than 100MB/s. At Sanger lustre was exceptionally fast, I'll follow up with UoC to find out if there's something I'm missing here.

Young-mok Jung · Answer 3 · Thu Apr 28 2022 15:07:55 GMT+0800 (China Standard Time)

I like to share the numbers I got with option 1.

Tested on 150MB/s hdd , Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz, 
---------------------------------------
RMI models: 54 sec (~10GB)
pos_packed file: 191 sec (29GB)
Build index in-memory: 86 sec (used 32 thread)
Total: 5 min 30 sec, 

Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz,  
---------------------------------------
Build index in-memory: 43sec (used 48 thread)

If you want to test it right now, the code is available in dev branch.
( you don't need to rebuild the index, just recompile the code)

# recompile the code
make -j 32
# run alignment with runtime-build, 
bwa-meme mem -7 -t 32 <reference> <fasta>

At startup, bwa-meme will read the .pos_packed file and build other indexes based on it.
The in-memory index building is an alternative to reading 77GB of indices.

By the way, I found that bwa-meme alignment speed seems to be moderately higher using mimalloc, even with small threads, while bwa-mem2 is marginally improved.
I will share the number here soon.

Keiran Raine · Answer 4 · Thu Apr 28 2022 16:58:52 GMT+0800 (China Standard Time)

This looks good, rerunning this with some sample data (2x human genome, ~5.2GB of fastq input) I now get close to the run time of bwa-mem2 (550s vs 580s). Previously the best I could do was ~170s longer (best of 5, I suspect network congestion plays a part as they range from 700s-2000s).

I will note that excluding the loading of the reference the sum of the Processed 86958 reads in 3.702 CPU sec values indicates good performance, so the impact of mimalloc will be interesting to compare too:

bwa mem - 7013s (legacy)
bwa-mem2 - 4349s
bwa-meme - 2342s

(grep 'reads in' job.out | perl -ne 'm/(\d+\.\d{3})/;$s+=$1;END{print qq{$s\n};};')

The real sec component sum:

bwa mem - 245s (legacy)
bwa-mem2 - 170s
bwa-meme - 97s

(grep 'reads in' job.out | perl -ne 'm/,\s+(\d+\.\d{3})/;$s+=$1;END{print qq{$s\n};};')

Keiran Raine · Answer 5 · Thu Apr 28 2022 18:56:19 GMT+0800 (China Standard Time)

(edited previous comment to add legacy bwa counts and fix reg-exps for longer times)

Young-mok Jung · Answer 6 · Mon May 02 2022 18:19:39 GMT+0800 (China Standard Time)

I ran some benchmarks regarding mimalloc and below are the results.

Env: Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz, 16 physical core (32vCPU)
Dataset: single-end ERR194147_1.fastq
Config: 32 threads

Findings:

Mimalloc does not require much memory
Performance can be affected even with a small number of threads (32 in this case). In another machine with 48 physical core, the performance was not degraded a lot using 48 threads.
It would be better to use mimalloc in general.


Alignment (without disk IO)	Mimalloc	No mimalloc	Seeding	Mimalloc	No mimalloc
BWA-MEME	2518 (sec)	3122	BWA-MEME	949 (sec)	1428
BWA-MEM2	3533.25	3829.96	BWA-MEM2	1970	2117

Memory (GB)	Mimalloc	No mimalloc	BSW	Mimalloc	No mimalloc
BWA-MEME	156.8 (GB)	152.5	BWA-MEME	1143 (sec)	1207
BWA-MEM2	49.1	45.5	BWA-MEM2	1161	1230

max resident memory measured with /usr/bin/time -v

Keiran Raine · Answer 7 · Tue May 03 2022 23:58:43 GMT+0800 (China Standard Time)

This looks like a great addition again. Do you have an idea when it will be possible to get this pushed out to conda?

I'm currently looking at pipelines incorporating various flavours of bwa-mem[2,e]? and I'm pretty sure that you'll need an instance of mbuffer between bwa-meme and samtools fixmate | samtools sort as it's already close to flat out with bwa-mem2 on 32 threads. It will be worth noting this in the README as at the speeds being achieved it isn't simply a drop in replacement (unless no pipes are used). If people want to use with bwa-postalt.js it's better to write to disk (lz4 compressed, gzip too slow, results in blocking) and start a lower resource job

Young-mok Jung · Answer 8 · Wed May 04 2022 10:02:52 GMT+0800 (China Standard Time)

We will update the mimalloc and readme within a few days! I'll notice it here as soon as it's updated.

That's an interesting point, thank you for sharing your tips.

Young-mok Jung · Answer 9 · Fri May 06 2022 09:39:48 GMT+0800 (China Standard Time)

My PR in Bioconda package was merged few hours ago, now BWA-MEME uses mimalloc as default (also the master branch).

Keiran Raine · Answer 10 · Fri May 06 2022 16:04:28 GMT+0800 (China Standard Time)

I'll run some tests and give findings on streaming data into samtools etc.

Keiran Raine · Answer 11 · Thu May 12 2022 18:44:14 GMT+0800 (China Standard Time)

1.0.4, Intel skylake, 32 threads, 200GB RAM allocated via scheduler. 30X WGS from paired fastq, bwa processing includes presence of *.fasta.alt

Real world use case:

#!/bin/bash -ue
rm -f sorttmp*
set -o pipefail
bwa-meme mem -7 -K 100000000 -t 32 \
 -R '@RG\tID:1\tLB:NA24385_lib\tPL:ILLUMINA\tSM:NA24385\tPU:R1_L1' \
 Homo_sapiens_assembly38.fasta NA24385-30X_1.fastq.gz NA24385-30X_2.fastq.gz \
 | samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
 | samtools reheader -c 'grep -v ^@SQ > tmp-head && cat Homo_sapiens_assembly38.dict tmp-head' - \
 | samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 32 - > sorted.bam

ALG	CPU(s)	WALL(s)	REAL	%CPU	peak_rss	peak vmem
bwa	305546.745	9966.03700000001	3h 16m 15s	2655.9%	43.5 GB	48.3 GB
mem2	184099.304	6197.36299999999	2h 6m 13s	2520.8%	60.3 GB	78.4 GB
meme	137476.289	5272.686	1h 53m 43s	2147.1%	164 GB	181.5 GB

CPU(s) and WALL(s) use the pattern matches from comment above.
REAL is the wall time for the full pipeline above as are peak_*

As you can see due to the larger fraction of runtime specific to the sort process as described above it is worth considering changing the process to minimise the CPU wastage, introducing additional steps in a workflow:

Map with rapid compression:

#!/bin/bash -ue
set -o pipefail
bwa-meme mem -7 -K 100000000 -t 32 \
 -R '@RG\tID:1\tLB:NA24385_lib\tPL:ILLUMINA\tSM:NA24385\tPU:R1_L1' \
 Homo_sapiens_assembly38.fasta NA24385-30X_1.fastq.gz NA24385-30X_2.fastq.gz \
 | lz4 -c1 > mapped.sam.lz4

Correct and sort 4cpu, 8GB:

#!/bin/bash -ue
set -o pipefail
lz4 -cd mapped.sam.lz4 \
 | samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
 | samtools reheader -c 'grep -v ^@SQ > tmp-head && cat Homo_sapiens_assembly38.dict tmp-head' - \
 | samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 4 - > sorted.bam

This would be essential should you want to introduce bwa-postalt.js to the flow (consumes bwa output ~1/3 required speed). If not the initial example is most likely sufficient, ~15 min extra run time on a 32 cpu instance. for 30x is mostly negated by the overhead of additional I/O, starting a new job with lower resource requirements and the additional 40-60 minutes of run time.

Young-mok Jung · Answer 12 · Thu May 12 2022 23:14:17 GMT+0800 (China Standard Time)

Thank you for your valuable tips and suggestions! We really appreciate it.

To summarize,

pipeline of (alignment, fixmate, reheader, sort) results in low CPU utilization (bottleneck on sorting operation).
An alternative way is to run alignment alone (with CPU fully utilized), and get the compressed alignment outputs to the low resource executor (i.e. container with 4 CPU, 8GB). But results in longer time, because of Disk I/O + container startup time.

I think the first option pipelining the whole thing is the best option in general. I would notice this result in README within few days.

Improving sorting throughput looks necessary for bwa-postalt.js pipeline. I will look into it and maybe come up with a solution in the future.

Keiran Raine · Answer 13 · Thu May 12 2022 23:17:31 GMT+0800 (China Standard Time)

FYI, bwa-meme has a peak efficiency of ~2500 when using 32 cores piped into bwa-meme ... | lz4 -c > mapped.sam.lz4.

bwa-postalt.js consumes the raw sam output from bwa.

Young-mok Jung · Answer 14 · Tue Oct 04 2022 15:45:15 GMT+0800 (China Standard Time)

Edit: Samtools sort have higher throughput in general, using mbuffer (same amount of buffer used for samtools sort) would resolve the write hang issue described below.

Hi, I recently looked into Samtools Sort code and found thing that might improve the bottleneck in the sorting step.

I found there is “Write hang” problem when using large memory buffer for Samtools Sort pipelined with BWA-MEME.

Samtools Sort works in two phases. phase 1 is done concurrently with BWA-MEME, phase 2 is executed after BWA-MEME alignment is finished.

Phase 1. Prepare small temporary bins. For given memory buffer (set by argument -m and -@), Samtools-Sort reads input and fills in the buffer. When memory buffer is full, divide the buffer by number of threads, sort it, and each thread writes temporary file (last processed bins are kept in memory without writing to files).

Phase 2. Open all temporary bins (all files from disk and memory), merge-sort the bins using heap (repeat write output, read temp file, heapsort) in single-thread (compression is multi-threaded).

The “Write hang” happens in BWA-MEME, when Samtools Sort is writing temporary files (phase 1). During the writing stage in Samtools Sort, it does not read input, which makes BWA-MEME write function to hang. In particular, when large memory buffer is used in Samtools Sort (e.g. 20G= -m1G x -@ 20 option), periodically at some point, BWA-MEME waits for the Samtools Sort process to write all memory buffer to temporary files (compressing and writing 20GB memory buffer to disk).

Simple solution is to use small memory buffer (-m 40m -@10 in my machine), but this can affect the time in phase 2, where using large memory can reduce the disk I/O. However, it can still improve the overall CPU usage in certain cases.

Below is my experiment results ((32 thread BWA-MEME, 20x paired read alignment, machine with HDD ~160MB/sec disk I/O).

using small memory buffer improves "write hang" problem. This hides Samtools-Sort phase 1 time inside pipeline of BWA-MEME.
Large memory can benefit Samtools Sort phase 2.

(second)	Time taken for disk I/O in MEME and Samtools Sort Phase1	Time taken for Samtools Sort Phase2
meme only	151.78
meme + Sort -m 2G -@ 10	1195.08	3047
meme + Sort -m 40m -@ 10	308.26	2954
meme + Sort -m 2G -@ 20	1035.83	2276
meme + Sort -m 40m -@ 20	402.19	3,091

used command like bwa-meme mem -t 32 -7 ... | samtools sort -m 2G -@20 -l1 -o ~/tmp/out.bam -