This repository contains scripts used to drive the experiments and compile the figures and tables for the manuscript "Scaling read aligners to hundreds of threads on general-purpose processors." All relevant scripts are in the thread_scaling/scripts subdirectory.
Links for downloading the reads are in Supplementary Note 2. The process of generating the reads involves downloading the source read files, sampling 100M reads from each, and randomizing the overall order of the reads.
The script we used to generate these shuffled samples is:
reads.py
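The sample-then-shuffle step described above can be sketched as follows. This is a minimal illustration, not the code in reads.py: it assumes 4-line FASTQ records and uses reservoir sampling, which may differ from the method the repository uses. The function names and the tiny in-memory input are hypothetical.

```python
import random
from itertools import islice


def fastq_records(lines):
    """Yield 4-line FASTQ records from an iterable of lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return  # end of input (or a truncated trailing record)
        yield record


def sample_reads(lines, k, seed=0):
    """Reservoir-sample k records in one pass, then shuffle their order."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(fastq_records(lines)):
        if i < k:
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    rng.shuffle(reservoir)
    return reservoir


# Hypothetical toy input: ten 4-line records.
demo = []
for n in range(10):
    demo += [f"@read{n}\n", "ACGT\n", "+\n", "IIII\n"]

sampled = sample_reads(demo, k=3, seed=1)
print(len(sampled))
```

For the experiments, k would be 100M per source file; a single pass like this avoids holding the full source file in memory, although the reservoir itself still has to fit.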
The scripts used to submit SLURM jobs that run this script repeatedly and then concatenate the results are:
reads.sh
reads_cat.sh
Read file sizes were measured with ls -l and are reported in Supplementary Table 2.
Running times were measured for all thread counts and for every combination of (a) configuration (aligner and arguments), (b) system (KNL or Broadwell), and (c) paired-end status. Results are shown in Figures 3-5, Tables 2-4 and Supplementary Figures 1-3. Important scripts driving this process are:
master.py: master script for driving one or more configurations through a complete series of tests. It builds the various configurations with the appropriate preprocessor macros, prepares the read files for each run, conducts the runs, runs top and/or iostat in the background during runs to collect system measurements, and kills runs that exceed the time limit.
stampede_knl/*.sh: SLURM scripts for driving all the KNL-based configurations. These scripts depend on and invoke common.sh.
marcc_lbm/*.sh: SLURM scripts for driving all the Broadwell-based configurations. These scripts depend on and invoke common.sh.
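As a rough illustration of two of the responsibilities listed for master.py (running a monitor in the background and killing runs that exceed the time limit), here is a minimal sketch. It is not taken from master.py; the function name, its parameters, and the stand-in commands are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_with_monitor(cmd, monitor_cmd, monitor_log, time_limit_s):
    """Run cmd while monitor_cmd logs in the background.

    Returns (exit_code, timed_out). The run is killed if it exceeds
    time_limit_s seconds; the monitor is always stopped afterwards.
    """
    with open(monitor_log, "w") as log:
        monitor = subprocess.Popen(
            monitor_cmd, stdout=log, stderr=subprocess.DEVNULL
        )
        try:
            proc = subprocess.Popen(cmd)
            try:
                code = proc.wait(timeout=time_limit_s)
                timed_out = False
            except subprocess.TimeoutExpired:
                proc.kill()
                code = proc.wait()  # reap the killed process
                timed_out = True
        finally:
            monitor.terminate()
            monitor.wait()
    return code, timed_out


# Stand-in commands so the sketch runs anywhere. On Linux the monitor could
# be something like ["top", "-b", "-d", "1"]; whether master.py uses those
# flags is an assumption, not something the repository text states.
sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
quick = [sys.executable, "-c", "pass"]
log_path = os.path.join(tempfile.mkdtemp(), "monitor.log")

print(run_with_monitor(quick, sleeper, log_path, time_limit_s=10))
print(run_with_monitor(sleeper, sleeper, log_path, time_limit_s=0.5))
```

Stopping the monitor in a finally clause means a failed or killed run does not leave a stray monitoring process behind on the compute node.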
Important configuration files governing these experiments are in .tsv files. Each line of each file defines the repository, tag, preprocessor macros, aligner command-line arguments, and multithreading/multiprocessing balances to use for a configuration. Specifically:
bt_base.tsv: defines the configurations for the Bowtie lock-type experiments described in Figure 3/Table 2.
bt.tsv: defines the configurations for all other Bowtie experiments, as described in Figures 4 and 5 and Tables 3 and 4.
bt2_base.tsv: like bt_base.tsv but for Bowtie 2.
bt2.tsv: like bt.tsv but for Bowtie 2.
ht_base.tsv: like bt_base.tsv but for HISAT.
ht.tsv: like bt.tsv but for HISAT.
bwa.tsv: defines the configurations for the BWA-MEM experiments described in Figure 5/Table 4.
These configurations are also described in Supplementary Note 1.
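To make the one-configuration-per-line layout concrete, the sketch below reads a tab-separated file into named records. The column order (repo, tag, macros, args, balance), the comment-line handling, and the sample row are all assumptions for illustration; check the actual .tsv files for the real layout.

```python
import csv
import io
from collections import namedtuple

# Hypothetical column order; the real files may order or name these differently.
Config = namedtuple("Config", "repo tag macros args balance")


def parse_configs(handle):
    """Parse tab-separated configuration lines into Config records."""
    configs = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        if len(row) != len(Config._fields):
            raise ValueError(f"expected {len(Config._fields)} fields, got {len(row)}")
        configs.append(Config(*row))
    return configs


# Made-up example row, not taken from any file in this repository.
sample = (
    "# repo\ttag\tmacros\targs\tbalance\n"
    "https://example.org/aligner.git\tv0.0\t-DEXAMPLE_MACRO=1\t--example-flag\tMT\n"
)
configs = parse_configs(io.StringIO(sample))
print(configs[0].tag, configs[0].macros)
```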
The thread count series used in the experiments are in:
marcc_lbm/thread_series.txt for all Broadwell series
stampede_knl/thread_series.txt for all KNL series
The KNL and Broadwell experiments write results to the stampede_knl/results and marcc_lbm/results subdirectories. These are tabulated into CSV files using the script:
tabulate.py
These CSV files are then used as inputs to the scaling_results.Rmd R Markdown notebook. We then run the notebook to generate all the thread scaling plots. To find the code for generating these plots, look in the following named code blocks in scaling_results.Rmd:
baseline_plots_all
baseline_plots_all_unp
baseline_plots_all_pe
parsing_plots_all
parsing_plots_all_unp
parsing_plots_all_pe
final_plots_all
final_plots_all_unp
final_plots_all_pe
Using the same data used to generate Tables 2-4 and Supplementary Tables 1-3, we used the peak_throughput_table code block in the thread_scaling/scripts/scaling_results.Rmd R Markdown notebook to compile a master table giving the peak throughput for every combination of configuration, system and paired-end status.
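The idea behind such a master table can be sketched in Python, although the repository does this in R. The sketch assumes throughput is reads divided by elapsed seconds and takes the maximum over thread counts for each (configuration, system, paired-end) group. The field names and the numbers are made up for illustration and are not experimental results.

```python
def peak_throughput(rows):
    """Map (config, system, paired) to (peak reads/sec, threads at the peak)."""
    peaks = {}
    for r in rows:
        key = (r["config"], r["system"], r["paired"])
        tput = r["reads"] / r["seconds"]  # assumed definition of throughput
        if key not in peaks or tput > peaks[key][0]:
            peaks[key] = (tput, r["threads"])
    return peaks


# Illustrative rows only; these are not measurements from the manuscript.
rows = [
    {"config": "cfg-a", "system": "KNL", "paired": False,
     "threads": 16, "reads": 1600, "seconds": 4.0},
    {"config": "cfg-a", "system": "KNL", "paired": False,
     "threads": 64, "reads": 6400, "seconds": 8.0},
    {"config": "cfg-a", "system": "KNL", "paired": False,
     "threads": 128, "reads": 12800, "seconds": 20.0},
    {"config": "cfg-a", "system": "Broadwell", "paired": True,
     "threads": 16, "reads": 1600, "seconds": 2.0},
]

for key, (tput, threads) in sorted(peak_throughput(rows).items()):
    print(key, tput, threads)
```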
Since top is run in the background during thread scaling experiments, we can parse the top log to find the peak resident set size, as plotted in Supplementary Figure 4. The script for doing this is:
thread_scaling/scripts/peak_res.py
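A minimal sketch of this kind of log parsing follows. It is not the code in peak_res.py. It assumes the default 12-column process lines of procps top in batch mode (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND), with RES in KiB unless it carries an m, g or t suffix. The function names and the sample log are hypothetical.

```python
# Multipliers to convert a suffixed RES value to KiB.
_UNITS = {"k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}


def res_to_kib(field):
    """Convert a RES field such as '345678' or '1.5g' to KiB."""
    suffix = field[-1].lower()
    if suffix in _UNITS:
        return float(field[:-1]) * _UNITS[suffix]
    return float(field)


def peak_res_kib(lines, command):
    """Return the largest RES (in KiB) seen for processes named command."""
    peak = 0.0
    for line in lines:
        fields = line.split()
        # Process lines start with a numeric PID; headers and summaries do not.
        if len(fields) >= 12 and fields[0].isdigit() and fields[11].startswith(command):
            peak = max(peak, res_to_kib(fields[5]))
    return peak


# Made-up log excerpt in the assumed layout.
log = """\
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4321 user      20   0 2000000   1000    500 R 100.0  0.1   0:01.00 aligner
 4321 user      20   0 2000000   1.5g    500 R 900.0  1.2   0:09.00 aligner
 9999 user      20   0  100000   9.9g    500 S   0.0  7.0   0:00.10 other
""".splitlines()

print(peak_res_kib(log, "aligner"))
```

Filtering on the command name keeps unrelated processes, such as the third line in the sample, from inflating the reported peak.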
The numbers of reads per thread used in each experiment, as shown in Supplementary Table 1, were determined manually, with the goal of making all runs last a minute or longer. These numbers were then coded into the scripts in thread_scaling/scripts/stampede_knl for the KNL experiments and thread_scaling/scripts/marcc_lbm for the Broadwell experiments.
check_blocked.py sanity-checks a file with padding appropriate for L-parsing.
get_reads.sh downloads all the read files at the links shown in Supplementary Note 2. The files are downloaded compressed, and you will have to decompress them before running the experiments.
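The decompression step can be scripted. The sketch below assumes the downloads are gzip-compressed files ending in .gz, which the text above does not confirm; adjust it if the files use another format. The helper name is hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_gz(path):
    """Decompress path (must end in .gz) next to itself; return the new path."""
    src = Path(path)
    if src.suffix != ".gz":
        raise ValueError(f"not a .gz file: {src}")
    dst = src.with_suffix("")  # drop the trailing .gz
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream, so large files need little memory
    return dst


# Self-contained demonstration on a tiny temporary file.
tmp_dir = Path(tempfile.mkdtemp())
demo_gz = tmp_dir / "demo.fastq.gz"
with gzip.open(demo_gz, "wb") as fh:
    fh.write(b"@read0\nACGT\n+\nIIII\n")

out_path = decompress_gz(demo_gz)
print(out_path.name)
```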