achon / sramm

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

SRAMM

A command line utility to compute additional short read mapping metrics for unambiguous filtering and visualization.

Table of Contents

Features

  • Computes additional mapping metrics for aligned short read data (SAM/BAM/CRAM).
    • Mapped Identity (MI)
    • Unique Repeat (UR)
    • UnMapped (UM)
  • Fast and efficient: Uses 2+ threads and < 1GB RAM.
  • Compute statistics, filter, and generate graphs for the alignments of 10M reads in less than 7 minutes.
  • Can be parallelized at the program (increasing number of CPUs) or at the data level (splitting data).
  • Combinations of different concept based mapping metrics allows for unambiguous filtering for various downstream analyses.
  • Additionally works with VariantGraph. Ideal if population variant information is available.

Drawbacks

  • I/O bottleneck: Due to system caches/buffers and disk speed, increasing number of threads does not necessarily improve performance.
  • I/O component and parsing BAM entries results in a low compute/data ratio resulting in a flat run time dependent upon number of reads.
  • Relies on 'AS' tags which not all read aligners output.
  • 'AS' tag is heavily aligner dependent and thus SRAMM metric distributions may differ based upon alignment tool used.

Installation

Dependency requirements:

  • Python 2.6+ or 3.6+
  • Numpy (For general array functions, histograms)
  • pysam (For reading SAM/BAM/CRAM files)
  • matplotlib (For generating graphs)

Versions required should follow your Python installation.
Standard installation:

$ pip install numpy pysam matplotlib

Optionally you may need samtools and or picard.

$ sudo apt install samtools

Usage

SRAMM is intended as a command line utility and not a python package. Download or clone and run the self-contained python file.

$ python3 sramm.py -h
usage: SRAMM.py [options] process_type input_file

Short Read Alignment Mapping Metrics

positional arguments:
  process_type          Process type: stats, filter, graphs
  input_file            Input file, depends on process and input_type

optional arguments:
  -h, --help            show this help message and exit
  -output_file OUTPUT_FILE
                        Output stat file, default=sramm_output.txt
  -output_prefix OUTPUT_PREFIX
                        Output prefix for all files, default=''
  -end END              single or paired end, default=paired
  -num_reads NUM_READS  Number of reads worth of alignments to keep in memory, default=10000
  -num_cpu NUM_CPU      Number of cpu, default=2
  -graphs_flag GRAPHS_FLAG
                        Generate graphs flag, default=False
  -filter_flag FILTER_FLAG
                        Filter flag, default=False
  -mapq_max MAPQ_MAX    Max MAPQ value in dataset, default=60
  -score_num_bins SCORE_NUM_BINS
                        Number of bins for SRAMM scores for graphing
  -map_range MAP_RANGE  Range of the map score to filter (accept), default=0.0,1.0
  -uniq_rep_range UNIQ_REP_RANGE
                        Range of the uniq_rep score to filter (accept), default=0.0,1.0
  -unmap_range UNMAP_RANGE
                        Range of the unmap score to filter (accept), default=0.0,1.0
  -num_alns_range NUM_ALNS_RANGE
                        Range of the number of alignments to filter (accept), default=0.0,100.0

Visit https://github.com/achon/sramm for more information.

Test files:

Included is a small, 10k reads, BAM file of 101bp paired end reads from NA12878 with hg38 generated by BWA-MEM. Additionally, the output files are included. Files contained in test/:

bwa_10k_pe.bam
sramm_output.txt
sramm_output_filt.txt
graphs_hist_data.txt

The BAM file is generated by BWA-MEM using default parameters. Note. Input files must be sorted by read name and not position. collate is newer and supposedly faster, however sort should work even for older versions of samtools.

samtools collate <unsorted.bam> -o <sorted_by_readid.bam>

or

samtools sort -n <unsorted.bam> > <sorted_by_readid.bam>

A test file is included in this Github repository:

test/bwa_10k_pe.bam

Enter the directory with the test files.

Compute stats only:

$ python3 sramm.py stats bwa_10k_pe.bam

The provided BAM file is paired end reads and sorted by read ID. Output file will be sramm_output.txt. It has the following format by columns:

  1. read_id
  2. set of alignment scores or alignment score pairs
  3. number of alignment(s) or pair(s) of alignments
  4. MAPQ for each alignment or pair of alignments
  5. SRAMM metrics: Mapped Identity (MI), Unique-Repeat (UR), UnMapped (UM)
ERR174324.1	(99, 101)	1	60	0.9901,1.0,0.0099
ERR174324.2	(99, 100)	1	60	0.98515,1.0,0.01485
ERR174324.3	(99, 101)	1	60	0.9901,1.0,0.0099
ERR174324.4	(99, 100)	1	60	0.98515,1.0,0.01485
ERR174324.5	(99, 101)	1	60	0.9901,1.0,0.0099

Compute stats, filter, and generate graphs

$ python3 sramm.py stats bwa_10k_pe.bam -filter_flag=True -graphs_flag=True -map_range=0.9,1.0

This filters for reads that have a mapping score of at least 90% for the primary alignment based upon AS. It generates a filtered stats file, sramm_output_filt.txt. Lastly, it generates the 3 graphs and graph data file, graphs_hist_data.txt.

Using filtered reads

Once you have filtered your reads based upon your desired criteria, you can employ any number of methods to filter the reads for the next step of your analysis.

$ awk '{print $1}' sramm_output.txt > read_filtered.txt

You can then use samtools, picard, or a script to filter your BAM file.

java -jar picard.jar FilterSamReads I=bwa_10k_pe.bam O=bwa_10k_pe_filt.bam READ_LIST_FILE=read_filtered.txt FILTER=includeReadList

And then continue your analysis with your filtered BAM file.

VariantGraph output

Variant Graph does not natively report 'AS' tags in their BAM files. There is a 'refpos-table' output option using the '-v' flag that SRAMM has a parser for:

vg map -v ...

Then rename the file to have a .vg file extension that SRAMM will parse.

python3 sramm.py stats vg_10k_pe.vg

Use Cases

Below are some generalized use cases SRAMM was developed for.

Mapped Identity and Uniquely Mapped Reads

As mentioned in the discussion, the concept of uniquely mapped reads is not well standardized currently. Many pipelines use different aligners with various references for different analyses with different criteria. For example, many pipelines take reads with perfect MAPQ scores, which is 60 if using BWA-MEM or other common aligners. This however results in reads that are believed to map to one location under the concept of MAPQ, which is not necessarily what the end user wants. A read can be assigned a MAPQ of 60, but not have a perfect alignment (identity) score. Additionally, the paired read may have one read of the pair with many additional alternate and low scoring alignments, yet the pair will still have a MAPQ of 60. Or, one read of the pair maps to decoy sequences with varying scores, yet are discounted in MAPQ due to being a decoy sequence. Thus, this concept is not unambiguous nor well defined in a general use case setting.

By using SRAMM, we can enforce any range of mapped identity through MI scores and number of repeats through UR scores as well as through the sheer number of alignments. The user for example could use a MI score of 1.0 and a UR score of 1.0 with only 1 alignment pair reported along with a MAPQ of 60. This would truly select read pairs that are uniquely mapping in the sense that there are no alternate alignments, has a mapped identity of 1.0, and is not in a decoy sequence. Additionally, this may be too strict which may depend upon the reference. If the reference contains decoy sequences, then this is too strict as the decoy alignment will reduce the UR score and increase the number of alignments, however the MAPQ will still be 60. This highlights the further advantage to using multiple scores to remove ambiguity in read filtering. If using a common reference for a species, there will be individual differences. In that case, using an MI and UR of [0.95, 1.0] may be more appropriate with the number of alignments not set to 1. However, if using a population or variance aware reference, the MI of 1.0 may be desired depending upon how strict the selection criteria is and given the quality scores of the technology.

Another possibility is that we want uniquely mapped reads, but our definition is a bit less strict due to a lower quality reference genome. Also, this greatly depends upon how the short read aligner outputs alignments, alternate alignments, and the threshold for reporting those alignments. BWA-MEM for example will report alternate alignments for one read of a read pair. A user in this case can allow a lower UR, but still greater than 0.75. This means that it'll allow up to 1 alternate alignment weighted by the alignment score. A UR of 0.75 allows any 1 alternate alignment whereas a UR of 0.90 allows 1 alternate alignment with a very low alignment score.

Candidate Repetitive Reads

This is the opposite case of the above. Here, we're interested in reads with low UR scores. That is, read pairs with a primary alignment and alternate alignment(s) ideally with high alignment scores. This typically indicates how many possible alignments the pair has or directly through the number of alignments. Also, it can be combined with MI score to guarantee that the primary alignment is of a certain mapped identity. It can also be combined with MAPQ to remove decoy alignments as well. Additionally, it can be even further narrowed down by using a low UM score which would denote that the set of alignment scores are high. Do note this greatly depends on the alignment tool and the reporting of alternate alignments. Many tools do not report alternate alignments past a relatively high threshold or at all.

Mapped Identity and MAPQ - Decoy sequence mapping

If a given read pair has a set of perfect alignments to the reference in only one location, it will have the max MAPQ value eg 60 assuming the alignment has high identity and not in a decoy sequence. MAPQ is sensitive to where the alignment is in the reference and will assign 0 to a perfect alignment in a decoy sequence even if the alignment is a perfect match. By using MI with MAPQ, we can remove ambiguity when the MAPQ is 0 since it may be informative to know how well a read maps anywhere in the provided reference rather than just that it maps to a decoy sequence.

###Unmapped Reads Most read aligners themselves will give an option to report all reads even if there's no alignment while others will simply not report the reads at all. But, this reporting is based upon MAPQ and or alignment scores and there usually is an option to reduce the threshold and or report all alignments. Then, by using UM scores, we can identity reads with a set of low alignment scores and or multiple alignments. A user can identify the set of reads that have low MI scores, low UR scores, or directly high UM scores. This can be used in combination with a read aligner's reporting threshold to select the set of reads to that do not map well according to user criteria.

Reference

Chon, A. and Huang, X., 2021. Sramm: short read alignment mapping metrics. arXiv preprint arXiv:2107.02935. International Journal of Bioinformatics and Biosciences, 11(1), ½.

License

You are free to use the software in any way.

MIT license

Issues

To report bugs, visit the Issues page.

About


Languages

Language:Python 100.0%