bw2 / ExpansionHunter

A tool for estimating repeat sizes

Description

This modified version of ExpansionHunter introduces the following changes:

  • adds an optional --locus arg for only processing a single locus from the variant catalog.
  • changes the "Flanks can contain at most 5 characters N but found x Ns" error to a warning.
    • This allows ExpansionHunter to run to completion without exiting on these loci and makes it easier to process large catalogs without having to find and exclude these loci first.
  • supports gzip-compressed input catalogs
  • supports direct access to remote bam/cram and fasta files from Google Cloud Storage or S3, so they don't have to be downloaded first.
    • for access to private buckets, set environment variable:
      GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
    • for access to requester-pays buckets, set environment variable:
      GCS_REQUESTER_PAYS_PROJECT=<your gcloud project>
  • optimizes ExpansionHunter's "seeking" analysis mode, yielding a 1.5x to 3x speed increase without changing the output.
    • it works by introducing an in-memory read cache that reduces the number of disk accesses required to retrieve mismapped mate pairs.
    • by default, the cache is reset after each locus, leading to a modest speedup with negligible memory overhead.
    • the new --cache-mates option reuses the cache across loci, leading to a more significant speed increase at the cost of higher memory usage (typically 1-2GB for catalogs with 100s to 1000s of loci).
    • when splitting a large variant catalog into multiple shards, it's important to presort the loci by their normalized motif (the cyclic shift of a motif that is alphabetically first, i.e. AGC rather than CAG). This ensures that loci with the same motif are processed in the same shard, which increases cache hit rates and therefore speed with this optimization.
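
The presorting step described in the last bullet can be sketched in Python as follows. This is a minimal illustration, not code from this repository: the function names and the locus dictionaries with a "motif" key are hypothetical, so a real variant catalog would first need its motifs extracted into this form.

```python
def normalize_motif(motif: str) -> str:
    """Return the alphabetically first cyclic shift of a motif (CAG -> AGC)."""
    motif = motif.upper()
    if not motif:
        raise ValueError("motif must not be empty")
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def shard_by_motif(loci, num_shards):
    """Sort loci by normalized motif, then cut the list into contiguous shards.

    `loci` is a list of dicts with a hypothetical "motif" key (an assumption
    made for this sketch, not the catalog format itself).
    """
    ordered = sorted(loci, key=lambda locus: normalize_motif(locus["motif"]))
    shard_size = max(1, -(-len(ordered) // num_shards))  # ceiling division
    return [ordered[i:i + shard_size] for i in range(0, len(ordered), shard_size)]


if __name__ == "__main__":
    example_loci = [
        {"LocusId": "a", "motif": "CAG"},
        {"LocusId": "b", "motif": "GGCCCC"},
        {"LocusId": "c", "motif": "GCA"},
        {"LocusId": "d", "motif": "CCG"},
    ]
    for shard in shard_by_motif(example_loci, 2):
        print([locus["LocusId"] for locus in shard])
```

Because the shards are contiguous slices of the sorted list, a motif group can still straddle a shard boundary; most loci with the same normalized motif nevertheless end up together.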

Citation

If you use this modified version of ExpansionHunter, please cite:

Insights from a genome-wide truth set of tandem repeat variation
Ben Weisburd, Grace Tiao, Heidi L. Rehm
bioRxiv 2023.05.05.539588; doi: https://doi.org/10.1101/2023.05.05.539588

Expansion Hunter: a tool for estimating repeat sizes

There are a number of regions in the human genome consisting of repetitions of a short unit sequence (commonly a trimer). Such repeat regions can expand to a size much larger than the read length and thereby cause disease. Fragile X Syndrome, ALS, and Huntington's Disease are well-known examples.

Expansion Hunter aims to estimate sizes of such repeats by performing a targeted search through a BAM/CRAM file for reads that span, flank, and are fully contained in each repeat.
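
To make the three read categories concrete, here is a rough Python sketch that classifies a read purely by its aligned coordinates relative to a repeat. It is a simplification for illustration only: the function name, the 0-based half-open coordinate convention, and the category labels are assumptions, and the tool itself works from read alignments rather than a bare interval comparison.

```python
def classify_read(read_start, read_end, repeat_start, repeat_end):
    """Classify a read by how its aligned interval relates to a repeat.

    All coordinates are 0-based and half-open (an assumption of this sketch).
    """
    if read_end <= repeat_start or read_start >= repeat_end:
        return "unrelated"   # no overlap with the repeat at all
    if read_start < repeat_start and read_end > repeat_end:
        return "spanning"    # covers the whole repeat plus both flanks
    if read_start >= repeat_start and read_end <= repeat_end:
        return "in-repeat"   # fully contained in the repeat
    return "flanking"        # overlaps one flank and part of the repeat


if __name__ == "__main__":
    repeat = (100, 160)
    for read in [(50, 200), (120, 150), (80, 130), (0, 50)]:
        print(read, classify_read(*read, *repeat))
```

Spanning reads can size a repeat directly, while flanking and in-repeat reads are what remain informative once the repeat grows longer than the read length.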

Linux and macOS operating systems are currently supported.

License

Expansion Hunter is provided under the terms and conditions of the Apache License Version 2.0. It relies on several third-party packages provided under other open source licenses; please see COPYRIGHT.txt for additional details.

Documentation

Installation instructions, usage guide, and description of file formats are contained in the docs folder.

Companion tools and resources

  • A genome-wide STR catalog containing polymorphic repeats with similar properties to known pathogenic and functional STRs
  • REViewer, a tool for visualizing alignments of reads in regions containing tandem repeats

Method

The method is described in the following papers:
