This repository contains the snakemake pipeline for generating the Genome-in-a-Bottle stratifications. This pipeline is designed to be reference agnostic. For the complete stratifications generated for existing references such as CHM13 or GRCh38, see here.
Note that this pipeline only applies to stratifications starting with version v3.2. For older versions, please see here for the equivalent methodological reference.
All stratifications can be generated using a snakemake pipeline from a reference fasta file and accompanying set of source files.
The following stratication categories can be generated programmatically by this pipeline:
- Low Complexity
- Segmental Duplications
- XY
- Functional
- GC Content
- Mappability
- Telomeres
- Union
For other categories (such as ancestry) that may be derived from less standardized means, it is possible to include raw bed files as-is. In practice, this feature is often used for stratifications that are difficult to create and that exist in previous versions (assuming they don't need to change).
The only software requirement to install dependencies is conda
(or mamba
).
From the root of this repo, run the following:
mamba env create -f env.yml
This will create and env called giab-strats
. Activate it with:
conda activate giab-strats
If only generating stratifications for a few small chromosomes (as is the case for testing and developing), the entire pipeline can be run in several minutes on a consumer-grade laptop (8 cores, 16GB RAM). The only exception is generating mappability (which runs GEM) which will require 10-20 minutes to complete for even a small chromosome (like chr21).
TODO add cluster requirements
An example configuration can be found in config
(currently used for testing
and continuous integration). This contains two files: testing.yml
and
testing-full.yml
which are used during integration testing and double as
example configurations. testing.yml
is documented such so as to serve as a
starting point for users who wish to run this pipeline themselves.
Conceptually, the configuration is organized into two levels: stratifications
and builds
. A stratification
encodes the reference FASTA file along with
associated files used to generate the stratifications (which may be either
remote URLs are local files depending on what is available). Each
stratification
has one or more build
s which specify which stratifications
levels (eg LowComplexity, GCContent, etc) and chromosomes to include in the
final output.
To run the pipeline, use the following command (assuming the environment is set up):
snakemake \
--use-conda \
-c <number_of_cores> --rerun-incomplete \
--configfile=path/to/your/config/config.yml \
all
Note that if any source files are specified as local in the configuration, they must exist or the pipeline will refuse to run.
All output will either be in resources
(downloaded files) or results
(processed files). For most users the only important files will be in
results/final
, which in turn has each build and reference specified as
reference_key@build_key
(see configuration section and config/testing.yml
for more details on what these mean). Within each of these files are the
stratification BED files and associated metadata.