Pipeline to identify accelerated regions given a multiple alignment of genomes.
Updated to Nextflow DSL 2 in Feb. 2024. Please report any bugs you find by opening an issue. Thanks!
Please refer to Pollard et al. 2006, Lindblad-Toh et al. 2011 and Hubisz & Pollard 2014 for more information on accelerated regions and methods behind this pipeline. For more information on Nextflow please refer to its documentation.
How to call accelerated regions with this pipeline:
1.) clone the repo
2.) install the required packages (see wiki)
3.) make a copy of and adapt the project file (example: ARs_config_20way_example_human.yml) to fit your project goals
- species: a text file containing a newline-delimited list of species to include in the analysis; this can be a subset of or all the species in your multiple alignment
- maf_path: the path to your multiple alignment files which must be in MAF format
- phast_path: path to your installation of PHAST; this is specified because we recommend using the GitHub version of PHAST, and this way you don't have to alter your $PATH
- outdir: where you want the outputted files to be saved
- init_tree: a rooted, bifurcating, Newick-formatted tree
- species_of_interest: the species in which you want to identify lineage-specific accelerated elements
- chrom_bed_path: BED files specifying the size of each chromosome analyzed, in the reference frame of the MAF reference sequence; these should be named as e.g. "chrX.bed"
- synteny_filter_file: BED-formatted (not gzipped) file of syntenic regions that will be intersected with the phastCons elements
- ar_filters_path: path to a gzipped BED file containing anything you want to exclude from the accelerated region analysis, e.g. repetitive elements, blocklists, etc.
- max_p: maximum Benjamini-Hochberg-adjusted p-value you would like to consider for accelerated regions (the scored phastCons elements are also returned, so you can change your mind about this value post hoc)
- random_seed: random seed used for the phyloP step for reproducibility of results
- target_coverage: phastCons parameter (aka gamma) that indicates the fraction of the query genome expected to be conserved (a value between 0 and 1)
- expected_length: phastCons parameter (aka omega) that indicates the expected length of conserved elements
- rho: phastCons scaling parameter indicating how to scale the neutral tree to obtain the conserved tree (see Siepel et al. 2005 for more details)
- auto_neutral_model: path to the neutral model for autosomes (not ending in "X", "Y", or "M", so this may need to be adjusted for some species) in ".mod" format
- nonauto_neutral_model: path to a directory containing the neutral models for non-autosomes (ending in "X", "Y", or "M", so this may need to be adjusted for some species) in ".mod" format; these should be named e.g. "chrX.mod"
- min_decile: length-normalized log odds score threshold for phastCons elements; a higher decile generally indicates higher conservation (see Siepel et al. 2005 and the phastCons documentation for more details)
4.) adjust the nextflow config file to match your operating environment (we have provided a sample config for an SGE system)
5.) run the pipeline (sample command: nextflow run call_ARs.nf -w "hars_workdir/" -profile local -params-file ARs_config_20way_example_human.yml)
The example files indicated are for the UCSC 20-way primate alignment, which runs on a 2020 MacBook Pro in a few minutes, and is good for testing.
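Before launching a run, it can help to check that your project file defines every parameter listed in step 3. The Python sketch below is not part of the pipeline. It assumes the project file has already been loaded into a dict (for example with a YAML parser), and the helper names missing_keys and neutral_model_for are hypothetical. The second helper only illustrates the naming rule described above for autosomal versus non-autosomal neutral models; the pipeline's own logic may differ.

```python
import os

# Parameter names copied from the list in step 3.
REQUIRED_KEYS = [
    "species", "maf_path", "phast_path", "outdir", "init_tree",
    "species_of_interest", "chrom_bed_path", "synteny_filter_file",
    "ar_filters_path", "max_p", "random_seed", "target_coverage",
    "expected_length", "rho", "auto_neutral_model",
    "nonauto_neutral_model", "min_decile",
]


def missing_keys(params):
    """Return the required parameter names absent from params, in list order."""
    return [key for key in REQUIRED_KEYS if key not in params]


def neutral_model_for(chrom, params):
    """Pick a neutral model path for a chromosome name such as 'chr7' or 'chrX'.

    Assumption: names ending in X, Y, or M are non-autosomes, and their
    models live in the nonauto_neutral_model directory as '<chrom>.mod'.
    """
    if chrom.endswith(("X", "Y", "M")):
        return os.path.join(params["nonauto_neutral_model"], chrom + ".mod")
    return params["auto_neutral_model"]
```

If missing_keys returns a non-empty list, add those entries to your copy of the project file before running step 5.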
Outputs
If the pipeline runs successfully, it outputs two files: final_ARs_(random_seed).bed, a BED file of the accelerated regions, and scored_phastCons_(random_seed).txt, a tab-separated file of the filtered, acceleration-scored phastCons elements. The latter can be useful in statistical analyses as a null model, or if you want to reselect HARs at a different FDR. The intermediate steps also write a number of files; this can be turned off if desired.
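To reselect regions at a different FDR, you can recompute Benjamini-Hochberg-adjusted p-values and apply a new cutoff. The sketch below shows the standard adjustment in plain Python. The column layout of the scored file is not described here, so the code takes already-parsed (name, raw p-value) pairs as an assumption, and the function names bh_adjust and select_regions are hypothetical. If the scored file already contains adjusted p-values, only the thresholding step is needed.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg-adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_regions(rows, max_p):
    """Keep names from (name, raw_p) rows whose adjusted p-value is <= max_p."""
    adjusted = bh_adjust([p for _, p in rows])
    return [name for (name, _), q in zip(rows, adjusted) if q <= max_p]
```

For example, select_regions(rows, 0.05) would keep the regions passing a 5% FDR, where rows is a list of pairs you have parsed from the scored file.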