Code for "A unified genealogy of modern and ancient genomes"
This repo contains code used in "A unified genealogy of modern and ancient genomes". This includes:
- simulation based validation of tsinfer and tsdate
- pipelines to generate a unified, dated tree sequence from multiple datasets
- empirical validation of the tree sequence using ancient genomes
- code to produce all non-schematic figures in the paper, as well as the interactive figure and supplementary video
These analyses are placed in subdirectories as follows:
all-data
contains all code for downloading and preparing real data, as well as the pipeline for creating tree sequencessrc
contains scripts to run analyses and validation on both real and simulated datadata
contains reference data needed for some analyses and is where the results of analyses performed on the real inferred tree sequences are stored. It is also where simulations evaluating mismatch parameters in tsinfer are performed.simulated-data
contains the results of all simulation-based analysestools
contains methods used to compare the accuracy oftsinfer
andtsdate
as well as tools to process variant data files
You must first clone this repo, including submodules as follows:
$ git clone --recurse-submodules https://github.com/awohns/unified_genealogy_paper.git
$ cd unified_genealogy_paper
Required Software
Installing required python modules
The Python packages required are listed in the requirements.txt
file. These can be
installed with
$ python -m pip install -r requirements.txt
Please use Python 3.8 (`numba' fails to install using pip with Python 3.9).
cartopy
(required for generating figures with maps) and ffmpeg
(required to create the movie) must be installed
separately with conda as follows:
$ conda install -c conda-forge cartopy
$ conda install -c conda-forge ffmpeg
Required software for preparing real data
We require BCFtools, SAMtools, and HTSlib, as well as
convertf, Picard,
eigensoft and plink
to prepare the variant data files from
the 1000 Genomes Project, Human Genome Diversity Project, Simons Genome Diversity Project, and ancient DNA
datasets for use with tsinfer
and tsdate
.
We compare our methods with Genealogical Estimation of Variant Age (GEVA) and
Relate.
All of these tools are kept in the tools
directory and can be downloaded and built using
$ cd tools
$ make all
Inferring Tree Sequences from Real Data
Please see the README in the all-data
directory for details on inferring the combined tree sequence.
Running Simulations
To generate data required for simulation-based figures and to plot the figures themselves, follow this general process: generate data, run analyses, and plot the results. The following example will generate Figure 1c:
$ python src/run_evaluation.py tsdate_neutral_sims --setup
$ python src/run_evaluation.py tsdate_neutral_sims --inference
$ python src/plot.py tsdate_neutral_sims
The first command runs the simulations for each evaluation. The simulations are stored in the simulated-data
directory
The second command performs inference on the results of the simulations. This will take multiple
days to run for all simulations in the paper. The results are stored as
csv files in the simulated-data
directory. The third command plots the csv files and saves the resulting figures to the
figures
directory.
Running all evaluations
To produce all the simulation data in our paper, run the following, in order
$ python src/run_evaluation.py --setup all
$ python src/run_evaluation.py --infer all
You can speed up the inference step by using multiple processors, specified using the -p
flag.
For instance, on a 64 core machine, using all cores:
$ python src/run_evaluation.py infer -p 64 all # will take a few days
The mismatch simulations are produced separately, using the Makefile in the data
directory:
$ cd data
$ make mismatch
Analyzing inferred tree sequences
Once you have inferred tree sequences and the results are in the all-data
directory, run functions in src/analyze_data.py
to generate data for non-simulation based figures. The figures themselves are plotted using src/plot.py
Before running these analyses, please download age estimates for 1000 Genomes variants from Relate'' and
GEVA'' using the following:
$ cd data
$ make relate_ages
$ make geva_ages
You can then run all analyses using real data with:
$ python src/analyze_data.py all
Plotting figures
The final figures can then be plotted using
python src/plot.py all