IBDmix

Detecting archaic sequence segregation in modern human genomes

Background

Admixture has played a prominent role in shaping patterns of human genomic variation, including gene flow with now extinct hominins like Neanderthals and Denisovans. We describe a novel probabilistic method called IBDmix to identify introgressed hominin sequences, which unlike existing approaches, does not use a modern reference population.

Usage

IBDmix consists of 3 steps:

Generate custom genotype file
Detect introgressed regions
Sort and filter results

Compilation

The provided CMake lists will automatically build the executables, but requires version 3.13 or greater, git and the repository to be cloned. To use:

git clone https://github.com/PrincetonUniversity/IBDmix.git
cd IBDmix
pwd
# /path/to/repo/IBDmix
mkdir build
cd build
cmake ..
cmake --build .

The executables generate_gt and ibdmix will be left in IBDmix/build/src.

Usage Details

Note that all chromosomes must be integers in the input vcfs and masked bed files. Vcf and mask files should be split by chromosome. See the included snakefile as an example pipeline implementation.

Generate Genotype

generate_gt has the following options:

-h, --help Print the help information and exit.
-a, --archaic The archaic vcf file. May contain multiple archaic samples but only one will be utilized for IBDmix. Must be uncompressed text.
-m, --modern The modern vcf file. Must be uncompressed text.
-o, --output The merged genotype file output. Written as uncompressed text. All files must be specified to run.

IBDmix

ibdmix takes the following options:

-h, --help Print the help information and exit.
-g, --genotype The input genotype file produced by generate_gt. Required.
-o, --output The output file. Format is tab-delimited text with columns for individual ID, chromosome, start, end, and LOD score. The regions are half open with [start, end), where the end position is the next position in the input genotype file. Required.
-s, --sample File specifying which samples to consider. One individual per line, must match header in genotype file exactly. If not specified all samples will be utilized (all columns except archaic).
-n, --archaic Name of archaic individual. Must match column name. If not specified, taken as the first column of the genotype file (default output order for generate_gt).
-r, --mask Bed file of masked regions to include in the analysis. If not specified all sites are considered.
-d, --LOD-threshold Threshold value of log(odds) for emitting regions. Default: 3.0
-m, --minor-allele-count-threshold Threshold count for filtering minor alleles. For each site, the number of occurrences in the sample population is tabulated. If fewer than this count are found, the LOD score for the site is taken as 0. Default: 1
-a, --archaic-error Allele error rate for archaic DNA. Default: 0.01
-e, --modern-error-max Maximum allele error rate for modern humans. Default: 0.0025
-c, --modern-error-proportion Ratio between allele error rate and minor allele frequency for modern humans. The error rate for a given site is taken as the minimum of the -e option and the product of this value with the allele frequency. Default: 2
-i, --inclusive-end A switch to change the default behavior of the end position. By default, the reported end will be the next position in the genotype file which produces a decreasing LOD. It corresponds to the region of [start, end). With inclusive end set, the reported end is the last position with increasing LOD, or [start, end].
-t, --more-stats A switch to report additional region-level statistics. Additional columns include:
- sites: total number of sites in genotype file in region
- positive_lods: number of sites with positive LOD scores in the region
- negative_lods: number of sites with negative LOD scores in the region
- mask_and_maf: sites which are both in the masked region and fail MAF cutoff
- in_mask: sites in the mask but pass MAF
- maf_low: sites failing MAF with low frequency
- maf_high: sites failing MAF with high frequency
- rec_2_0: sites failing mask or MAF but still considered due to a genotype of 2 in archaic and 0 in modern
- rec_0_2: same as rec_2_0 but with 0 in archaic and 2 in modern
-w, --write-snps Include positions with positive LOD scores as a comma-separated list.
--write-lods Include LOD scores of positive sites as a comma-separated list. Same order as write-snps output (e.g. zip the two entries to get position/lod values).

Summary.sh

Once a run of ibdmix completes, it is informative to filter the results on a range of LOD values and length cutoffs. It is faster to perform this operation on the ibdmix output than to rerun with different options.

summary.sh takes up to five options in order:

length cutoff The minimum length to emit a region. Use 0 for all.
LOD cutoff The minimum LOD score to emit a region. Use 0 for all.
population The population to label each row.
input Optional, the uncompressed input file.
output Optional, the uncompressed output file. Contains only the regions passing both filters, sorted by ID then start position of the region.

If neither input nor output is specified, standard input and output are utilized. To specify only one value, use - to indicate standard input/output. For example, to filter length greater than 1000 and LOD greater than 5 from ibd_output.txt and generate compressed output.gz

summary.sh 1000 5 population ibd_output.txt - | gzip > output.gz

Snakemake Workflow

The above workflow is automated through snakemake. In the provided snakefiles/Snakefile, all steps from compilation to summary generation are encoded. By default, summary files will be generated but the workflow can be customized by setting options in snakefiles/config.yaml. All files are compressed with gzip.

The easiest way to get started is to set your paths in the config file and run snakemake in the snakefiles directory. If mask statistics are generated, bedtools needs to be installed and included in PATH, otherwise run snakemake with the --use-singularity flag to download and use a docker container instead.

If a cmake module is present, adding the --use-envmodules will activate it prior to compilation. Due to the dependence on envmodules, the snakefile requires snakemake >= 5.10.

Important Configuration Options

A note on wildcards.

Wildcards are specified in snakemake by enclosing them in curly braces. When changing options in the config file, it is important that all wildcards are kept and spelled correctly. For example, say you want to change the summary output, which is currently altai_1kg_{population}_{chrom}.gz. As long as {chrom} and {population} are present, the rest can change. If you want to place each population in a separate directory, this could be changed to {population}/altai_1kg_chr_{chrom}.txt.gz. The output files would look like ASN/altai_1kg_chr_20.txt.gz. However, {pop}/altai_1kg_{chr}.gz will throw an error because snakemake expects to find specific wildcards.

compiler and options: These are utilized to make the executables from cpp source code. The compiler requires C++ 11 standard.
exe_root: Where to place compiled executables.
source_root: The path to the IBDmix directory.
archaic_vcf, modern_vcf: Required inputs
genotype_file, ibd_output: Temporary intermediate files
ibd_summary: Final output. If no summary values are supplied, ibd_output is the final target instead.
sample_file: The input sample files for ibdmix. If removed, all samples will be used and 'ALL' will be the population name.
mask_file: The bed file to use for each chromosome. If removed, no masking will be performed during IBDmix.
IBDmix options: All parameters for performing ibdmix. The genotype, output, mask, and sample are handled automatically. Other options can be supplied here (including archaic sample name, more stats and inclusive end).
IBDmix mask_stats: If set to True and a mask file is provided, additional columns will be determined including the total number of BP masked in each region and the largest masked area within each region. This values can be used for further filtering.
IBDmix summary_lod: List of all LOD values for filter. If sorted output is desired, a value of 0 will keep all regions. Remove this entry to just produce the raw IBD files.
IBDmix summary_length: List of all length values for filter. If sorted output is desired, a value of 0 will keep all regions. All permutations of length and lod are produced.

Useful Snakemake Options

--notemp: Without this option, genotype files and raw IBD calls will be automatically deleted when they are no longer needed by the workflow. If you expect to perform multiple executions utilize this flag to keep those files around.
--delete-temp-output: Once you are done with temp files this will find and delete all specified by the current config.yaml
--cores: Specify the number of threads available to snakemake. With more cores, more jobs can be run simultaneously.
--dryrun: See what rules will run without running any of them.

License

GNU GPLv3

troycomi / IBDmix

IBDmix

Background

Usage

Compilation

Usage Details

Generate Genotype

IBDmix

Summary.sh

Snakemake Workflow

Important Configuration Options

Useful Snakemake Options

License

About

Languages