phemulator

Background

Phemulator is a tool for simulating phenotypes on top of real-world genotyping or sequencing data. Through a number of parameters, the user controls the genotype-phenotype relationship; that is, the tool simulates a desired genetic architecture (currently limited to additive effects) using real population data. Phemulator was developed by Marcin Kierczak, National Bioinformatics Infrastructure, Sweden (NBIS), in a partner project with the research group led by Prof. Åsa Johansson, IGP, Uppsala University, Sweden.

Motivation

We wanted a tool that would let us simulate phenotypes from existing genomic data and use them to evaluate tools that are widely used for genome-wide association studies (GWAS): classical tools that test the effects of single common variants, as well as so-called burden tests and kernel association tests used to discover rare variants associated with phenotypes. This need emerged in our previous work, e.g.:

Kierczak, M., Rafati, N., Höglund, J. et al. Contribution of rare whole-genome sequencing variants to plasma protein levels and the missing heritability. Nat Commun 13, 2532 (2022). https://doi.org/10.1038/s41467-022-30208-8

What can Phemulator do for me?

You need your genotypes for a number of individuals in bgen-1.2 8-bit format.
You need a number of regions, e.g. CDS, defined in a BED file.
Next, you can run our containerized Shiny app, which helps you select simulation parameters.
You decide how many common and how many rare alleles in a region (e.g. a CDS) contribute to the phenotype, and by how much (effect size).
Based on this, the tool scans your genome region by region and, where a region contains enough common and rare variants, simulates a phenotype.
If you run a number of simulations, all parameters and outputs are saved as JSON/CSV.
You can use your simulated phenotypes to evaluate different association models using, e.g., the excellent rvtests tool.
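The region-by-region scan described above can be sketched as follows. This is a minimal, dependency-free illustration and not Phemulator's actual code: the function names (parse_bed, minor_allele_freq, eligible_regions), the in-memory variant layout, the toy data, and the choice to count a variant as common when its MAF is at or above the threshold are all assumptions. The real tool reads genotypes from a BGEN file.

```python
import io


def parse_bed(handle):
    """Yield (chrom, start, end) tuples; BED intervals are 0-based, half-open."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def minor_allele_freq(dosages):
    """MAF from per-individual alternate-allele counts (0, 1 or 2), diploid."""
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1 - alt_freq)


def eligible_regions(regions, variants, threshold=0.05, num_common=1, num_rare=2):
    """Keep regions with at least num_common common and num_rare rare variants.

    `variants` is a list of (chrom, pos, dosages) with 1-based positions.
    """
    kept = []
    for chrom, start, end in regions:
        common = rare = 0
        for v_chrom, pos, dosages in variants:
            # A 1-based position lies in a BED interval when start < pos <= end.
            if v_chrom != chrom or not (start < pos <= end):
                continue
            maf = minor_allele_freq(dosages)
            if maf == 0:
                continue  # monomorphic site, carries no signal
            if maf >= threshold:  # assumed convention for "common"
                common += 1
            else:
                rare += 1
        if common >= num_common and rare >= num_rare:
            kept.append((chrom, start, end))
    return kept


# Toy data: 20 individuals, two regions on chromosome 22.
bed_text = "22\t100\t200\n22\t300\t400\n"
regions = list(parse_bed(io.StringIO(bed_text)))
variants = [
    ("22", 150, [1] * 10 + [0] * 10),  # MAF 0.25, common
    ("22", 160, [1] + [0] * 19),       # MAF 0.025, rare
    ("22", 170, [0] * 19 + [1]),       # MAF 0.025, rare
    ("22", 350, [2] * 5 + [0] * 15),   # MAF 0.25, common
    ("22", 360, [1] + [0] * 19),       # MAF 0.025, rare (only one in this region)
]
eligible = eligible_regions(regions, variants)
print(eligible)
```

With the defaults used here (one common and two rare variants required), only the first toy region qualifies, because the second contains a single rare variant.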

Containerized Shiny app

You need to have Docker installed on your machine. Then, in a terminal, run:

docker run --rm -p 8787:8787 nbisweden/phemulator:v1.0.0

Then open a browser and go to http://0.0.0.0:8787 (if that address does not load, try http://localhost:8787). You can now explore the app and choose your simulation parameters.

Simulation Tool User Documentation (phemulator.py)

Introduction

The Phemulator Simulation Tool is a Python script designed for simulating phenotypes based on genetic variants and user-defined settings. This documentation will guide you through the usage of the tool, its features, and how to customize the simulation.

Installation
Usage
Command-Line Arguments
Settings Configuration
Output

Installation

Before you can use the tool, you need to make sure you have the required dependencies installed:

Python (3.6 or higher)
PyBGEN
NumPy
Pandas

You can install the required Python packages using pip:

pip install pybgen numpy pandas

Usage

To use the Phemulator Simulation Tool, follow these steps:

Open your terminal or command prompt.
Navigate to the directory containing the phemulator.py script.
Run the script with the desired command-line arguments (explained below).

For example:

python phemulator.py --name my_simulation --bed_regions_path my_regions.bed --bgen_file_path my_data.bgen

The tool will start the simulation based on the provided settings and input data.
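When several configurations need to be run, it can be convenient to assemble the command line programmatically. The sketch below only builds and prints the command; it uses flags documented in the Command-Line Arguments section, while the helper name build_command and the parameter values are hypothetical.

```python
import shlex


def build_command(script="phemulator.py", **options):
    """Turn keyword options into an argv list, e.g. threshold=0.01 -> --threshold 0.01."""
    cmd = ["python", script]
    for flag, value in options.items():
        cmd += ["--" + flag, str(value)]
    return cmd


cmd = build_command(
    name="my_simulation",
    bed_regions_path="my_regions.bed",
    bgen_file_path="my_data.bgen",
    threshold=0.01,
    num_sim=10,
    out="./out",
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))

# To actually launch the simulation, pass the list to subprocess:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than a single string, avoids quoting problems with file paths that contain spaces.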

Command-Line Arguments

The Simulation Tool accepts several command-line arguments to customize the simulation. Here are the available options:

--name: Name of the simulation (default: an autogenerated name).
--bed_regions_path: Path to the BED file containing region data (default: "cds_test.bed").
--bgen_file_path: Path to the BGEN file (default: "Rum_recoded_repos_norel_rnd3000_chr22.bgen").
--threshold: Threshold for MAF (Minor Allele Frequency) distinguishing rare vs. common variants (default: 0.05).
--my_chr: Chromosome of interest (default: "1").
--num_common: Number of common variants for simulation (default: 1).
--num_rare: Number of rare variants for simulation (default: 2).
--rare_eff_mean: Mean effect size for rare variants (default: 2).
--rare_eff_std_dev: Standard deviation of effect size for rare variants (default: 0.3).
--common_eff_mean: Mean effect size for common variants (default: 0.3).
--common_eff_std_dev: Standard deviation of effect size for common variants (default: 0.01).
--err_mean: Mean for the error term in phenotype simulation (default: 0.05).
--err_sd: Standard deviation for the error term in phenotype simulation (default: 0.01).
--num_sim: Number of simulations to perform (default: 1).
--out: Path to where the output should be saved (default: "./out").
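To make the roles of the effect-size and error parameters concrete, the sketch below simulates a phenotype under a purely additive model: each causal variant gets an effect drawn from a normal distribution, and an individual's phenotype is the sum of dosage times effect over all causal variants, plus a normally distributed error term. The exact formula used by phemulator.py is not stated in this document, so treat this as an assumption; the function name simulate_phenotype and the toy genotypes are hypothetical.

```python
import random


def simulate_phenotype(common, rare, rng,
                       common_eff_mean=0.3, common_eff_std_dev=0.01,
                       rare_eff_mean=2.0, rare_eff_std_dev=0.3,
                       err_mean=0.05, err_sd=0.01):
    """Additive phenotype from causal variants.

    `common` and `rare` are lists of variants; each variant is a list of
    per-individual dosages (0, 1 or 2). Returns (phenotypes, effects).
    """
    variants = common + rare
    if not variants:
        raise ValueError("at least one causal variant is required")
    # One effect size per causal variant, drawn from the matching distribution.
    effects = [rng.gauss(common_eff_mean, common_eff_std_dev) for _ in common]
    effects += [rng.gauss(rare_eff_mean, rare_eff_std_dev) for _ in rare]

    n_individuals = len(variants[0])
    phenotypes = []
    for i in range(n_individuals):
        genetic = sum(v[i] * beta for v, beta in zip(variants, effects))
        phenotypes.append(genetic + rng.gauss(err_mean, err_sd))
    return phenotypes, effects


# Toy genotypes for 6 individuals: 1 common and 2 rare causal variants.
common = [[0, 1, 2, 0, 1, 0]]
rare = [[0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1]]

rng = random.Random(42)  # fixed seed for reproducibility
phenotypes, effects = simulate_phenotype(common, rare, rng)
for i, value in enumerate(phenotypes):
    print(f"individual {i}: {value:.3f}")
```

In this toy example the first individual carries no causal alleles, so their phenotype consists of the error term alone, while carriers of a rare allele are shifted by roughly the rare-variant mean effect.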

Settings Configuration

Settings can also be configured directly within the script using the SettingsSingleton class. You can modify the default settings in the init_settings method of the SettingsSingleton class.

# Example of modifying settings (S is assumed to be the SettingsSingleton instance)
S.threshold = 0.1
S.num_common = 2
S.num_rare = 3
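For readers unfamiliar with the pattern, here is a minimal sketch of how a settings singleton with an init_settings method could behave. It is a hypothetical stand-in written for illustration, not the class from phemulator.py, whose implementation may differ; only three of the settings are included, with defaults copied from the argument list above.

```python
class SettingsSingleton:
    """Hypothetical stand-in: one shared settings object per process."""

    _instance = None

    def __new__(cls):
        # Create the shared instance on first use, then always return it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.init_settings()
        return cls._instance

    def init_settings(self):
        # Defaults taken from the documented command-line defaults.
        self.threshold = 0.05
        self.num_common = 1
        self.num_rare = 2


S = SettingsSingleton()
S.threshold = 0.1
S.num_rare = 3

# Any later lookup sees the same object and therefore the modified values.
other = SettingsSingleton()
print(other is S, other.threshold, other.num_common, other.num_rare)
```

Because every call returns the same object, a change made in one place is visible everywhere else in the process, which is why settings should be modified before the simulation starts.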

Output

The tool will generate output files including:

JSON files containing region information and variants data.
A CSV file containing simulated phenotype data.
You can find these files in the specified output directory.

Have a look at our example inside the data/ directory and follow the README to better understand the output.
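As a starting point for downstream analysis, the sketch below collects every JSON and CSV file from an output directory using only the standard library. The helper name load_outputs is hypothetical, and the demo writes placeholder files into a temporary directory: their names, keys, and columns are invented for the demonstration and do not reflect Phemulator's actual output schema.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_outputs(out_dir):
    """Return ({json_name: parsed_json}, {csv_name: list_of_row_dicts})."""
    out_dir = Path(out_dir)
    json_data = {
        path.name: json.loads(path.read_text())
        for path in sorted(out_dir.glob("*.json"))
    }
    csv_data = {}
    for path in sorted(out_dir.glob("*.csv")):
        with path.open(newline="") as handle:
            csv_data[path.name] = list(csv.DictReader(handle))
    return json_data, csv_data


# Demo on placeholder files (invented content, not the real output format).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "example.json").write_text(json.dumps({"placeholder": True}))
    (tmp_dir / "example.csv").write_text("id,phenotype\nind1,0.5\n")
    json_data, csv_data = load_outputs(tmp_dir)

print(json_data)
print(csv_data)
```

For real runs, point load_outputs at the directory given with --out. Note that csv.DictReader returns all values as strings, so numeric columns need converting before analysis.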

That's it! You can use the Phemulator Simulation Tool to generate simulated phenotypes based on your genetic data and custom settings.
