TITLE | AUTHORS |
---|---|
Unimod | Yixin Zhao; Lingjie Liu; Adam Siepel |
Model-based characterization of the equilibrium dynamics of transcription initiation and promoter-proximal pausing in human cells
In our manuscript, we described our initial model and two types of extensions. One allows pause sites to vary across cells, and the other allows for both varied pause sites and steric hindrance of initiation at steady state. Here, we provide scripts of the implementation, and illustrate how we can use them to estimate initiation rates, pause release rates and landing pad occupancy for both synthetic and experimental data.
Fig. 1 The initial probabilistic model for transcription initiation, promoter-proximal pausing, and elongation
The unified model is implemented in the statistical programming language R, and depends on a couple of packages. One of the easiest ways to install them is via conda.
conda create -n unimod --file environment.yml
Once installed, activate the environment and run the examples within it:
conda activate unimod
Test data can be downloaded from here and are assumed to be placed in the data directory.
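
Before running the examples, it can help to confirm that the test files are where the scripts expect them. The following is a minimal sketch, not part of Unimod: the helper name `check_test_data` is hypothetical, and the file names are simply copied from the example commands in this README.

```shell
# Report which of the expected test files are missing from a data directory.
# File names are copied from the example commands in this README;
# `check_test_data` is a hypothetical helper, not part of Unimod.
check_test_data() (
    data_dir="${1:-../data}"
    missing=0
    for f in \
        k50ksd25kmin17kmax200l1950a1b1z2000zsd1000zmin1500zmax2500t40n20000s33h17_pos.RDS \
        PROseq-K562-vihervaara-control-SE_plus.bw \
        PROseq-K562-vihervaara-control-SE_minus.bw \
        PROseq-K562-vihervaara-treated-SE_plus.bw \
        PROseq-K562-vihervaara-treated-SE_minus.bw \
        granges_for_read_counting.RData \
        scale_factor.csv
    do
        if [ ! -f "$data_dir/$f" ]; then
            echo "missing: $data_dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was found.
    [ "$missing" -eq 0 ]
)
```

Calling `check_test_data ../data` from the directory that holds the scripts lists any missing file on standard error and returns a non-zero status if at least one is absent.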
Usage: ./estimate_rates_simulation.R [options]
Estimate transcription rates based on simulated data
Options:
-h, --help
Show this help message and exit
-r CHARACTER, --rds=CHARACTER
Input file produced by SimPol [default NULL]
-s LOGICAL, --steric=LOGICAL
Infer landing-pad occupancy or not [default FALSE]
-d CHARACTER, --outputDir=CHARACTER
Directory for saving results [default .]
The input data are produced by SimPol, a simulator we developed for simulating the dynamics of RNA polymerase (RNAP) on a DNA template. One of the outputs from SimPol, pos.RDS, records the last 100 steps of the simulation and contains the RNAP positions in every cell. We use this information to sample cells, then sample read counts conditional on the local RNAP frequency. These synthetic read counts are then used to infer the transcription rates. The whole process is run with:
./estimate_rates_simulation.R -r ../data/k50ksd25kmin17kmax200l1950a1b1z2000zsd1000zmin1500zmax2500t40n20000s33h17_pos.RDS -d ../outputs/simulation/pause_escape
The prefix "k50ksd25kmin17kmax200l1950a1b1z2000zsd1000zmin1500zmax2500t40n20000s33h17" of the input file encodes the parameters used for this test data set, which are further explained here. Under the given parameters, we simulated 20,000 cells in total for the equivalent of 40 min (400,000 time slices). We then randomly sampled 5,000 of the 20,000 cells, repeating this 50 times. The output csv file contains the following columns:
- trail, the run number, from 1 to 50
- chi, the $\chi$ estimates
- beta_org, the $\beta$ estimates from the initial model
- beta_adp, the $\beta$ estimates from the adapted model which allows pause sites to vary across cells
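
Because each row of the output corresponds to one of the 50 runs, a quick summary across runs is often the first thing to look at. The sketch below averages one named column with `awk`. It assumes a comma-separated file with a header row and no embedded commas; the helper name `column_mean` is hypothetical, and the name of the output file must be supplied by the user.

```shell
# Print the mean of one named column of a comma-separated file with a header row.
# Assumes no embedded commas; double quotes around header names are stripped.
# `column_mean` is a hypothetical helper, not part of Unimod.
column_mean() {
    awk -F',' -v col="$1" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/"/, "", name)
                if (name == col) idx = i
            }
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        # Skip empty and NA entries, accumulate the rest.
        $idx != "" && $idx != "NA" { sum += $idx; n++ }
        END { if (n > 0) printf "%.6g\n", sum / n }
    ' "$2"
}
```

For example, `column_mean chi path/to/output.csv` prints the average $\chi$ estimate over all runs, where `path/to/output.csv` stands for whichever csv file the script wrote to the output directory.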
Details of the model and the simulation can be found in the Methods section here.
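
The parameter prefix of the input file name is easier to read once it is split into name and value pairs. The following sketch does this with `grep` and `sed`, assuming the prefix is a plain concatenation of lowercase parameter names and integer values (which holds for the test file above); `split_prefix` is a hypothetical helper name.

```shell
# Split a SimPol-style file-name prefix into "name value" pairs, one per line.
# Assumes lowercase parameter names immediately followed by integer values.
# `split_prefix` is a hypothetical helper, not part of Unimod or SimPol.
split_prefix() {
    printf '%s\n' "$1" \
        | grep -oE '[a-z]+[0-9]+' \
        | sed -E 's/^([a-z]+)([0-9]+)$/\1 \2/'
}

split_prefix k50ksd25kmin17kmax200l1950a1b1z2000zsd1000zmin1500zmax2500t40n20000s33h17
```

The pairs `t 40` and `n 20000` in the output appear to correspond to the 40 min of simulated time and the 20,000 cells mentioned above; the meaning of the remaining names is given in the linked parameter description.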
We can also use the same R script to infer landing-pad occupancy:
./estimate_rates_simulation.R -r ../data/k50ksd25kmin17kmax200l1950a1b1z2000zsd1000zmin1500zmax2500t40n20000s33h17_pos.RDS -s T -d ../outputs/simulation/steric_hindrance
In this case, a fifth column, phi, will appear in the result. These are the $\phi$ estimates of landing-pad occupancy.
In this section, we will demonstrate how to use the model to estimate transcription rates on real data.
Usage: ./estimate_rates_experiment.R [options]
Estimate transcription rates based on experimental data
Options:
-h, --help
Show this help message and exit
-v, --verbose
Print messages [default]
-q, --quietly
Print no messages
--bwp=CHARACTER
Input bigwig file from the plus strand [default NULL]
--bwm=CHARACTER
Input bigwig file from the minus strand [default NULL]
--grng=CHARACTER
Gene regions for read counting [default NULL]
-s LOGICAL, --steric=LOGICAL
Infer landing-pad occupancy or not [default FALSE]
--scale=CHARACTER
A file provides scaling factors for omega [default NULL]
--type=CHARACTER
Scale omega based on [L]ow or [H]igh initiation rate [default L]
-d CHARACTER, --outputDir=CHARACTER
Directory for saving results [default .]
As in the simulation section, users need to download the test data first. The plus.bw and minus.bw files are bigWig files recording PRO-seq read counts for the control samples from the heat shock dataset, which can be generated via the proseq2.0 pipeline. In addition, we need pause and gene body regions for every gene in order to count reads. The file "granges_for_read_counting.RData" stores these regions for analysis in K562 cells. Essentially, we used CoPRO-cap to precisely determine active TSSs, then used these refined TSSs to generate regions for read counting. Further details can be found in the "Analysis of Real Data" section here.
./estimate_rates_experiment.R --bwp ../data/PROseq-K562-vihervaara-control-SE_plus.bw --bwm ../data/PROseq-K562-vihervaara-control-SE_minus.bw --grng ../data/granges_for_read_counting.RData -d ../outputs/experiment/PROseq-K562-vihervaara-control-SE/pause_escape
After running this script, a csv file with the following columns will be generated:
- gene_id, the Ensembl gene ID
- chi, the $\chi$ estimates
- beta_org, the $\beta$ estimates from the initial model
- beta_adp, the $\beta$ estimates from the adapted model which allows pause sites to vary across cells
- fk_mean, the mean position of pause sites
- fk_var, the variance of the position of pause sites
We can perform the same analysis on the heat shock treated samples:
./estimate_rates_experiment.R --bwp ../data/PROseq-K562-vihervaara-treated-SE_plus.bw --bwm ../data/PROseq-K562-vihervaara-treated-SE_minus.bw --grng ../data/granges_for_read_counting.RData -d ../outputs/experiment/PROseq-K562-vihervaara-treated-SE/pause_escape
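
Since the control and treated commands differ only in the sample name, they can be generated from a single template. The sketch below prints the two commands rather than running them; the paths and flags are those used in this README, while `build_cmd` is a hypothetical helper name.

```shell
# Print (not run) the estimation command for one condition of the heat shock data set.
# Paths and flags follow the examples in this README;
# `build_cmd` is a hypothetical helper, not part of Unimod.
build_cmd() (
    sample="PROseq-K562-vihervaara-$1-SE"
    printf '%s\n' "./estimate_rates_experiment.R --bwp ../data/${sample}_plus.bw --bwm ../data/${sample}_minus.bw --grng ../data/granges_for_read_counting.RData -d ../outputs/experiment/${sample}/pause_escape"
)

for condition in control treated; do
    build_cmd "$condition"
done
```

Printing the commands first makes it easy to inspect them; piping the loop's output to `sh` would then execute them one after the other.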
We can also use the same R script to infer landing-pad occupancy:
./estimate_rates_experiment.R --bwp ../data/PROseq-K562-vihervaara-control-SE_plus.bw --bwm ../data/PROseq-K562-vihervaara-control-SE_minus.bw --grng ../data/granges_for_read_counting.RData -s T --scale ../data/scale_factor.csv -d ../outputs/experiment/PROseq-K562-vihervaara-control-SE/steric_hindrance
In addition to the columns 1 to 6 above, four more columns will be included:
- phi, the $\phi$ estimates referring to the occupancy
- omega_zeta, where $\omega$ is the effective initiation rate
- beta_zeta, where $\beta$ is the pause escape rate
- alpha_zeta, where $\alpha$ is the potential initiation rate
Note that the last three columns are multiplied by the elongation rate, $\zeta$.
Zhao, Y., Liu, L. & Siepel, A. Model-based characterization of the equilibrium dynamics of transcription initiation and promoter-proximal pausing in human cells. 2022.10.19.512929 Preprint at bioRxiv (2022).