TITLE	AUTHORS
Unimod	Yixin Zhao; Lingjie Liu; Adam Siepel

Model-based characterization of the equilibrium dynamics of transcription initiation and promoter-proximal pausing in human cells

Overview

In our manuscript, we described our initial model and two extensions. The first allows pause sites to vary across cells, and the second allows for both variable pause sites and steric hindrance of initiation at steady state. Here, we provide scripts that implement these models and illustrate how to use them to estimate initiation rates, pause-release rates, and landing-pad occupancy from both synthetic and experimental data.

Fig. 1 The initial probabilistic model for transcription initiation, promoter-proximal pausing, and elongation

Dependencies

The unified model is implemented in the statistical programming language R and depends on several packages, which are specified in environment.yml. One of the easiest ways to install them is via conda.

conda create -n unimod --file environment.yml

Once the environment has been created, activate it and run the examples within it:

conda activate unimod

Test data can be downloaded from here. The examples below assume it has been placed in the data directory.

Examples

Estimate rates based on simulated data

Usage: ./estimate_rates_simulation.R [options]
Estimate transcription rates based on simulated data

Options:
	-h, --help
		Show this help message and exit

	-r CHARACTER, --rds=CHARACTER
		Input file produced by SimPol [default NULL]

	-s LOGICAL, --steric=LOGICAL
		Infer landing-pad occupancy or not [default FALSE]

	-d CHARACTER, --outputDir=CHARACTER
		Directory for saving results [default .]

The input data is produced by SimPol, a simulator we developed for simulating the dynamics of RNA polymerase (RNAP) on a DNA template. One of the outputs from SimPol, pos.RDS, records the last 100 steps of the simulation and contains the RNAP positions in every cell. We use this information to sample cells, then sample read counts conditional on the local RNAP frequency. These synthetic read counts are then used to infer the transcription rates. The whole process is run with:

./estimate_rates_simulation.R -r ../data/k50ksd25kmin17kmax200l1950a1b1z2000zsd1000zmin1500zmax2500t40n20000s33h17_pos.RDS -d ../outputs/simulation/pause_escape
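The sampling idea described above (sample cells, then sample read counts conditional on local RNAP frequency) can be sketched in a few lines. This is only an illustration written in Python, not the logic of estimate_rates_simulation.R. The toy `toy_positions` data, the function names, the Poisson read model, and the `reads_per_rnap` parameter are all assumptions made for this example.

```python
import math
import random
from collections import Counter

def sample_poisson(rng, lam):
    """Draw one Poisson variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def synthetic_read_counts(positions, n_cells, reads_per_rnap, rng):
    """positions: one list of RNAP positions per cell (hypothetical layout).

    Sample n_cells cells without replacement, tally how many sampled cells
    have an RNAP at each position, then draw a Poisson read count whose mean
    is proportional to that frequency (an assumed read model).
    """
    sampled = rng.sample(positions, n_cells)
    freq = Counter(pos for cell in sampled for pos in cell)
    return {pos: sample_poisson(rng, reads_per_rnap * n)
            for pos, n in sorted(freq.items())}

rng = random.Random(0)
# Toy stand-in for pos.RDS: 200 cells, each with zero to three RNAPs at positions 1-100
toy_positions = [[rng.randint(1, 100) for _ in range(rng.randint(0, 3))]
                 for _ in range(200)]
counts = synthetic_read_counts(toy_positions, n_cells=50, reads_per_rnap=0.5, rng=rng)
print(sum(counts.values()), "synthetic reads over", len(counts), "positions")
```

Seeding the generator makes the toy sample reproducible. In the actual script, the number of cells sampled and the read model are determined by the implementation, not by this sketch.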

The prefix "k50ksd25kmin17kmax200l1950a1b1z2000zsd1000zmin1500zmax2500t40n20000s33h17" of the input file encodes the parameters used for this test data set, which are further explained here. Under these parameters, we simulated 20,000 cells in total for the equivalent of 40 min (400,000 time slices). We then randomly sampled 5,000 of the 20,000 cells, repeating the sampling 50 times (one sample per run). The output CSV file contains the following columns:

trail, the run number, from 1 to 50
chi, the $\chi$ estimates
beta_org, the $\beta$ estimates from the initial model
beta_adp, the $\beta$ estimates from the adapted model which allows pause sites to vary across cells

Details of the model and the simulation can be found in the Methods section here.
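For bookkeeping, the file-name prefix of the test data set can be split into code/value pairs. The sketch below is a Python illustration under the assumption that the prefix is a plain sequence of lowercase codes, each followed by a number. The helper name `parse_simpol_prefix` is hypothetical, and the meaning of each code should be taken from the SimPol documentation. The only interpretation used here is that `n` and `t` match the 20,000 cells and 40 min stated above.

```python
import re

def parse_simpol_prefix(prefix):
    """Split a prefix such as 'k50ksd25...' into {code: numeric value} pairs."""
    pairs = re.findall(r"([a-z]+)(\d+(?:\.\d+)?)", prefix)
    return {code: float(value) if "." in value else int(value)
            for code, value in pairs}

prefix = "k50ksd25kmin17kmax200l1950a1b1z2000zsd1000zmin1500zmax2500t40n20000s33h17"
params = parse_simpol_prefix(prefix)
print(params["n"], params["t"])  # 20000 40

# 400,000 time slices over 40 simulated minutes
slices_per_minute = 400_000 / params["t"]
print(slices_per_minute)  # 10000.0
```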

We can also use the same R script to infer landing-pad occupancy:

./estimate_rates_simulation.R -r ../data/k50ksd25kmin17kmax200l1950a1b1z2000zsd1000zmin1500zmax2500t40n20000s33h17_pos.RDS -s T -d ../outputs/simulation/steric_hindrance

In this case, the result contains a fifth column, phi, which holds the $\phi$ estimates of landing-pad occupancy.
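Because each row of the simulation output corresponds to one resampling run, a natural next step is to summarize every estimate across runs. The following Python sketch does this for a small in-memory table. The numbers are invented for illustration and are not results from the model, and the `summarize` helper is hypothetical. To use it on a real result, open the CSV file written to the output directory instead.

```python
import csv
import io
import statistics

# Made-up excerpt in the column layout described above (values are NOT real results)
example_csv = """trail,chi,beta_org,beta_adp,phi
1,0.10,0.50,0.40,0.30
2,0.12,0.54,0.42,0.34
3,0.11,0.52,0.44,0.32
"""

def summarize(handle, id_column="trail"):
    """Return {column: (mean, sample standard deviation)} across runs."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    summary = {}
    for col in reader.fieldnames:
        if col == id_column:
            continue
        values = [float(row[col]) for row in rows]
        summary[col] = (statistics.mean(values), statistics.stdev(values))
    return summary

summary = summarize(io.StringIO(example_csv))
for col, (mean, sd) in summary.items():
    print(f"{col}: mean={mean:.3f}, sd={sd:.3f}")
```

The helper reads whichever estimate columns are present, so it works with or without the phi column.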

Estimate rates based on experimental data

In this section, we will demonstrate how to use the model to estimate transcription rates on real data.

Usage: ./estimate_rates_experiment.R [options]
Estimate transcription rates based on experimental data

Options:
	-h, --help
		Show this help message and exit

	-v, --verbose
		Print messages [default]

	-q, --quietly
		Print no messages

	--bwp=CHARACTER
		Input bigwig file from the plus strand [default NULL]

	--bwm=CHARACTER
		Input bigwig file from the minus strand [default NULL]

	--grng=CHARACTER
		Gene regions for read counting [default NULL]

	-s LOGICAL, --steric=LOGICAL
		Infer landing-pad occupancy or not [default FALSE]

	--scale=CHARACTER
		A file provides scaling factors for omega [default NULL]

	--type=CHARACTER
		Scale omega based on [L]ow or [H]igh initiation rate [default L]

	-d CHARACTER, --outputDir=CHARACTER
		Directory for saving results [default .]

As in the simulation section, users need to download the test data first. The files ending in plus.bw and minus.bw are bigWig files recording PRO-seq read counts for the control samples from the heat shock dataset; such files can be generated with the proseq2.0 pipeline. Read counting also requires the pause and gene body regions of every gene. The file "granges_for_read_counting.RData" stores these regions for analysis in K562 cells. In brief, we used CoPRO-cap to precisely determine active TSSs, then used these refined TSSs to generate the regions for read counting. Further details can be found in the "Analysis of Real Data" section here.

./estimate_rates_experiment.R --bwp ../data/PROseq-K562-vihervaara-control-SE_plus.bw --bwm ../data/PROseq-K562-vihervaara-control-SE_minus.bw --grng ../data/granges_for_read_counting.RData -d ../outputs/experiment/PROseq-K562-vihervaara-control-SE/pause_escape

After running this script, a CSV file with the following columns is generated:

gene_id, Ensembl gene ID
chi, the $\chi$ estimates
beta_org, the $\beta$ estimates from the initial model
beta_adp, the $\beta$ estimates from the adapted model which allows pause sites to vary across cells
fk_mean, the mean position of pause sites
fk_var, the variance of the position of pause sites

We can perform the same analysis on the heat shock-treated samples:

./estimate_rates_experiment.R --bwp ../data/PROseq-K562-vihervaara-treated-SE_plus.bw --bwm ../data/PROseq-K562-vihervaara-treated-SE_minus.bw --grng ../data/granges_for_read_counting.RData -d ../outputs/experiment/PROseq-K562-vihervaara-treated-SE/pause_escape
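Once both the control and treated samples have been processed, the two result tables can be compared gene by gene. The Python sketch below joins two tables on gene_id and computes a log2 fold change for one column. It is a hedged illustration only: the gene identifiers and numbers are placeholders, not results, and the helper names are hypothetical. Real tables would be read from the CSV files in the two output directories.

```python
import csv
import io
import math

# Placeholder tables in the column layout described above (values are NOT real results)
control_csv = """gene_id,chi,beta_org,beta_adp,fk_mean,fk_var
GENE_A,0.20,0.50,0.40,50,100
GENE_B,0.10,0.30,0.25,45,90
"""
treated_csv = """gene_id,chi,beta_org,beta_adp,fk_mean,fk_var
GENE_A,0.40,0.50,0.20,52,110
GENE_C,0.30,0.60,0.50,48,95
"""

def read_by_gene(handle):
    """Index the rows of a result table by gene_id."""
    return {row["gene_id"]: row for row in csv.DictReader(handle)}

def log2_fold_changes(control, treated, column):
    """log2(treated / control) for genes found in both tables with positive values."""
    changes = {}
    for gene in sorted(control.keys() & treated.keys()):
        c = float(control[gene][column])
        t = float(treated[gene][column])
        if c > 0 and t > 0:
            changes[gene] = math.log2(t / c)
    return changes

control = read_by_gene(io.StringIO(control_csv))
treated = read_by_gene(io.StringIO(treated_csv))
chi_lfc = log2_fold_changes(control, treated, "chi")
beta_lfc = log2_fold_changes(control, treated, "beta_adp")
print(chi_lfc, beta_lfc)
```

Only genes present in both tables are compared, which is why GENE_B and GENE_C are dropped in this toy example.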

We can also use the same R script to infer landing-pad occupancy $\phi$, but in this case, we will have to scale the effective initiation rate, $\omega$, as we discussed in our manuscript. We provide a file ("scale_factor.csv") containing the scaling factors we precomputed for K562 cells.

./estimate_rates_experiment.R --bwp ../data/PROseq-K562-vihervaara-control-SE_plus.bw --bwm ../data/PROseq-K562-vihervaara-control-SE_minus.bw --grng ../data/granges_for_read_counting.RData -s T --scale ../data/scale_factor.csv -d ../outputs/experiment/PROseq-K562-vihervaara-control-SE/steric_hindrance

In addition to the columns 1 to 6 above, four more columns will be included:

phi, $\phi$ estimates referring to the occupancy
omega_zeta, $\omega$ is the effective initiation rate
beta_zeta, $\beta$ is the pause escape rate
alpha_zeta, $\alpha$ is the potential initiation rate

Note that the last three columns are multiplied by the elongation rate, $\zeta$, which is assumed to be 2,000 bp/min, so they are expressed in absolute units of events per minute.
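The unit conversion in the note above is simple arithmetic, sketched here in Python. The sketch assumes the unscaled estimates are rates expressed relative to $\zeta$ (per bp), so that multiplying by $\zeta$ gives events per minute. The function names and the example value 0.001 are hypothetical. The second helper shows how a reported value would change under a different assumed elongation rate.

```python
ASSUMED_ZETA = 2000.0  # bp/min, the elongation rate assumed in the note above

def to_per_minute(rate_over_zeta, zeta=ASSUMED_ZETA):
    """Convert a rate expressed relative to zeta (per bp) into events per minute."""
    return rate_over_zeta * zeta

def rescale(rate_per_min, new_zeta, old_zeta=ASSUMED_ZETA):
    """Re-express an events-per-minute value under a different assumed zeta."""
    return rate_per_min * new_zeta / old_zeta

beta_per_min = to_per_minute(0.001)           # hypothetical per-bp value, about 2 events/min
beta_at_2500 = rescale(beta_per_min, 2500.0)  # about 2.5 events/min if zeta were 2,500 bp/min
print(beta_per_min, beta_at_2500)
```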

Citation

Zhao, Y., Liu, L. & Siepel, A. Model-based characterization of the equilibrium dynamics of transcription initiation and promoter-proximal pausing in human cells. 2022.10.19.512929 Preprint at bioRxiv (2022).
