cell-type-annotation combinatorial-regulation tf-combination tf-module

cRegulon

Introduction

This is cRegulon software: an optimization model an optimization model for modeling combinatorial regulation from single-cell multi-omics provides units underpinning cell type landscape.

Requirements:

Homer in your environment PATH
Python >=3.0 and packages below:
numpy-groupies==0.9.15
numpy==1.26.3
scipy==1.12.0
pandas==1.4.3
ismember==1.0.2
mpmath==1.2.1
pybedtools==0.9.0
scikit-learn==1.4.2

Installing cRegulon with the following command:

wget https://github.com/SUwonglab/cRegulon/archive/master.zip
unzip master.zip
cd cRegulon-master
wget -O cRegulonData.tar.gz https://figshare.com/ndownloader/files/47503895
tar -xzvf cRegulonData.tar.gz

We also provide an example for cRegulon analysis, which can be downloaded from the following command:

wget -O cRegulonExample.tar.gz https://figshare.com/ndownloader/files/47503904
tar -xzvf cRegulonExample.tar.gz

After downloading, there will be 5 folders that store raw data, intermediate results, and final results of example:
./example_data/
./CSI/
./PseudoBulk/
./Networks/
./Results/

Tutorial of cRegulon analysis

Input single cell data of cRegulon

The typic input file of scRNA-seq data is a gene by cell count matrix:

scRNA	RNACellID1	RNACellID2	RNACellID3
Gene1	5	0	3
Gene2	0	2	0
Gene3	1	0	0

The typic input file of scATAC-seq data is a peak by cell count matrix:

scATAC	ATACellID1	ATACellID2	ATACellID3	ATACellID4
Peak1	1	0	1	0
Peak2	0	1	0	1
Peak3	1	0	0	0

The peaks are in the format of "chr_start_end".

In practice, the file of single cell dataset can be the 10x folder format (example dataset: ./example_data/RA/) or a matrix txt format (example dataset: ./example_data/CL/). In this tutorial, we use RA dataset for demonstration.

Input cell type meta data

The typic cell type meta file (./example_data/RA/RA_scRNA_Cluster.txt) of scRNA-seq data is as follows:

RNACellID1	RAC1
RNACellID2	RAC2
RNACellID3	RAC3

The typic cell type meta file (./example_data/RA/RA_scATAC_Cluster.txt) of scATAC-seq data is as follows:

ATACellID1	RAC2
ATACellID2	RAC3
ATACellID3	RAC1
ATACellID4	RAC2

cRegulon has three main modes, which are three steps of cRegulon model

python3 cRegulon.py {prep,grn,model} ...

prep: Preprocessing mode
grn: GRN mode
model: Model mode

Step 1: preprocessing and pseudo bulk (prep mode)

We run the following script to create pseudo bulk RNA-seq and ATAC-seq data for each cell cluster:

#cRegulon.py prep [-h] [--name NAME] --rna RNA --rna_meta RNA_META --atac ATAC --atac_meta ATAC_META --species SPECIES (human or mouse)
python3 cRegulon.py prep --name RA --rna ./example_data/RA/scRNA/ --rna_meta ./example_data/RA/RA_scRNA_Cluster.txt --atac ./example_data/RA/scATAC/ --atac_meta ./example_data/RA/RA_scATAC_Cluster.txt -g mouse

This process will produce pseudo bulk files (*PS_RNA.txt, *PS_ATAC.txt, *CellType.txt) for each cell cluster in the PseudoBulk folder.

Step 2: GRN construction (grn mode)

We run the following script to construct regulatory network for each cell cluster (current we support hg38 and mm10):

#python3 cRegulon.py grn [-h] --name NAME --celltype CELLTYPE --genome GENOME --cores CORES
for c in `cat ./PseudoBulk/RA_CellType.txt`
do
    python3 cRegulon.py grn -n RA -ct ${c} -g mm10 -p 20
done

This process will produce GRN files (*network.txt, TFTG_regulationScore.txt, TFName.txt, TGName.txt) for each cell cluster in the Networks folder (The GRN construction is independent for each cell cluster, we can do it parallelly).

Step 3: Running cRegulon model (model mode)

We run the following script of cRegulon model: If we already know or have some expection of the cRegulon number, we can provide this number to cRegulon. For example, we have 9 cRegulons for RA, then we run this script:

#python3 cRegulon.py model [-h] --name NAME --module_number MODULE_NUMBER
python3 cRegulon.py model -n RA -mn 9

If we don't know the cRegulon number, we can provide a range of numbers and cRegulon will use elbow rule to select an optimal number. For example, we guess there may be 4-20 cRegulons for RA, then we run this script:

#python3 cRegulon.py model [-h] --name NAME --module_max MODULE_MAX --module_min MODULE_MIN
python3 cRegulon.py model -n RA -mmin 4 -mmax 20

This will output a folder in "Results" with name you specify: "./RA/"

TF combinatorial effects in each cRegulon: ./Results/RA/X.txt
Association matrix between cell clusters and cRegulons: ./Results/RA/A.txt
TF module of each cRegulon: ./Results/RA/*TFModule.txt
Annotation of each cell cluster with cRegulons: ./Results/RA/Annotation/*subnetwork.txt

Annotation mode of cRegulon (annot mode)

If you only have scRNA-seq data, We run the following script to annotate cells with our pre-computed cRegulons from atlas-level dataset:

#python3 cRegulon.py annot [-h] --name NAME --path_rna PATH_RNA --module_number MODULE_NUMBER
python3 cRegulon.py annot --name PBMC --path_rna ./example_data/PBMC/PBMC_scRNA.txt --module_number 12

Citation:

If you use cRegulon software or cRegulon associated concepts, please cite:

Zhanying Feng, et al. Modeling combinatorial regulation from single-cell multi-omics provides units underpinning cell type landscape. 2024.

About

cRegulon is an optimization model to identify combinatorial regulon from single cell expression and chromatin accessibility data.

cell-type-annotation combinatorial-regulation tf-combination tf-module

GNU General Public License v3.0

Languages

Language:Python 100.0%