Lollipop is a machine-learning-based framework for predicting the CTCF-mediated interactome by integrating genetic, epigenetic and gene expression data. In our paper, Predicting CTCF-mediated chromatin interactions by integrating genomic and epigenomic features (Kai et al., 2018), it was used for:
- Creating positive and negative training data.
- Training a model that distinguishes positive loops from negative loops.
- Applying the trained model to a cell-type of interest to make de novo predictions of CTCF-mediated loops.
Lollipop requires the following packages:
- Numpy
http://www.numpy.org
- Pandas
http://pandas.pydata.org
- Scikit-learn
http://scikit-learn.org/stable/
- HTSeq
https://htseq.readthedocs.io/en/release_0.9.1/
We recommend using the Anaconda Python distribution to install the above packages.
If I want to make predictions in a cell-type of interest, do I have to have all the features used here?
We used 77 features in total to predict CTCF loops in the three cell types. However, to make predictions in a cell type of interest you do not need all of these features, because some of them may be unavailable. In that case, train a model on one of the three cell types using only the selected (i.e. available) features, and apply this model to the cell type of interest.
A complete list of the genomic and epigenomic data used can be found in the signal table in data/example_signal_table.txt. Some notes about data format:
- CTCF motifs and underlying sequence conservation. A prepared, ready-to-use file can be downloaded from data.
- ChIP-seq data sets. Both the sequencing files and peaks are in BED format (only the first 3 columns will be used).
- Gene expression file format. The format is listed below; please keep the file header exactly as shown.
| gene | chrom | promoter_start | promoter_end | expression |
|---|---|---|---|---|
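Because the header has to match exactly, it can help to check a file before using it. The following is a minimal sketch of such a check, not part of Lollipop itself. The function name `check_expression_header`, the tab delimiter and the example record are assumptions made for illustration.

```python
import csv
import io

# Header that the gene expression file is expected to carry.
EXPECTED_HEADER = ["gene", "chrom", "promoter_start", "promoter_end", "expression"]


def check_expression_header(handle, delimiter="\t"):
    """Return True if the first row of the file matches the expected header.

    The tab delimiter is an assumption; adjust it to match your file.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])
    return [name.strip() for name in header] == EXPECTED_HEADER


# Illustrative in-memory file with a made-up gene record.
example = io.StringIO(
    "gene\tchrom\tpromoter_start\tpromoter_end\texpression\n"
    "GENE_A\tchr1\t1000\t3000\t12.5\n"
)
header_ok = check_expression_header(example)
print("header matches:", header_ok)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.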
Pre-generated training data used in the paper can be downloaded here. The data format is:
| chrom | start1 | start2 | response | ...features... |
|---|---|---|---|---|
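A table in this layout can be split into loop coordinates, a feature matrix and a response vector with Pandas. The sketch below assumes a tab-delimited file with a header row; the helper name `split_training_data` and the feature names `feature_a` and `feature_b` are placeholders, not names used by Lollipop.

```python
import io

import pandas as pd

# Columns that identify a loop rather than describe it.
ID_COLUMNS = ["chrom", "start1", "start2"]


def split_training_data(handle, sep="\t"):
    """Split a training table into coordinates, features and response.

    Assumes a header row and a tab delimiter (both are assumptions).
    """
    table = pd.read_csv(handle, sep=sep)
    coords = table[ID_COLUMNS]
    response = table["response"]
    # Everything that is neither an identifier nor the response is a feature.
    features = table.drop(columns=ID_COLUMNS + ["response"])
    return coords, features, response


# Illustrative in-memory table with made-up values.
example = io.StringIO(
    "chrom\tstart1\tstart2\tresponse\tfeature_a\tfeature_b\n"
    "chr1\t1000\t50000\t1\t0.8\t3.0\n"
    "chr1\t2000\t90000\t0\t0.1\t1.5\n"
)
coords, X, y = split_training_data(example)
print(X.columns.tolist(), y.tolist())
```

To train on a subset of available features, select the corresponding columns of `X` before fitting a model.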
One can also generate training data for any cell type of interest, as long as experimental data for CTCF-mediated loops, such as CTCF ChIA-PET or HiChIP data, are available. This takes two steps:
Usage:
python prepare_training_interactions.py -p $CTCF_peak -a $CTCF_ChIA-PET_interactions -c $CTCF_HiC_interactions -o $training_interactions
Parameters:
-p $CTCF_peak:
CTCF peak file in BED format.
-a $CTCF_ChIA-PET_interactions:
CTCF-mediated interactions identified by ChIA-PET or other methods. The file format is `chrom1 start1 end1 chrom2 start2 end2 IAB FDR strand1 strand2`, where `IAB` is the number of PETs connecting the anchors and `FDR` is the statistical significance.
-c $CTCF_HiC_interactions:
CTCF-mediated interactions identified by Hi-C. The file format is `chrom start1 start2 response length`. We use this file to make sure that the negative loops are not real loops identified by methods other than ChIA-PET. If it is not available, you can provide an empty file that contains only the header.
-o $training_interactions:
Output file with positive and negative loops in the following format: `chrom anchor1 anchor2 response loop-length`, where `anchor1`/`anchor2` is the genomic coordinate of the midpoint of the left/right anchor.
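When no Hi-C interactions are available, the `-c` option still expects a file with the header. A header-only file could be created as in the sketch below. The tab delimiter, the helper name `write_empty_hic_file` and the output file name are assumptions; check them against what `prepare_training_interactions.py` accepts.

```python
import os
import tempfile

# Column names of the Hi-C interaction file, as listed for the -c option.
HIC_HEADER = ["chrom", "start1", "start2", "response", "length"]


def write_empty_hic_file(path, delimiter="\t"):
    """Write a file that contains only the header line and no interactions."""
    with open(path, "w") as out:
        out.write(delimiter.join(HIC_HEADER) + "\n")
    return path


# Write to a temporary directory so the example leaves the working tree untouched.
out_path = os.path.join(tempfile.mkdtemp(), "empty_hic_interactions.txt")
write_empty_hic_file(out_path)

with open(out_path) as handle:
    lines = handle.read().splitlines()
print(lines)
```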
Usage:
python add_features.py -i $training_interactions -t $information_table -o $training_data
Parameters:
-i $training_interactions:
Output file from step 1.
-t $information_table:
A table containing the paths of genomic and epigenomic datasets to derive features. An example of this table can be seen in data.
-o $training_data:
Output file with positive and negative loops characterized by a set of features.
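The two steps are chained through the file passed to `-o` in step 1 and `-i` in step 2. A small driver that assembles both command lines might look like the sketch below. The helper name `build_training_commands` and all input and output file names are hypothetical; only the script names and flags come from the usage lines above. The commands are printed rather than executed.

```python
import shlex


def build_training_commands(ctcf_peak, chia_pet, hic, interactions_out,
                            info_table, training_out):
    """Return the argument lists for step 1 and step 2, in that order."""
    step1 = [
        "python", "prepare_training_interactions.py",
        "-p", ctcf_peak,
        "-a", chia_pet,
        "-c", hic,
        "-o", interactions_out,
    ]
    # Step 2 consumes the output file written by step 1.
    step2 = [
        "python", "add_features.py",
        "-i", interactions_out,
        "-t", info_table,
        "-o", training_out,
    ]
    return [step1, step2]


# Hypothetical file names, used only for illustration.
commands = build_training_commands(
    "ctcf_peaks.bed",
    "chiapet_interactions.txt",
    "hic_interactions.txt",
    "training_interactions.txt",
    "information_table.txt",
    "training_data.txt",
)
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To execute instead of print: subprocess.run(cmd, check=True)
```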
A model can be generated from the prepared training data by using `train_model.py`.
Usage:
python train_model.py -t $training_data -o $output_folder
Parameters:
-t $training_data:
The file with positive and negative loops characterized by features.
-o $output_folder:
The path of the folder where you want to put the resulting model and cross-validation results. ROC and PR curves are generated.
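To give a rough idea of what cross-validated training of a random forest with ROC and PR evaluation involves, here is a minimal scikit-learn sketch. It is not the code in `train_model.py`: the data are synthetic, and the number of trees, the number of folds and the random seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for a training table: 200 loops, 5 features.
rng = np.random.default_rng(0)
n_loops = 200
y = rng.integers(0, 2, size=n_loops)   # 1 = positive loop, 0 = negative loop
X = rng.normal(size=(n_loops, 5))
X[:, 0] += 2.0 * y                     # make the first feature informative

# Hyperparameters below are illustrative assumptions.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probability of the positive class for every loop.
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

roc_auc = roc_auc_score(y, proba)           # area under the ROC curve
pr_auc = average_precision_score(y, proba)  # summary of the PR curve
print(f"ROC AUC: {roc_auc:.3f}, average precision: {pr_auc:.3f}")
```

With real data, `X` and `y` would come from the feature and response columns of the training file.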
Lollipop employs a random forest classifier to distinguish positive from negative loops. The classifiers trained from the three cell lines (in `.pkl` format) and the de novo predictions made by each classifier are available in `denovo_predictions`. The format of predicted loops is:
| chrom | start1 | start2 | probability | yes_or_no |
|---|---|---|---|---|
Predicted loops that can be visualized in genome browsers, including UCSC genome browser, IGV and Washington U genome browser, are also available in the same folder.
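A prediction file with the `probability` column can be filtered to keep only high-confidence loops. The sketch below assumes a tab-delimited file with a header row. The helper name `high_confidence_loops`, the 0.9 threshold, and the example rows (including the `yes`/`no` values) are illustrative assumptions, not taken from the released files.

```python
import csv
import io


def high_confidence_loops(handle, min_probability=0.9, delimiter="\t"):
    """Return the rows whose predicted probability is at least min_probability."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [row for row in reader if float(row["probability"]) >= min_probability]


# Illustrative in-memory file with made-up predictions.
example = io.StringIO(
    "chrom\tstart1\tstart2\tprobability\tyes_or_no\n"
    "chr1\t1000\t50000\t0.95\tyes\n"
    "chr1\t2000\t90000\t0.40\tno\n"
)
kept = high_confidence_loops(example)
for row in kept:
    print(row["chrom"], row["start1"], row["start2"], row["probability"])
```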
One can also apply the trained models to make de novo predictions in a cell type of interest by running `make_denovo_predictions.py`.
Usage:
python make_denovo_predictions.py -b $CTCF_Peaks -t $information_table -c $classifier -f True or False -o $output_folder
Parameters:
-b $CTCF_Peaks:
CTCF peak file in BED format.
-t $information_table:
A table containing the paths of genomic and epigenomic datasets to derive features.
-c $classifier:
The trained classifier used for making predictions.
-f True or False:
An option to choose whether or not to output features for predicted loops. Default is 'False'.
-o $output_folder:
The output folder for results. Output files include predicted loops in different formats that can be visualized in genome browsers.