SATORI v0.2

SATORI is a Self-ATtentiOn based deep learning model that captures Regulatory element Interactions in genomic sequences. It can be used to infer a global landscape of interactions in a given genomic dataset, with a minimal post-processing step.

Manuscript

Fahad Ullah, Asa Ben-Hur, A self-attention model for inferring cooperativity between regulatory features, Nucleic Acids Research, 2021;, gkab349, https://doi.org/10.1093/nar/gkab349

Dependency

SATORI is written in python 3. The following python packages are required:
biopython (version 1.75)
captum (version 0.2.0)
fastprogress (version 0.1.21)
matplotlib (vresion 3.1.3)
numpy (version 1.17.2)
pandas (version 0.25.1)
pytorch (version 1.2.0)
scikit-learn (vresion 0.24)
scipy (version 1.4.1)
seaborn (version 0.9.0)
statsmodels (version 0.9.0)

and for motif analysis:
MEME suite
WebLogo

Installation

Download SATORI (via git clone):

git clone git@github.com:fahadahaf/satori.git satori

Navigate to the cloned directory:

cd satori

Install SATORI:

python setup.py install

Make the main script (satori.py) executable:

chmod +x satori.py

(Optional) To execute the script everywhere, update the PATH and PYTHONPATH environment variables:

export PATH=path-to-satori:$PATH
export PYTHONPATH=path-to-satori/satori:$PYTHONPATH

Usage

usage: satori.py [-h] [-v] [-o DIRECTORY] [-m MODE] [--deskload]
                 [-w NUMWORKERS] [--splitperc SPLITPERC] [--motifanalysis]
                 [--scorecutoff SCORECUTOFF] [--tomtompath TOMTOMPATH]
                 [--database TFDATABASE] [--annotate ANNOTATETOMTOM] [-i]
                 [-b INTBACKGROUND] [--attncutoff ATTNCUTOFF]
                 [--fiscutoff FISCUTOFF] [--intseqlimit INTSEQLIMIT] [-s]
                 [--numlabels NUMLABELS] [--tomtomdist TOMTOMDIST]
                 [--tomtompval TOMTOMPVAL] [--testall] [--useall]
                 [--precisionlimit PRECISIONLIMIT]
                 [--attrbatchsize ATTRBATCHSIZE] [--method METHODTYPE]
                 inputprefix hparamfile

Main SATORI script.

positional arguments:
  inputprefix           Input file prefix for the bed/text file and the
                        corresponding fasta file (sequences).
  hparamfile            Name of the hyperparameters file to be used.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         verbose output [default is quiet running]
  -o DIRECTORY, --outDir DIRECTORY
                        output directory
  -m MODE, --mode MODE  Mode of operation: train or test.
  --deskload            Load dataset from desk. If false, the data is
                        converted into tensors and kept in main memory (not
                        recommended for large datasets).
  -w NUMWORKERS, --numworkers NUMWORKERS
                        Number of workers used in data loader. For loading
                        from the desk, use more than 1 for faster fetching.
  --splitperc SPLITPERC
                        Pecentages of test, and validation data splits, eg. 10
                        for 10 percent data used for testing and validation.
  --motifanalysis       Analyze CNN filters for motifs and search them against
                        known TF database.
  --scorecutoff SCORECUTOFF
                        In case of binary labels, the positive probability
                        cutoff to use.
  --tomtompath TOMTOMPATH
                        Provide path to where TomTom (from MEME suite) is
                        located.
  --database TFDATABASE
                        Search CNN motifs against known TF database. Default
                        is Human CISBP TFs.
  --annotate ANNOTATETOMTOM
                        Annotate tomtom motifs. The value of this variable
                        should be path to the database file used for
                        annotation. Default is None.
  -i, --interactions    Self attention based feature(TF) interactions
                        analysis.
  -b INTBACKGROUND, --background INTBACKGROUND
                        Background used in interaction analysis: shuffle (for
                        di-nucleotide shuffled sequences with embedded
                        motifs.), negative (for negative test set). Default is
                        not to use background (and significance test).
  --attncutoff ATTNCUTOFF
                        Attention cutoff value. For a given interaction, it
                        should have an attention value at least as high as
                        this value across all examples.
  --fiscutoff FISCUTOFF
                        FIS score cutoff value. For a given interaction, it
                        should have an FIS score at least as high as this
                        value across all examples.
  --intseqlimit INTSEQLIMIT
                        A limit on number of input sequences to test. Default
                        is -1 (use all input sequences that qualify).
  -s, --store           Store per batch attention and CNN outpout matrices. If
                        false, the are kept in the main memory.
  --numlabels NUMLABELS
                        Number of labels. 2 for binary (default). For multi-
                        class, multi label problem, can be more than 2.
  --tomtomdist TOMTOMDIST
                        TomTom distance parameter (pearson, kullback, ed etc).
                        Default is euclidean (ed). See TomTom help from MEME
                        suite.
  --tomtompval TOMTOMPVAL
                        Adjusted p-value cutoff from TomTom. Default is 0.05.
  --testall             Test on the entire dataset (default False). Useful for
                        interaction/motif analysis.
  --useall              Use all examples in multi-label problem instead of
                        using precision based example selection. Default is
                        False.
  --precisionlimit PRECISIONLIMIT
                        Precision limit to use for selecting examples in case
                        of multi-label problem.
  --attrbatchsize ATTRBATCHSIZE
                        Batch size used while calculating attributes for FIS
                        scoring. Default is 12.
  --method METHODTYPE   Interaction scoring method to use; options are:
                        SATORI, FIS, or BOTH. Default is SATORI.

Tutorial

Pre-processing

TO-DO

Example: binary classification

For the TAL-GATA experiment:

satori.py data/TAL-GATA_ChIPSeq/Final_dataset_combined_uniq_neg80k_binaryFeat modelsparam/CNN-RNN-MH-noEmbds_hyperParams.txt -w 8 --outDir results/TAL-GATA_Analysis --mode train -v -s --background negative --intseqlimit 5000 --numlabels 2 --motifanalysis --interactions --method BOTH --attrbatchsize 18 --deskload --tomtompath PATH-TO-TOMTOM-TOOL --database PATH-TO-MEME-TF-DATABASE

Example: multi-label classification

For the arabidopsis genomewide chromatin accessibility dataset:

satori.py data/Arabidopsis_ChromAccessibility/atAll_m200_s600 modelsparam/CNN-RNN-MH-noEmbds_hyperParams.txt -w 8 --outDir results/Arabidopsis_GenomeWide_Analysis --mode train -v -s --background shuffle --intseqlimit 5000 --numlabels 36 --motifanalysis --interactions --method BOTH --attrbatchsize 32 --deskload --tomtompath PATH-TO-TOMTOM-TOOL --database PATH-TO-MEME-TF-DATABASE

Note: make sure to specify path to the TomTom tool and the corresponding motif database.
PATH-TO-TOMTOM-TOOL path to TomTom tool in the MEME suite.
PATH-TO-MEME-TF-DATABASE path to the TF database to use (MEME suite comes with different databases).

Post-processing

The resutls are processed in separate Jupyter notebooks in the analysis directory. The notebooks assume that the results are in results folder, at the root (top level) directory of the repository.

jeffhsu3 / satori