JASPAR profile inference tool

This repository contains the data and code used by the JASPAR profile inference tool. For more information please refer to the supplementary data from JASPAR 2016 and 2020.

News

01/03/2024 We have updated the profile inference tool with new profiles for the 2024 release of JASPAR.
01/09/2021 We have updated the profile inference tool with new profiles for the 2022 release of JASPAR.
31/01/2021 We have updated the profile inference tool as described in the similarity regression manuscript.
~~01/09/2019 We have improved the profile inference tool by implementing our own similarity regression method.~~

Content

The conda folder contains the environment.yml file used to develop the profile inference tool for JASPAR 2020 (see Installation)
The examples folder contains the sequences of two transcription factors (TFs) and of one protein that is not a TF (the human serine/threonine-protein kinase mTOR)
The files folder contains the output of the script get-files.py, which downloads TF sequences from UniProt and DNA-binding domains (DBDs) from Pfam, retrieves inference models from Cis-BP, etc.
~~The models folder contains the similarity regression models created by calling the script pairwise.py followed by regression.py~~
The script infer-profile.py takes as input ~~the folders files and models, plus~~ one or more protein sequences in FASTA format (e.g. a proteome), and infers DNA-binding profiles from JASPAR

The original scripts used for the publication of JASPAR 2016 have been placed in the folder version-1.0.

Dependencies

BLAST+
HMMER (version ≥3.0)
Python 3 with the following libraries: Biopython, CoreAPI, GitPython, joblib, ~~glmnet~~, NumPy, pandas, ~~ProDy, SciPy, scikit-learn~~ and tqdm
~~The RSAT matrix-clustering tool~~
~~Tomtom as distributed in the MEME suite (version ≥5.0)~~

Note that the CoreAPI, GitPython, ~~glmnet, SciPy and scikit-learn,~~ and ProDy Python packages are not required for running infer-profile.py.
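
Before running the inference script, it can help to confirm that the packages and command-line tools listed above are visible from the active environment. The following is a minimal sketch, not part of this repository: the function name `find_missing` is hypothetical, and the executable names `blastp` and `hmmscan` are assumptions about which BLAST+ and HMMER programs are relevant.

```python
import importlib.util
import shutil

# Import names of the Python packages listed above as needed for inference
# (Biopython is imported as "Bio").
REQUIRED_MODULES = ["Bio", "joblib", "numpy", "pandas", "tqdm"]

# Assumption: these are the BLAST+ and HMMER executables of interest;
# adjust the list to whatever the scripts actually call.
ASSUMED_EXECUTABLES = ["blastp", "hmmscan"]


def find_missing(modules=REQUIRED_MODULES, executables=ASSUMED_EXECUTABLES):
    """Return (missing_modules, missing_executables) as two lists of names."""
    missing_modules = [
        name for name in modules
        if importlib.util.find_spec(name) is None
    ]
    missing_executables = [
        name for name in executables
        if shutil.which(name) is None
    ]
    return missing_modules, missing_executables


if __name__ == "__main__":
    mods, exes = find_missing()
    print("Missing Python modules:", ", ".join(mods) or "none")
    print("Missing executables:", ", ".join(exes) or "none")
```

An empty result for both lists only shows that the names can be found; it does not check the version requirements stated above.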

Installation

All dependencies can be installed through the conda package manager:

conda env create -f ./conda/environment.yml

Update

To update the tool to the latest release of JASPAR, execute get_files.py as follows:

cd files
./get_files.py --update

Usage

To illustrate how the profile inference tool can be used, we provide an example for the zebrafish TF egr1 and the fission yeast TF tbp1:

$ ./infer-profile.py --latest ./examples/egr1+tbp1.fa 
100%|████████████████████| 2/2 [00:08<00:00,  4.28s/it]
Query   TF Name TF Matrix       E-value Query Start-End TF Start-End    DBD %ID
sp|P26632|EGR1_DANRE    EGR1    MA0162.2        0.0     1-511   1-543   0.971
sp|P26632|EGR1_DANRE    EGR3    MA0732.1        6.81e-89        57-410  38-374  0.899
sp|P26632|EGR1_DANRE    Egr2    MA0472.1        5.95e-72        55-398  38-424  0.942
sp|P26632|EGR1_DANRE    EGR4    MA0733.1        9.11e-51        306-401 478-573 0.783
sp|P17871|TBP_SCHPO     SPT15   MA0386.1        8.18e-126       17-230  29-239  0.912
sp|P17871|TBP_SCHPO     TBP     MA0108.2        3.66e-109       8-230   114-337 0.771

The tool infers that the motif of sp|P26632|EGR1_DANRE should be similar to EGR1, EGR2, EGR3 and EGR4, and that the motif of sp|P17871|TBP_SCHPO should be similar to SPT15 and TBP.
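
To reuse the table in a downstream analysis, the output can be parsed and grouped by query. The sketch below assumes the table is tab-separated, has been captured as text without the progress bar, and uses the column names shown in the header. The function name `matrices_by_query` is hypothetical, and the two data rows are copied from the example above.

```python
import csv
import io
from collections import defaultdict

# Header and two rows copied from the example output, joined with tabs.
EXAMPLE_OUTPUT = (
    "Query\tTF Name\tTF Matrix\tE-value\tQuery Start-End\tTF Start-End\tDBD %ID\n"
    "sp|P26632|EGR1_DANRE\tEGR1\tMA0162.2\t0.0\t1-511\t1-543\t0.971\n"
    "sp|P17871|TBP_SCHPO\tSPT15\tMA0386.1\t8.18e-126\t17-230\t29-239\t0.912\n"
)


def matrices_by_query(text):
    """Map each query identifier to the list of JASPAR matrix IDs inferred for it."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["Query"]].append(row["TF Matrix"])
    return dict(grouped)


for query, matrices in matrices_by_query(EXAMPLE_OUTPUT).items():
    print(query, ",".join(matrices))
```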

As a Python module

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import importlib
infer_profile = importlib.import_module("infer-profile")

# Transcription factor Sox-3-B of Xenopus laevis
# https://www.uniprot.org/uniprot/Q5FWM3.fasta
seq = [
    "MYSMLDTDMKSPVQQSNALSGGPGTPGGKGNTSTPDQDRVKRPMNAFMVWSRGQRRKMAQ",
    "ENPKMHNSEISKRLGADWKLLSDSEKRPFIDEAKRLRAVHMKDYPDYKYRPRRKTKTLLK",
    "KDKYSLPGNLLAPGINPVSGGVGQRIDTYPHMNGWTNGAYSLMQEQLGYGQHPAMNSSQM",
    "QQIQHRYDMGGLQYSPMMSSAQTYMNAAASTYSMSPAYNQQSSTVMSLASMGSVVKSEPS",
    "SPPPAITSHTQRACLGDLRDMISMYLPPGGDAGDHSSLQNSRLHSVHQHYQSAGGPGVNG",
    "TVPLTHI"
]

# Load data
cisbp = infer_profile.__load_CisBP_models()
jaspar = infer_profile.__load_JASPAR_files_n_models()

# Infer profiles
seq_record = SeqRecord(Seq("".join(seq)), id="Sox-3-B")
inferred_profiles = infer_profile.infer_SeqRecord_profiles(
    seq_record, cisbp, jaspar, latest=True)

# Print
rows = [["Query", "TF Name", "TF Matrix", "E-value", "Query Start-End",
        "TF Start-End", "DBD %ID"]]
for inferred_profile in inferred_profiles:
    rows.append(inferred_profile)
for row in rows:
    print("\t".join(map(str, row)))

Query   TF Name TF Matrix       E-value Query Start-End TF Start-End    DBD %ID
Sox-3-B Sox3    MA0514.1        3.91e-129       1-307   1-375   0.942
Sox-3-B POU2F1::SOX2    MA1962.1        9.56e-115       1-307   1-317   0.913
Sox-3-B SOX2    MA0143.4        9.56e-115       1-307   1-317   0.913
Sox-3-B Pou5f1::Sox2    MA0142.1        7.27e-112       1-307   1-319   0.913
Sox-3-B Sox2    MA0143.1        7.27e-112       1-307   1-319   0.913
Sox-3-B Sox1    MA0870.1        6.37e-81        1-307   1-391   0.884
Sox-3-B SOX21   MA0866.1        6.71e-54        38-127  6-95    0.899
Sox-3-B SOX14   MA1562.1        2.24e-53        38-127  6-95    0.884
Sox-3-B D       MA0445.1        7.12e-45        32-145  134-239 0.826
Sox-3-B SOX15   MA1152.1        3.64e-44        22-117  34-126  0.812
Sox-3-B SRY     MA0084.1        1.11e-41        27-187  51-198  0.667
Sox-3-B Sox11   MA0869.1        4.58e-35        40-117  49-126  0.696
Sox-3-B SOX18   MA1563.1        3.38e-34        25-117  70-162  0.551
Sox-3-B SOX4    MA0867.1        1.36e-33        40-117  59-136  0.696
Sox-3-B Sox17   MA0078.1        1.6e-33 18-143  50-168  0.58
Sox-3-B SOX12   MA1561.1        1.11e-32        40-116  40-116  0.681
Sox-3-B SOX9    MA0077.1        9.43e-32        31-114  96-179  0.638
Sox-3-B SOX8    MA0868.1        1.56e-31        40-114  102-176 0.652
Sox-3-B SOX10   MA0442.1        1.74e-31        31-128  95-192  0.638
Sox-3-B Sox6    MA0515.1        6.99e-26        40-126  620-706 0.551
Sox-3-B Sox5    MA0087.1        1.03e-25        40-126  556-642 0.551
Sox-3-B SOX13   MA1120.1        4.97e-25        40-120  424-504 0.551
