manojmw / MultiOmics-ExomeSeq-Phenotype

Main Repository for my MASTER'S THESIS PROJECT

Introduction

  • This is the main repository containing all the scripts for my Master's thesis project.
  • My project was to develop a multi-omics method that uses supervised machine learning to score genomic variants which are potentially causal for a particular phenotype in a patient.
  • The classifier takes input features (an aggregation of different omics data as scalar values) and produces a score between 0 and 1 (see the sketch after this list).
    • Higher score: more likely that the variant is potentially causal
    • Lower score: less likely that the variant is potentially causal
  • The algorithm finally ranks the variants based on these scores to identify novel disease genes in each patient.
  • This repository contains individual scripts which work at the Gene level.
  • I have integrated these into the Exome-Seq Secondary Analysis Pipeline, which works at the Clinical level.
  • The result files produced by this pipeline now contain the following data:
    • Clinical
    • Variant
    • Gene
    • Interactome
    • Expression
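
To make the scoring idea concrete, here is a minimal scikit-learn sketch (scikit-learn is among the listed ML dependencies). This is not the thesis classifier: the feature names, training data, and model choice are invented for illustration.

# Minimal sketch: a supervised classifier maps per-variant feature vectors
# to a causality score in [0, 1]. All names and values below are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: one row per variant, scalar omics features
# plus a binary label (1 = known causal, 0 = presumed benign).
train = pd.DataFrame({
    "interactome_score": [0.8, 0.1, 0.6, 0.05],
    "expression_score":  [0.9, 0.2, 0.4, 0.10],
    "variant_impact":    [1.0, 0.0, 0.5, 0.00],
    "label":             [1, 0, 1, 0],
})

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train.drop(columns="label"), train["label"])

# predict_proba()[:, 1] is the probability of the 'causal' class,
# i.e. the 0-1 score used to rank variants (higher = more likely causal).
features = train.drop(columns="label")
ranked = features.assign(score=clf.predict_proba(features)[:, 1])
ranked = ranked.sort_values("score", ascending=False)
print(ranked)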

NOTE:

  • The Machine learning scripts are used for prioritizing genomic variants.
  • They can be used on the patient sample result file generated by the pipeline.
  • You can find the example usage of the scripts in the MachineLearning directory of the repository.

Example Usage

UniProt Parser

  • Parses on STDIN a UniProt file and extracts the required data from each record
  • Prints to STDOUT in .tsv format

-> Grab the latest UniProt data with:

wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

-> Parse UniProt data to produce output with:

gunzip -c uniprot_sprot.dat.gz | python3 1_Uniprot_parser.py > Uniprot_output.tsv
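
For orientation, here is a minimal sketch of how such a flat-file parser can work. It is not the actual 1_Uniprot_parser.py; the extracted fields and output layout are simplified.

# Sketch of UniProt flat-file parsing: records end with a '//' line and each
# line starts with a two-letter code (AC = accessions, OX = taxonomy,
# DR = cross-references). Output columns here are simplified.
import sys

accessions, taxid, ensts, ensgs = [], "", [], []
for line in sys.stdin:
    if line.startswith("AC   "):
        # AC lines list the primary accession first, then secondary ones
        accessions += [a.strip() for a in line[5:].split(";") if a.strip()]
    elif line.startswith("OX   NCBI_TaxID="):
        taxid = line[16:].split(";")[0].split()[0]
    elif line.startswith("DR   Ensembl;"):
        # DR Ensembl lines carry ENST/ENSP/ENSG cross-references
        for field in line[13:].replace(";", " ").split():
            if field.startswith("ENST"):
                ensts.append(field.split(".")[0])
            elif field.startswith("ENSG"):
                ensgs.append(field.split(".")[0])
    elif line.startswith("//"):
        if accessions:
            print("\t".join([accessions[0], taxid,
                             ",".join(ensts), ",".join(ensgs)]))
        accessions, taxid, ensts, ensgs = [], "", [], []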


Protein-Protein Interaction Parser

  • Parses a Protein-Protein Interaction (PPI) File (miTAB 2.5 or 2.7)
  • Maps interactors to UniProt accessions using the output file produced by 1_Uniprot_parser.py and prints to STDOUT in .tsv format

1] Grab the latest BioGRID data with:

wget https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/BIOGRID-ORGANISM-LATEST.mitab.zip

-> Unzip with:

unzip BIOGRID-ORGANISM-LATEST.mitab.zip

-> This will produce one miTAB File per Organism (Use BIOGRID-ORGANISM-Homo_sapiens*.mitab.txt for human data)


2] Grab the latest IntAct data with:

wget ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psimitab/intact.zip

-> Unzip with:

unzip intact.zip

-> This will produce 2 files (intact.txt & intact_negative.txt). Use intact.txt for further steps


3] Parse PPI data with:

python3 2_Interaction_parser.py --inInteraction BIOGRID-ORGANISM-Homo_sapiens*.mitab.txt --inUniprot Uniprot_output.tsv > Exp_Biogrid.tsv
python3 2_Interaction_parser.py --inInteraction intact.txt --inUniprot Uniprot_output.tsv > Exp_Intact.tsv

-> The above example is for the Protein-Protein Interaction data from BioGRID and IntAct. However, you can retrieve PPI data (in miTAB format) from any database and feed it to the script to produce an output file.
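
As a rough sketch of the miTAB side (not the actual 2_Interaction_parser.py; real records also need secondary-identifier and isoform handling), run as e.g. python3 sketch.py BIOGRID-ORGANISM-Homo_sapiens-X.mitab.txt:

# Sketch of miTAB parsing: the file is tab-separated; columns 0-1 hold
# interactor identifiers such as 'uniprotkb:P12345', column 6 the
# interaction detection method.
import re, sys

def uniprot_ac(field):
    """Extract a UniProt accession from a miTAB identifier field, if present."""
    m = re.search(r"uniprotkb:([A-Z][A-Z0-9-]+)", field)
    return m.group(1) if m else None

with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith("#"):           # header line
            continue
        cols = line.rstrip("\n").split("\t")
        a, b = uniprot_ac(cols[0]), uniprot_ac(cols[1])
        if a and b:                        # keep only pairs that map to UniProt
            print(f"{a}\t{b}\t{cols[6]}")  # column 6 = detection method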



PPI Experiment Count

  • Parses a Protein-Protein Interaction File (miTAB 2.5 or 2.7)
  • Prints the count of Human-Human Protein Interaction experiments to STDOUT

-> Provide a miTAB 2.5 or 2.7 file on STDIN with:

python3 3_Count_HumanPPIExp.py < miTAB File
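
The underlying check is simple; here is a minimal sketch (the actual 3_Count_HumanPPIExp.py may count differently, e.g. after identifier cleanup):

# Sketch: in miTAB, columns 9 and 10 hold the taxonomy IDs of interactors
# A and B; taxid:9606 is human. Each data line is one detection experiment.
import sys

count = 0
for line in sys.stdin:
    if line.startswith("#"):               # header line
        continue
    cols = line.rstrip("\n").split("\t")
    if len(cols) > 10 and "taxid:9606" in cols[9] and "taxid:9606" in cols[10]:
        count += 1
print(count)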


Build Interactome (True Binary Interactions only)

  • High-Quality Interactome Criteria (a sketch of the MI-code filters follows below):

    1] Filtering Interactions based on Interaction Detection Method: - We filter out pull down (MI:0096), genetic interference (MI:0254) & unspecified method (MI:0686)

    2] Filtering Interactions based on Interaction Type: - We keep only direct interaction (MI:0407) & physical association (MI:0915)

    3] Here, we try to eliminate most of the EXPANSION DATA and consider only TRUE BINARY INTERACTIONS

    4] Each Interaction must be supported by ≥ 2 experiments, at least one of which must be proven by a BINARY METHOD

    5] Eliminating Hub/Sticky proteins (A protein is considered a hub if it has > 120 interactors. This number is based upon the degree distribution of the entire Interactome before eliminating hub/sticky proteins).

-> Build High-Quality Human Interactome with:

python3 4_BuildInteractome_BinaryPPIonly.py --inExpFile Exp_Biogrid.tsv Exp_Intact.tsv --inUniprot Uniprot_output.tsv --inCanonicalFile canonicalTranscripts_*.tsv.gz > Interactome_human.tsv

-> To obtain the canonical transcripts file, please refer to grexome-TIMC-Secondary. The Build Interactome scripts accept multiple processed Protein-Protein Interaction experiment files (--inExpFile).
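
The MI-code filters of criteria 1] and 2] can be sketched as follows (illustrative only; the real script also enforces the ≥ 2-experiments, binary-method, and hub rules):

# Sketch of the MI-code filters behind the binary-only interactome.
EXCLUDED_METHODS = {"MI:0096", "MI:0254", "MI:0686"}  # pull down, genetic interference, unspecified
KEPT_TYPES = {"MI:0407", "MI:0915"}                   # direct interaction, physical association

def keep_experiment(detection_method_mi, interaction_type_mi):
    """Apply criteria 1] and 2] to a single PPI experiment."""
    return (detection_method_mi not in EXCLUDED_METHODS
            and interaction_type_mi in KEPT_TYPES)

# Example:
print(keep_experiment("MI:0018", "MI:0407"))  # True  (two hybrid, direct interaction)
print(keep_experiment("MI:0096", "MI:0915"))  # False (pull down is excluded)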



Build Interactome (True Binary Interactions with Expansion)

  • High-Quality Interactome Criteria:

    1] Filtering Interactions based on Interaction Detection Method: - We filter out genetic interference (MI:0254) & unspecified method (MI:0686)

    2] Here, we consider both TRUE BINARY INTERACTIONS and PPIs derived from EXPANSION

    3] Each Interaction should be proven by ≥ 2 experiments

  • Note: The Interactome containing expansion data should not be used for identifying disease-enriched modules, as the clustering algorithms fail to cluster such a network correctly, leading to wrong results. This script is included optionally, in case someone wants to use it for other purposes.

-> Build High-Quality Human Interactome with:

python3 5_BuildInteractome_BinaryPPIwithExpansion.py --inExpFile Exp_Biogrid.tsv Exp_Intact.tsv --inUniprot Uniprot_output.tsv --inCanonicalFile canonicalTranscripts_*.tsv.gz > Interactome_human_binarywithexpansion.tsv


Module Input File Generator

  • Parses the output produced by 4_BuildInteractome_BinaryPPIonly.py
  • Assigns a default edge weight = 1 for each interaction and prints to STDOUT in .tsv format
  • This can be used as INPUT for most of the module identification/clustering methods

-> Generate Module Input File with:

python3 6_ModuleInputFile.py < Interactome_human.tsv
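
Since the transformation is a plain column rewrite, it can be sketched in a few lines (assuming the two interactors sit in the first two columns of the interactome file):

# Sketch: read the interactome TSV on STDIN, append a default edge weight
# of 1 to each interacting pair, and print to STDOUT.
import sys

for line in sys.stdin:
    cols = line.rstrip("\n").split("\t")
    if len(cols) >= 2:
        print(f"{cols[0]}\t{cols[1]}\t1")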


UniProt2ENSG Mapper

  • Parses the output files produced by 1_Uniprot_parser.py and the canonical transcripts file (Ex: canonicalTranscripts_220221.tsv)
  • Maps UniProt accession to ENSG and prints to STDOUT

-> Run UniProt2ENSG Mapper with:

python3 7_Uniprot2ENSG.py --inUniprot Uniprot_output.tsv --inCanonicalFile canonicalTranscripts_220221.tsv
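
The mapping is essentially a join through the canonical ENST. Here is a minimal sketch; the column positions are assumptions for illustration, not the real file layouts:

# Hypothetical column layouts (assumptions):
#   canonical transcripts TSV: ENST in column 0, ENSG in column 1
#   UniProt TSV (from 1_Uniprot_parser.py): accession in column 0,
#   comma-separated ENSTs in column 2
import csv

enst2ensg = {}
with open("canonicalTranscripts_220221.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) >= 2:
            enst2ensg[row[0]] = row[1]

with open("Uniprot_output.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) < 3:
            continue
        for enst in row[2].split(","):
            if enst in enst2ensg:
                print(f"{row[0]}\t{enst2ensg[enst]}")  # accession -> canonical ENSG
                break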


Naïve Approach (1-hop Neighborhood Approach)

  • Parses the Sample metadata file (.xlsx), UniProt File, Canonical transcripts file, Candidate Gene file(s), Interactome file, and GTEX File
  • For a given gene:
    • Checks if the gene is already a known candidate
    • Checks the number of Interactors
    • Checks the number of Interactors that are known candidates
    • Applies Fisher's Exact test to compute P-values (see the sketch below)
    • Adds the total count & a comma-separated list of candidate genes within the 2-hop neighborhood
    • Additionally adds GTEX data
    • Prints to STDOUT in .tsv format
  • This script provides one of the scoring components for the Machine Learning step

-> Run 8_NaiveApproach.py script with:

python3 8_NaiveApproach.py --inSampleFile sample.xlsx --inUniprot Uniprot_output.tsv --inCandidateFile candidateGenes.xlsx --inCanonicalFile canonicalTranscripts_220221.tsv --inInteractome Interactome_human.tsv --inGTEXFile E-MTAB-5214-query-results.tpms.tsv

-> You can use the GTEX file provided in this repository. (Note: The GTEX file provided in the repository might not be the latest. If you want to retrieve the latest GTEX file, please visit https://www.ebi.ac.uk/gxa/home).
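
The enrichment test uses a 2x2 contingency table of a gene's interactors vs. the rest of the interactome, split by known-candidate status. Here is a minimal SciPy sketch with invented counts (SciPy is a listed dependency):

# Sketch of the Fisher's Exact test behind the naive approach.
from scipy.stats import fisher_exact

candidate_interactors    = 5      # interactors of the gene that are known candidates
noncandidate_interactors = 45     # its remaining interactors
candidates_elsewhere     = 95     # known candidates not interacting with the gene
noncandidates_elsewhere  = 14855  # everything else in the interactome

table = [[candidate_interactors, noncandidate_interactors],
         [candidates_elsewhere,  noncandidates_elsewhere]]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2e}")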



DREAM Challenge: Cluster File Processing

  • This script is for processing the cluster file produced by the MONET TOOL (DREAM Challenge)
  • Parses on STDIN a .tsv file produced by the MONET tool, processes it, and prints to STDOUT in .cls format
  • The output can be used as the input Cluster File for 10_Naive_withClusteringApproach.py
  • Note: For producing the Interactome Clustering File and using this script, please refer to the "Interactome Clustering Methods" section


Naïve with Clustering Approach

  • This script is similar to 8_NaiveApproach.py, but the output additionally contains the Interactome Clustering data

-> Run 10_Naive_withClusteringApproach.py script with:

python3 10_Naive_withClusteringApproach.py --inSampleFile sample.xlsx --inUniprot Uniprot_output.tsv --inCandidateFile candidateGenes_*.xlsx --inCanonicalFile canonicalTranscripts_220221.tsv --inInteractome Interactome_human.tsv --inClusterFile K1Clustering_clusterFile.cls --inGTEXFile E-MTAB-5214-query-results.tpms.tsv

-> The clustering data provide an additional scoring component for the Machine Learning step.

Output

  • For a detailed description of each script's output, please use the --help (-h) option.
  • You can also view the sample output files provided in the Sample_Output_Files directory of the repository

Interactome Clustering Methods

  • We consider Clusters with a size between 3 and 130 (inclusive)

  • If the cluster size exceeds 130, the methods are applied recursively (MONET tool automatically does this) to obtain clusters of the desired size.

  • I have mainly tested four clustering methods:

    1] Kernel clustering approach (K1 method from DREAM Challenge) (Choobdar, Sarvenaz, et al. "Assessment of network module identification across complex diseases." Nature methods vol. 16,9 (2019): 843-852. doi:10.1038/s41592-019-0509-5)

    2] Modularity Optimization method (M1 method from DREAM Challenge) (Choobdar, Sarvenaz, et al. "Assessment of network module identification across complex diseases." Nature methods vol. 16,9 (2019): 843-852. doi:10.1038/s41592-019-0509-5)

    3] Random-walk-based method (R1 method from DREAM Challenge) (Choobdar, Sarvenaz, et al. "Assessment of network module identification across complex diseases." Nature methods vol. 16,9 (2019): 843-852. doi:10.1038/s41592-019-0509-5)


     - To run the above clustering methods on the Interactome file generated by the Build Interactome scripts, please use the MONET tool (Tomasoni, Mattia et al. "MONET: a toolbox integrating top-performing methods for network modularization." Bioinformatics (Oxford, England) vol. 36,12 (2020): 3920-3921. doi:10.1093/bioinformatics/btaa236) available at: https://github.com/BergmannLab/MONET
    
     - If you will be using the cluster file produced by any of these methods, please process it with the 9_ProcessClusterFile_MONET.py script using the command:
    
       cat cluster_outputFile.tsv | python3 9_ProcessClusterFile_MONET.py > ClusterFile.cls
    
     - Input File Description:
    
          cluster_outputFile.tsv:    Clustering Output File produced by the MONET tool (DREAM Challenge)
    

    4] Randomized optimization of modularity (Didier, Gilles, et al. "Identifying communities from multiplex biological networks by randomized optimization of modularity." F1000Research vol. 7 1042. 10 Jul. 2018, doi:10.12688/f1000research.15486.2)


    - To run this clustering method on the Interactome file generated by the Build Interactome scripts, please use the MolTi-DREAM tool described at: https://github.com/gilles-didier/MolTi-DREAM
    
    - This might generate some large clusters (i.e., size > 130). In such cases, please run the tool recursively as described on the MolTi-DREAM GitHub page
    
    - The output produced by this tool does not need further processing and can be used directly as input for the 5.2_addInteractome.py script
    
  • You can use one of the above methods or any other clustering method, but the Cluster File (please refer to the sample_clusterFile.cls file in the Sample_Input_Files directory of the repository) should be of the format:


    - A header line (Ex: #ClustnSee analysis export)
    - Followed by a ClusterID line (Ex: ClusterID:1||)
    - Followed by the names (ENSG) of the genes in the Cluster (Ex: ENSG00000162819)
    - An empty line indicates the end of a given Cluster
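
A minimal sketch of emitting this layout from a cluster-ID-to-genes mapping (values are illustrative):

# Sketch: write clusters in the expected .cls layout.
clusters = {
    1: ["ENSG00000162819", "ENSG00000141510"],
    2: ["ENSG00000012048"],
}

print("#ClustnSee analysis export")
for cid, genes in clusters.items():
    print(f"ClusterID:{cid}||")
    for g in genes:
        print(g)
    print()   # empty line terminates the cluster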
    

Arguments

Arguments [defaults] -> Can be abbreviated to shortest unambiguous prefixes

# UniProt Files
   --inUniprot                          A tab-separated Input File name (produced by 1_Uniprot_parser.py) containing UniProt Primary Accession, Taxonomy Identifier, ENST(s), ENSG(s), UniProt Secondary Accession(s), Gene ID(s) & Gene name(s)

# Protein-Protein Interaction File(s)                                     
   --inInteraction                      miTAB 2.5 or 2.7 Input File name (Protein-Protein Interaction File)

# Protein-Protein Interaction Experiment File(s)   
   --inExpFile                          PPI Experiments Input File name (produced by 2_Interaction_parser.py)

# Canonical Transcripts File
   --inCanonicalFile                    Canonical Transcripts Input File name (.gz or non .gz)
   
# Sample File
   --inSampleFile                       Sample Metadata Input File name (.xlsx)   

# Candidate Gene File(s)
   --inCandidateFile                    Candidate Gene Input File(s) name (.xlsx)

# Interactome File
   --inInteractome                      High-Quality Interactome Input File name (produced by 4_BuildInteractome_BinaryPPIonly.py/5_BuildInteractome_BinaryPPIwithExpansion.py)
   
# GTEX File
   --inGTEXFile                         GTEX Input File name (.tsv)

# Help
   -h, --help                           Show the help message and exit

Metadata files

  • Currently, the scripts use two metadata files: samples.xlsx & candidateGenes.xlsx.
  1. samples.xlsx:

    • This metadata file describes the samples.
    • Required columns:
      • pathologyID: the phenotype of each patient/sample, used to define the "cohorts".
      • sampleID: unique identifier for each sample (Used by Machine learning scripts)
    • Optional columns (currently not used by the scripts) such as:
      • specimenID: the external identifier for each sample, typically related to the BAM or FASTQ filenames.
      • patientID: a more user-friendly identifier for each sample
      • Sex: 'F' or 'M'
  2. candidateGenes.xlsx:

    • Lists known candidate genes/implicated seed genes
    • Required columns:
      • Gene: gene name (should be the HGNC name, see www.genenames.org).
      • pathologyID: pathology/phenotype
    • Optional column (currently not used by the scripts) such as:
      • Confidence score: indicates how confident you are that LOF variants in this gene are causal for this pathology. Value: integers from 1 to 5 (5 meaning the gene is definitely causal, while 1 is a lower-confidence candidate).

Dependencies

  • Python version >= 3
  • External dependencies are kept to a minimum in all the scripts. The only required Python modules are listed below:
    • OpenPyXl == 3.0.10
    • SciPy == 1.5.2
  • You can install these with pip/conda
  • Most other standard core modules should already be available on your system
  • Additional dependencies for Machine learning:
    • Scikit-learn == 1.1.1
    • Imbalanced-learn == 0.9.1
    • Pandas == 1.4.3
    • Joblib == 1.1.0

License

Licensed under GNU General Public License v3.0 (Refer to LICENSE file for more details)
