jiaying2508 / LYRUS

LYRUS: A Machine Learning Model for Predicting the Pathogenicity of Missense Variants

LYRUS incorporates five sequence-based, six structure-based, and four dynamics-based features. Uniquely, LYRUS includes a newly proposed sequence co-evolution feature called variation number. LYRUS was trained on a dataset of 4,363 protein structures corresponding to 22,639 single amino acid variants (SAVs) from the ClinVar database.

The method is described in Jiaying Lai, Jordan Yang, Ece D Gamsiz Uzun, Brenda M Rubenstein, Indra Neil Sarkar, LYRUS: a machine learning model for predicting the pathogenicity of missense variants, Bioinformatics Advances, Volume 2, Issue 1, 2022, vbab045, https://doi.org/10.1093/bioadv/vbab045.

LYRUS is built on top of several existing Python libraries and other external software, and has been tested with Python 3.7.4.

Required python packages

Python packages (most of which can be installed using pip) needed to run LYRUS include:

Required external packages

LYRUS also depends on the following external packages:

Install the command-line versions of:

  1. Clustal Omega: http://www.clustal.org/omega/
  2. PAUP: http://phylosolutions.com/paup-test/

Download the following and place them in the LYRUS directory:

  1. plmc-master: https://github.com/debbiemarkslab/plmc
  2. FoldX: http://foldxsuite.crg.eu
  3. FreeSASA: https://freesasa.github.io
  4. MAESTRO: https://pbwww.services.came.sbg.ac.at/?page_id=477
  5. P2Rank: https://github.com/rdk/p2rank
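Before running LYRUS, it can help to confirm that the external tools listed above are actually reachable. The following is a minimal sketch, not part of LYRUS itself: the helper name find_missing is hypothetical, and the executable names (clustalo, paup) and directory names (plmc-master, MAESTRO_OSX_x64, p2rank_2.2) are assumptions that depend on the platform and on the versions you downloaded.

```python
import os
import shutil


def find_missing(executables, directories, base_dir="."):
    """Return (missing_executables, missing_directories).

    executables: command names expected on the PATH.
    directories: folder names expected inside base_dir (the LYRUS directory).
    """
    missing_exe = [name for name in executables if shutil.which(name) is None]
    missing_dir = [
        name for name in directories
        if not os.path.isdir(os.path.join(base_dir, name))
    ]
    return missing_exe, missing_dir


if __name__ == "__main__":
    # Assumed names; adjust them to match what is installed on your system.
    exes = ["clustalo", "paup"]
    dirs = ["plmc-master", "MAESTRO_OSX_x64", "p2rank_2.2"]
    missing_exe, missing_dir = find_missing(exes, dirs)
    print("Executables not found on PATH:", missing_exe)
    print("Directories not found in LYRUS directory:", missing_dir)
```

Run it from inside the LYRUS directory; two empty lists mean every name that was checked was found.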

Running Instructions

Clone this repository and run the following commands from within the cloned directory, using Python 3.7.4 or higher.

import os
from LYRUS.lyrusClass import lyrusClass, lyrusPredict

gene = 'A1BG'
uniprot = 'P04217'
currDir = os.getcwd()
outputDir = '{}/test'.format(currDir)
try:
    os.mkdir(outputDir)
except FileExistsError:
    print('Output directory already exists')

#load model
lyrusModel = lyrusClass(gene, uniprot, outputDir, savFile=None)

#download orthologs from NCBI
lyrusModel.getFasta()

#download PDB from SWISS-MODEL
lyrusModel.getPDB()

#calculate all the parameters except for fathmm
lyrusModel.getParameters(maestroDir='MAESTRO_OSX_x64',p2rankDir='p2rank_2.2')

The fathmmFile should contain the output from FATHMM. To get the FATHMM output, go to http://fathmm.biocompute.org.uk/inherited.html and run using the fathmmInput.txt available in the output directory.

fathmmFile = 'test/fathmm.txt'

#calculate lyrus probability
lyrusPredict(gene, fathmmFile, outputDir, uniprot)

Alternative running instructions using lyrus.py

$ python lyrus.py -i <inputFile> -o <outputDir> -f <fathmmFile>

The inputFile should contain two columns:

  1. UniProt ID
  2. Single amino acid variant: [aa_ref][aa_pos][aa_var]

Example inputFile:

Q9NQZ7 V363G
P11245 E203D
Q6XZF7 R1101Q
B1AL17 A139V
Q9NTN9-2 R423H
Q92887 T486I
............
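The two-column format above can be checked before submitting a file to LYRUS or FATHMM. This is a minimal sketch under stated assumptions: the function name parse_line is hypothetical, columns are assumed to be separated by whitespace (as in the example), and variants are assumed to use the 20 standard one-letter amino acid codes. The UniProt ID is passed through unchecked, so isoform IDs such as Q9NTN9-2 are accepted.

```python
import re

# [aa_ref][aa_pos][aa_var], e.g. V363G
SAV_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_line(line):
    """Parse one inputFile line into (uniprot_id, aa_ref, aa_pos, aa_var)."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError("expected 2 columns, got {}: {!r}".format(len(fields), line))
    uniprot_id, sav = fields
    match = SAV_PATTERN.match(sav)
    if match is None:
        raise ValueError("variant is not of the form [aa_ref][aa_pos][aa_var]: {!r}".format(sav))
    aa_ref, aa_pos, aa_var = match.group(1), int(match.group(2)), match.group(3)
    if aa_ref not in AMINO_ACIDS or aa_var not in AMINO_ACIDS:
        raise ValueError("unknown amino acid code in {!r}".format(sav))
    return uniprot_id, aa_ref, aa_pos, aa_var


if __name__ == "__main__":
    for example in ["Q9NQZ7 V363G", "Q9NTN9-2 R423H"]:
        print(parse_line(example))
```

A malformed line raises ValueError with a message naming the offending text, which makes it easy to locate bad rows in a long input file.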

The outputDir should be the full path to the directory where the outputs will be stored.

The fathmmFile should contain the output from FATHMM. To get the FATHMM output, go to http://fathmm.biocompute.org.uk/inherited.html and run using the inputFile.

Other data files

The data folder, which contains pre-computed variation numbers and EVMutation scores (computed using the same orthologs as the variation number, so they differ from those provided by the Marks Lab at https://marks.hms.harvard.edu/evmutation/downloads.html), can be downloaded from https://drive.google.com/drive/folders/1bFMi78D4LqjGMDZiP_X6OzBBcsttSoSy?usp=sharing. If you decide to use the pre-computed scores, place the data folder in the LYRUS directory.

Output Files:

  • LYRUS_input.csv contains the calculated feature values, which can include NaN (missing) values
  • LYRUS_imputed.csv contains the imputed feature values
  • LYRUS_prediction.csv contains prediction results
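To see how much imputation was needed, the missing entries in LYRUS_input.csv can be counted per column. The sketch below uses only the standard library and makes assumptions about the file layout that are not confirmed here: that the first row is a header, and that missing values appear either as empty cells or as the literal text nan. The function name count_missing is hypothetical.

```python
import csv


def count_missing(csv_path):
    """Return {column_name: number_of_missing_cells} for a CSV with a header row."""
    missing = {}
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        for name in reader.fieldnames or []:
            missing[name] = 0
        for row in reader:
            for name, value in row.items():
                if name is None:
                    continue  # cells beyond the header width are ignored
                # Assumption: empty cells and the text "nan" both mean missing.
                if value is None or value.strip().lower() in ("", "nan"):
                    missing[name] += 1
    return missing


if __name__ == "__main__":
    import sys
    # Usage: python count_missing.py <outputDir>/LYRUS_input.csv
    if len(sys.argv) == 2:
        for column, count in count_missing(sys.argv[1]).items():
            print("{}: {}".format(column, count))
```

Columns with a nonzero count are the ones whose values differ between LYRUS_input.csv and LYRUS_imputed.csv.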
