FRAGSYS

This repository contains the fragment screening analysis pipeline (FRAGSYS) used for the analysis in our manuscript Classification of likely functional class for ligand binding sites identified from fragment screening.

FRAGSYS, our pipeline for the analysis of binding sites, can be executed from the Jupyter notebook running_fragsys.ipynb. The input for this pipeline is a table containing a series of PDB codes and their respective UniProt accession identifiers.

Installation

For complete installation instructions, refer to the INSTALL file.

Pipeline methodology

To run FRAGSYS, open the Jupyter notebook running_fragsys.ipynb. There you can run the pipeline interactively by calling main(main_dir, prot, panddas) in the appropriate environment, varalign_env.

Here, main_dir is the directory where the output will be saved, prot is the query protein, and panddas is a pandas dataframe that must contain at least two columns, entry_uniprot_accession and pdb_id, for all protein structures in the data set.
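As a minimal sketch of preparing these inputs, the snippet below builds a small panddas dataframe and checks that the two required columns are present before the pipeline is called. The helper check_panddas, the output directory and the two PDB codes are illustrative placeholders, and the import path from fragsys_main import main is an assumption based on the file descriptions in this README, so the call itself is left commented out.

```python
import pandas as pd

# Columns that the README says panddas must contain at least.
REQUIRED_COLUMNS = ("entry_uniprot_accession", "pdb_id")


def check_panddas(panddas):
    """Return panddas unchanged, or raise ValueError if a required column is missing.

    Hypothetical helper, not part of FRAGSYS.
    """
    missing = [col for col in REQUIRED_COLUMNS if col not in panddas.columns]
    if missing:
        raise ValueError(f"panddas is missing required columns: {missing}")
    return panddas


# Placeholder rows, one per protein structure in the data set.
panddas = check_panddas(pd.DataFrame({
    "entry_uniprot_accession": ["P0DTD1", "P0DTD1"],
    "pdb_id": ["5r7y", "5r7z"],
}))

main_dir = "./fragsys_output"  # hypothetical output directory
prot = "P0DTD1"                # query protein (UniProt accession)

# Assumed import path; run inside varalign_env from the repository root:
# from fragsys_main import main
# main(main_dir, prot, panddas)
```

Validating the table up front is a design choice of this sketch: a missing column is reported immediately rather than partway through a long pipeline run.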

For another example, see the notebook running_fragsys_for_MPRO.ipynb, where we ran FRAGSYS for the main protease (MPro) of SARS-CoV-2 (P0DTD1).

For each structural segment of each protein in panddas, FRAGSYS will:

Download biological assemblies from PDBe
Structurally superimpose structures using STAMP
Get accessibility and secondary structure elements from DSSP via ProIntVar
Map PDB residues to UniProt using SIFTS
Obtain protein-ligand interactions running Arpeggio
Cluster ligands into binding sites using OC
Generate visualisation scripts for UCSF Chimera
Generate a multiple sequence alignment (MSA) with jackhmmer
Calculate Shenkin divergence score [1]
Calculate missense enrichment scores with VarAlign
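To make the divergence step more concrete, here is a minimal sketch of the Shenkin score for a single alignment column, using the formulation V = 6 × 2^S, where S is the Shannon entropy of the column in bits [1]. Ignoring gap characters is an assumption of this sketch, and the function names are hypothetical. This is not the FRAGSYS implementation, and any normalisation the pipeline applies afterwards is not shown.

```python
import math
from collections import Counter

# Characters treated as gaps and ignored (an assumption of this sketch).
GAP_CHARS = {"-", "."}


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    residues = [aa.upper() for aa in column if aa not in GAP_CHARS]
    if not residues:
        raise ValueError("column contains only gaps")
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def shenkin_score(column):
    """Shenkin divergence V = 6 * 2**S, with S the column's Shannon entropy."""
    return 6 * 2 ** shannon_entropy(column)


conserved = shenkin_score("AAAA")  # a single residue type gives the minimum, 6
variable = shenkin_score("AAGG")   # two equally frequent residue types
```

Under this formulation a fully conserved column scores 6, and the score grows with the number of residue types and the evenness of their frequencies.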

The final output of the pipeline consists of multiple tables for each structural segment collating the results from the different steps of the analysis for each residue, and for the defined ligand binding sites. These data include relative solvent accessibility (RSA), angles, secondary structure, PDB/UniProt residue number, alignment column, column occupancy, divergence score, missense enrichment score, p-value, etc.

These tables are concatenated into master tables, with data for all 37 structural segments, which form the input for the analyses carried out in the analysis notebooks.
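The concatenation into a master table can be sketched with pandas as follows. The function build_master_table, the added segment column, and the toy column names and values are all hypothetical placeholders. They only illustrate the idea and do not reflect the actual FRAGSYS output format.

```python
import pandas as pd


def build_master_table(segment_tables):
    """Concatenate per-segment tables, recording which segment each row came from.

    segment_tables: dict mapping a segment label to its DataFrame
    (hypothetical layout, not the FRAGSYS file structure).
    """
    frames = [
        table.assign(segment=label)  # hypothetical "segment" column
        for label, table in segment_tables.items()
    ]
    return pd.concat(frames, ignore_index=True)


# Toy per-residue tables; names and values are placeholders, not real results.
tables = {
    "segment_A": pd.DataFrame({"UniProt_ResNum": [10, 11], "RSA": [12.5, 80.0]}),
    "segment_B": pd.DataFrame({"UniProt_ResNum": [5], "RSA": [40.1]}),
}
master = build_master_table(tables)
```

Keeping a segment label on every row means downstream analyses can still group or filter by structural segment after the tables are merged.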

Refer to notebook 15 to predict RSA cluster labels for your binding sites of interest.

Dependencies

The pipeline and the whole of the analysis are run interactively in a series of Jupyter notebooks, found in the analysis folder.

Third party dependencies for these notebooks include:

Other standard python libraries:

For more information on the dependencies, refer to the .yml files in the envs directory. To install all the dependencies, refer to the installation manual.

Files

Apart from the INSTALL, LICENSE and README files, there are 5 other files in the main directory of this repository: two Python libraries, a configuration file and two notebooks.

fragsys_config.txt contains the default parameters to run FRAGSYS and it is read by fragsys.py.
fragsys.py contains all the functions, lists and dictionaries needed to run the pipeline.
fragsys_main.py contains the main FRAGSYS function, where all functions in fragsys.py are called. This script represents the pipeline itself.
running_fragsys.ipynb is the notebook where the pipeline is executed in an interactive way.
running_fragsys_for_MPRO.ipynb is the notebook where the pipeline is executed in an interactive way for a case study of SARS-CoV-2 MPro.

Directories

There are 6 directories in this repository.

`scripts`

This directory contains clean_pdb.py, a Python script grabbed from here. This script is used to pre-process the PDB files before running Arpeggio on them.

`envs`

The envs folder contains four .yml files describing the necessary packages and dependencies for the different parts of the pipeline and analysis.

arpeggio_env contains Arpeggio.
deep_learning_env contains the packages necessary for the machine learning in notebooks 11 and 12.
main_env supports all analysis notebooks except notebooks 11 and 12, in which the machine learning models are run.
varalign_env is needed to run FRAGSYS.

Citation

If you use FRAGSYS, please cite:

Utgés, J.S. et al. Classification of likely functional class for ligand binding sites identified from fragment screening. Commun Biol 7, 320 (2024). https://doi.org/10.1038/s42003-024-05970-8

References

[1] Shenkin PS, Erman B, Mastrandrea LD. Information-theoretical entropy as a measure of sequence variability. Proteins. 1991; 11(4):297–313. https://doi.org/10.1002/prot.340110408 PMID: 1758884.
