allydunham / dms_mutations

Preliminary analysis of a combined DMS dataset, including clustering and VEP benchmark

Preliminary deep mutational scanning analyses

This repo contains scripts and modules for an initial analysis of a combined dataset of deep mutational scanning (DMS) studies. The dataset later developed into the one released in Dunham & Beltrao (2021); the final analysis used in that paper can be found in my AA Subtypes repo. Preliminary versions of the deep landscape and amino acid subtypes analyses are included in this repo.

This version of the study processing pipeline is less automated: DMS data must be downloaded manually for each study and placed in the correct data subdirectory. The processing for each study is in bin/data_processing/process_raw_dm_data.R, which can be used to determine which file to download and where to store it. For most DMS analyses I would recommend downloading the processed data from Dunham & Beltrao (2021) instead. However, that data only includes SIFT4G and FoldX scores, whereas this pipeline can also run Envision, EVCouplings and PolyPhen2 on the data, which may be useful in some cases. Another approach would be to add those tools to the Snakemake pipeline in the newer analysis.
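Because the raw files are placed by hand, it can help to check what is still missing before running the processing scripts. The following is a minimal sketch of such a check, not part of this repo: the function name `find_missing_raw_data`, the `data` root, and the study and file names are all hypothetical. The real locations are the ones given in bin/data_processing/process_raw_dm_data.R.

```python
from pathlib import Path


def find_missing_raw_data(data_root, expected):
    """Return {study: [missing file names]} for studies lacking raw files.

    `expected` maps a study subdirectory name to the file names that should
    have been downloaded into it. The layout is an assumption made for this
    sketch, not the repo's documented structure.
    """
    root = Path(data_root)
    missing = {}
    for study, file_names in expected.items():
        # A file counts as present only if it exists as a regular file.
        absent = [name for name in file_names if not (root / study / name).is_file()]
        if absent:
            missing[study] = absent
    return missing


if __name__ == "__main__":
    # Hypothetical study directories and file names, for illustration only.
    expected_files = {
        "study_a": ["scores.csv"],
        "study_b": ["raw_counts.tsv"],
    }
    for study, names in find_missing_raw_data("data", expected_files).items():
        print(f"{study}: missing {', '.join(names)}")
```

Running a check like this before processing would list each study whose raw data has not yet been placed.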

In addition to the subtypes analysis, I used this data to perform a variant effect predictor benchmarking exercise, comparing predictor scores to the experimental results from deep mutational scanning. This analysis is included in my PhD thesis, and a similar analysis including more predictors has since been published by Livesey and Marsh (2020).
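One common way to compare predictor scores with experimental scores is a rank correlation, which does not require the two scales to match. The sketch below computes Spearman's rho in plain Python. It is an illustration of the general idea, not necessarily the metric or code used in this repo, and the function names and example values are made up for the illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy values for five variants, invented for this example.
    dms_scores = [-2.1, -0.3, 0.0, -1.5, -0.1]
    predictor_scores = [0.01, 0.40, 0.90, 0.05, 0.60]
    print(spearman(dms_scores, predictor_scores))
```

Working on ranks means predictors with very different score ranges can be compared against the same DMS measurements. The sign of the correlation depends on whether a predictor scores damaging variants high or low.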

The repo is structured as follows:

  • bin:
    • variant_effect_pipeline - Scripts to run variant effect predictors on DMS data.
    • data_processing - Scripts to process raw DMS data into standardised files.
    • analysis - Scripts analysing the data.
  • meta - Includes a description of the deep mutational scanning file type generated by the data pipeline and details of the FoldX force-field.
  • src - Modules required by the scripts in bin, including common functions and classes.
