Mean Dimension of Generative Models for Protein Sequences

This repository contains the code for reproducing all plots in the paper.

Setup

Clone repository and enter the directory

Make sure git-lfs is installed by running

$ git lfs install --skip-repo

in the shell.

Then, execute

$ git clone git@github.com:christophfeinauer/ProteinMeanDimension.git && cd ProteinMeanDimension

This downloads the code and the data for the analysis of the four mutational datasets. The MSAs contained here are a subset of the DeepSequence MSAs (see below) with some additional information about mutational effects.

Create Julia Environment

$ julia --project=. -e 'using Pkg; Pkg.instantiate(;verbose=true)'

Download Data

Run

bash get_data.sh

to get the DeepSequence alignments. These are used to calculate the mean dimension for all DeepSequence alignments.

Train ArDCA

The instructions below cover both the lambdaJ sweep and the calculation of the mean dimension based on the DeepSequence alignments. If you need only one of the two, skip the steps for the other.

Enter the ardca folder and start Julia (replace 32 with the number of threads you want to use):

cd ardca && julia -t 32

In Julia, activate the environment and include the code:

julia> using Pkg; Pkg.activate("..")
julia> include("ardca.jl")

Run a logarithmic sweep over values of lambdaJ (the default arguments direct the function to the correct folders):

julia> train_folder_loglambdaJsweep()

The models are placed in the models folder.

Attention: This results in about 44 gigabytes of model files.

Run on all DeepSequence alignments (the default arguments direct the function to the correct folders):

julia> train_folder()

The models are placed in the models folder.

Create samples:

julia> include("sample.jl")
julia> create_samples_folder("./models")

If you also want the Spearman correlation for the DMS datasets, run

julia> calculate_sr()

Train Variational Autoencoder

Calculating the mean dimension on a new model

To calculate the mean dimension for a new model, you can use the code in mean_dimension_from_samples.py. It expects an HDF5 file with two datasets: samples, which should contain the protein sequences with amino acids mapped to integer indices, and logp, which should contain the corresponding log probabilities for the samples. For the models used in the paper (ArDCA and VAE), the scripts in ardca/ and vae/ produce these files.

The layout of these datasets is a bit intricate because the calculation of the log probability and the evaluation of the mean dimension are decoupled in this code to make it more efficient.

The mean dimension is based on estimating the contribution of single positions to the variance of the log probability under the uniform distribution. The input to the code is therefore the log probability of sequences where single amino acids have been exchanged.
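As a rough orientation, the quantity described here can be written in the standard total-effect form from sensitivity analysis. This is a sketch under assumptions, not necessarily the exact estimator implemented in mean_dimension_from_samples.py: f denotes the log probability, x is a uniformly random sequence of length N over q symbols, x^{(i)} is x with position i resampled independently and uniformly, and x_{i \to a} is x with position i set to a. The symbols \tau_i^2, \sigma^2 and D are notation introduced for this sketch.

```latex
\tau_i^2
  = \frac{1}{2}\,\mathbb{E}\!\left[\bigl(f(x) - f(x^{(i)})\bigr)^2\right]
  = \frac{1}{2q^2}\sum_{a=1}^{q}\sum_{b=1}^{q}
    \mathbb{E}\!\left[\bigl(f(x_{i\to a}) - f(x_{i\to b})\bigr)^2\right],
\qquad
D = \frac{\sum_{i=1}^{N}\tau_i^2}{\sigma^2},
\quad
\sigma^2 = \operatorname{Var}\bigl[f(x)\bigr].
```

Under this form, D equals 1 for a purely additive f and grows as interactions between positions contribute more to the variance.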

For each of the N positions, the calculation of the mean dimension is based on nsamples_per_position samples. The dataset samples should therefore be of size (nsamples_per_position, N, q, N), where q is the number of possible amino acids (typically q=21) and (:, i, :, :) are the sequences used for estimating the contribution of position i, where (m, i, a, :) is a single sequence of length N. For a given index m, the sequences (m, i, :, :) should differ only in position i, and (m, i, a, :) should contain an a in position i. The code then uses comparisons between (m, i, a, :) and (m, i, b, :) for calculating the mean dimension.

The dataset logp should be of size (nsamples_per_position, N, q) and contain at index (m, i, a) the log probability in the model of the sequence in (m, i, a, :).

To make the layout concrete, assume that a function get_logp(seq) calculates the log probability of a sequence in the model. The following pseudocode then illustrates how to create the datasets:

logp = zeros(Float64, nsamples_per_position, N, q)
samples = zeros(Int, nsamples_per_position, N, q, N)
for i in 1:N
    for m in 1:nsamples_per_position
        # draw a uniformly random background sequence
        seq = rand(1:q, N)
        for a in 1:q
            # place amino acid a at position i, keep all other positions fixed
            seq[i] = a
            logp[m, i, a] = get_logp(seq)
            samples[m, i, a, :] = seq
        end
    end
end
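Given a logp array laid out as described above, the pairwise comparisons can be turned into a mean dimension estimate along the following lines. This is a minimal, dependency-free Python sketch and not the code in mean_dimension_from_samples.py: the function names total_effects and mean_dimension, the nested-list representation, the 0-based indices, and the additive toy model h are all assumptions made for illustration, and the estimator is the standard total-effect form (sum of per-position total effects divided by the variance).

```python
import random
from itertools import combinations


def total_effects(logp):
    """Return one total-effect estimate per position from logp[m][i][a]."""
    M, N, q = len(logp), len(logp[0]), len(logp[0][0])
    taus = []
    for i in range(N):
        acc = 0.0
        for m in range(M):
            row = logp[m][i]  # log probabilities of the q variants at position i
            # Sum over unordered pairs a < b. The ordered-pair sum is twice this,
            # which cancels the factor 1/2 in the total-effect definition.
            acc += sum((row[a] - row[b]) ** 2 for a, b in combinations(range(q), 2))
        taus.append(acc / (M * q * q))
    return taus


def mean_dimension(logp):
    """Sum of total effects divided by the variance over all logp entries."""
    flat = [v for block in logp for row in block for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    return sum(total_effects(logp)) / var


# Toy check with a hypothetical additive score f(seq) = sum_i h[i][seq[i]].
# It is not a normalized log probability, but an additive function has
# mean dimension 1, which makes it a convenient sanity check.
rng = random.Random(0)
N, q, M = 5, 4, 200
h = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(N)]


def get_logp(seq):
    return sum(h[i][s] for i, s in enumerate(seq))


logp = [[[0.0] * q for _ in range(N)] for _ in range(M)]
for i in range(N):
    for m in range(M):
        seq = [rng.randrange(q) for _ in range(N)]  # uniform background sequence
        for a in range(q):
            seq[i] = a
            logp[m][i][a] = get_logp(seq)

print(mean_dimension(logp))  # close to 1 for the additive toy model
```

The variance here is estimated from all entries of logp, which works because every stored sequence is uniformly distributed once all q values at position i are included. For real data, array libraries would replace the nested lists.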

The code in mean_dimension_from_samples.py uses samples only for consistency checks. If you are sure that the layout is correct, you can comment these checks out and pass only logp.
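Before disabling the checks, it can help to verify the layout independently. The sketch below shows what such a check might look like. It is not taken from mean_dimension_from_samples.py: the function check_layout, its offset parameter (0 for amino acids indexed 0..q-1, 1 for 1..q), and the nested-list representation are assumptions made for illustration.

```python
import copy
import random


def check_layout(samples, logp, offset=0):
    """Raise ValueError if samples[m][i][a] violates the layout described above."""
    M, N, q = len(logp), len(logp[0]), len(logp[0][0])
    if len(samples) != M:
        raise ValueError("samples and logp disagree on nsamples_per_position")
    for m in range(M):
        if len(samples[m]) != N:
            raise ValueError(f"samples[{m}] should have {N} position blocks")
        for i in range(N):
            block = samples[m][i]  # q sequences that should differ only at position i
            if len(block) != q:
                raise ValueError(f"samples[{m}][{i}] should hold {q} sequences")
            ref = block[0]
            for a in range(q):
                seq = block[a]
                if len(seq) != N:
                    raise ValueError(f"sequence ({m}, {i}, {a}) should have length {N}")
                if seq[i] != a + offset:
                    raise ValueError(f"sequence ({m}, {i}, {a}) lacks symbol {a + offset} at position {i}")
                if any(seq[j] != ref[j] for j in range(N) if j != i):
                    raise ValueError(f"sequence ({m}, {i}, {a}) differs outside position {i}")


# Build a small valid example (0-based symbols) and a corrupted copy.
rng = random.Random(1)
M, N, q = 3, 4, 5
samples = [[[None] * q for _ in range(N)] for _ in range(M)]
logp = [[[0.0] * q for _ in range(N)] for _ in range(M)]
for i in range(N):
    for m in range(M):
        seq = [rng.randrange(q) for _ in range(N)]
        for a in range(q):
            seq[i] = a
            samples[m][i][a] = list(seq)

check_layout(samples, logp)  # passes silently

broken = copy.deepcopy(samples)
broken[0][1][2][0] = (broken[0][1][2][0] + 1) % q  # alter a position other than i=1
try:
    check_layout(broken, logp)
    detected = False
except ValueError:
    detected = True
print(detected)
```

The check compares every sequence in a block against the first one, so a single altered symbol outside position i is enough to trigger an error.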
