Evolutionary Scale Modeling (ESM)

Pretrained language models for proteins

This repository contains a PyTorch implementation of and pre-trained weights for the transformer protein language models in "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019) from Facebook AI Research:

@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v3},
  journal={bioRxiv}
}

Quickstart

As a prerequisite, you must have PyTorch 1.5 or later installed to use this repository. A CUDA device is optional and will be auto-detected.

You can either work in the root of this repository, or use this one-liner for installation:

$ pip install git+https://github.com/facebookresearch/esm.git

Then, you can load and use a pretrained model as follows:

import torch
import esm

# Load 34 layer model
model, alphabet = esm.pretrained.esm1_t34_670M_UR50S()
batch_converter = alphabet.get_batch_converter()

# Prepare data (two protein sequences)
data = [("protein1", "MYLYQKIKN"), ("protein2", "MNAKYD")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract per-residue embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[34])
token_embeddings = results["representations"][34]

# Generate per-sequence embeddings via averaging
# NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
sequence_embeddings = []
for i, (_, seq) in enumerate(data):
    sequence_embeddings.append(token_embeddings[i, 1:len(seq) + 1].mean(0))
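As a quick follow-up (not part of the original quickstart), you can stack these per-sequence embeddings and compare sequences, for example with cosine similarity; this minimal sketch uses only standard PyTorch operations:

# Stack per-sequence embeddings into a (num_sequences, hidden_dim) tensor
embedding_matrix = torch.stack(sequence_embeddings)

# Cosine similarity between the two example sequences
similarity = torch.nn.functional.cosine_similarity(
    embedding_matrix[0], embedding_matrix[1], dim=0
)
print(f"cosine similarity: {similarity.item():.3f}")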

We also support PyTorch Hub, which removes the need to clone and/or install this repository yourself:

import torch

model, alphabet = torch.hub.load("facebookresearch/esm", "esm1_t34_670M_UR50S")

FASTA embedding extractor

For your convenience, we have provided a script that efficiently extracts embeddings in bulk from a FASTA file:

# Extract final-layer embedding for a FASTA file from a 34-layer model
$ python extract.py esm1_t34_670M_UR50S examples/some_proteins.fasta my_reprs/ \
    --repr_layers 0 32 34 --include mean per_tok



# my_reprs/ now contains one ".pt" file per FASTA sequence; use torch.load() to load them.
# extract.py has flags that determine what's included in the ".pt" file:
# --repr_layers (default: final only) selects which layers to include embeddings from.
# --include specifies which embeddings to save. You can use the following:
# * per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
# * mean includes the embeddings averaged over the full sequence, per layer.
# * bos includes the embeddings from the beginning-of-sequence token.
#   (NOTE: Don't use with the pre-trained models - we trained without bos-token supervision.)
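As an illustration of reading the output back, here is a minimal sketch; the file name is made up, and the dictionary layout ("representations" and "mean_representations" keyed by layer index) is an assumption about what extract.py writes for the flags above:

import torch

# Load one per-sequence ".pt" file written by extract.py (file name is illustrative)
entry = torch.load("my_reprs/some_protein.pt")

# Assumed layout: "mean_representations" maps layer index -> (hidden_dim,) tensor,
# and "representations" maps layer index -> (seq_len, hidden_dim) tensor
mean_embedding = entry["mean_representations"][34]
per_token_embedding = entry["representations"][34]
print(mean_embedding.shape, per_token_embedding.shape)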

Tutorial

To help you get started, we provide a Jupyter notebook tutorial demonstrating how to train a variant predictor using embeddings from ESM. You can adopt a similar protocol to train a model for any downstream task, even with limited data. First, obtain the embeddings for examples/P62593.fasta either by downloading the precomputed embeddings as instructed in the notebook or by running the following:

# Obtain the embeddings
$ python extract.py esm1_t34_670M_UR50S examples/P62593.fasta examples/P62593_reprs/ \
    --repr_layers 34 --include mean

Then, follow the remaining instructions in the tutorial. You can also run the tutorial in a Colab notebook.
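For a rough sense of the downstream step (the notebook is the authoritative walkthrough), the sketch below loads the extracted mean embeddings and fits a simple regressor. The "mean_representations" key, the layer index, and the placeholder targets are assumptions, and scikit-learn is used here only as a convenient stand-in for whatever model you prefer:

from pathlib import Path

import numpy as np
import torch
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Collect the layer-34 mean embeddings written by extract.py
X = []
for pt_file in sorted(Path("examples/P62593_reprs/").glob("*.pt")):
    entry = torch.load(pt_file)
    X.append(entry["mean_representations"][34].numpy())
X = np.stack(X)

# Placeholder targets: replace with the real per-variant effect scores
# that accompany the P62593 dataset in the tutorial notebook.
y = np.random.rand(len(X))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regressor = Ridge().fit(X_train, y_train)
print("held-out R^2:", regressor.score(X_test, y_test))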

Available models

The following table lists the pretrained models available for use. See also Table 1 in the paper.

| Shorthand | Full Name | #layers | #params | Dataset | Embedding Dim | Perplexity/ECE | Model URL |
|-----------|-----------|---------|---------|---------|---------------|----------------|-----------|
| ESM1-main | esm1_t34_670M_UR50S | 34 | 670M | UR50/S | 1280 | 8.54 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt |
|           | esm1_t34_670M_UR50D | 34 | 670M | UR50/D | 1280 | 8.46 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt |
|           | esm1_t34_670M_UR100 | 34 | 670M | UR100 | 1280 | 10.32 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt |
|           | esm1_t12_85M_UR50S | 12 | 85M | UR50/S | 768 | 10.45 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt |
|           | esm1_t6_43M_UR50S | 6 | 43M | UR50/S | 768 | 11.79 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt |
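Each model can be loaded by its full name from the table, either through esm.pretrained (as in the Quickstart) or through PyTorch Hub. A brief sketch, assuming the loader functions follow the table's naming as they do for esm1_t34_670M_UR50S above:

import torch
import esm

# Load one of the smaller models by its full name
model, alphabet = esm.pretrained.esm1_t12_85M_UR50S()

# Equivalently, via PyTorch Hub
model, alphabet = torch.hub.load("facebookresearch/esm", "esm1_t12_85M_UR50S")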

Comparison to related work

This table compares ESM to related pre-training methods, and corresponds to Table 8 in the paper. The last three columns report the main benchmark results:

  • RH: Remote homology detection at the fold level, using the Hit-10 metric on SCOP.
  • SSP: Secondary structure prediction, Q8 accuracy on CB513.
  • Contact: Top-L long-range contact precision on the RaptorX test set from Wang et al. (2017).
| Model | Pre-training | Params | RH | SSP | Contact |
|-------|--------------|--------|------|------|---------|
| UniRep |  | 18M | .527 | 58.4 | 21.9 |
| SeqVec |  | 93M | .545 | 62.1 | 29.0 |
| TAPE |  | 38M | .581 | 58.0 | 23.2 |
| LSTM biLM (S) | UR50/S | 28M | .558 | 60.4 | 24.1 |
| LSTM biLM (L) | UR50/S | 113M | .574 | 62.4 | 27.8 |
| Transformer-6 | UR50/S | 43M | .653 | 62.0 | 30.2 |
| Transformer-12 | UR50/S | 85M | .639 | 65.4 | 37.7 |
| Transformer-34 | UR100 | 670M | .599 | 64.3 | 32.7 |
| Transformer-34 | UR50/S | 670M | .639 | 69.2 | 50.2 |

Performance on TAPE benchmark

We evaluated our best-performing model on the TAPE benchmark (Rao et al., 2019), finding that our neural embeddings perform similarly to or better than alignment-based methods.

| Model | SS3 | SS8 | Remote homology | Fluorescence | Stability | Contact |
|-------|------|------|-----------------|--------------|-----------|---------|
| ESM (best neural) | 0.82 | 0.67 | 0.33 | 0.68 | 0.71 | (0.61)* |
| TAPE (best neural) | 0.75 | 0.59 | 0.26 | 0.68 | 0.73 | 0.4 |
| TAPE (alignment) | 0.8 | 0.63 | 0.09 | N/A | N/A | 0.64 |

* Not comparable: ESM (best neural) uses a linear projection on the features (the contact head available in the PyTorch version of TAPE), but the results from the TAPE paper use a ResNet head. See the previous table for a rigorous comparison of ESM and TAPE in a fair benchmarking setup.

Reference

If you find the model useful in your research, we ask that you cite the following paper:

@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v3},
  journal={bioRxiv}
}

Additionally, much of this code hails from the excellent fairseq sequence modeling framework; we have released this standalone model to facilitate more lightweight and flexible usage. We encourage those who wish to pretrain protein language models from scratch to use fairseq.

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
