Merck / BioPhi

BioPhi is an open-source antibody design platform. It features methods for automated antibody humanization (Sapiens), humanness evaluation (OASis) and an interface for computer-assisted antibody sequence design.

Home Page: https://biophi.dichlab.org/


Use of Sapiens model for antibody representation?

ScubaChris opened this issue

Greetings, and thank you for your work!

I have been searching for a transformer model trained on antibody sequences so that I can extract learned representations of my dataset, and this repo was suggested to me.

After tokenizing a sequence using the dictionary.encode_line method, I call the sentence_encoder to get the encoding of my sequence. In your opinion, for downstream tasks, should I ideally be working with the 'encoder_embedding' values of the output dictionary, or with the 'encoder_out' values?

More importantly, is there a built-in method that I can use to get reconstructed sequences back? I wrote a linear decoder that tries to reconstruct the 'encoder_out' values back to the original tokenized sequences, and although it does a somewhat decent job, it is not good enough for my downstream tasks.

Thank you for your time!

Hi @ScubaChris, thanks! Although we are interested in the embeddings, we haven't looked into them much yet, only into the attention, so I'm not sure about encoder_out vs encoder_embedding.

As for reconstructing sequences, our model is not an autoencoder - it is not trained to reconstruct the input. It is trained on human sequences using masked language modeling, so you could say that it is trained to reconstruct and de-noise (and therefore humanize) the input. The single default RoBERTa head that ships with the Sapiens model does exactly that: it takes the embedding and predicts the humanized form of the input sequence. So if you don't mind getting a humanized sequence back, you can already use that classification head.
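As a rough sketch of that last step - turning per-residue language-model logits into a "humanized" sequence by taking the argmax token at each position. The vocabulary, shapes, and random logits below are purely illustrative stand-ins, not the actual Sapiens tokenizer or model output:

```python
import numpy as np

# Toy vocabulary standing in for the model's amino-acid tokens (illustrative only).
vocab = list("ACDEFGHIKLMNPQRSTVWY")

rng = np.random.default_rng(0)
seq_len = 8

# Pretend these are the per-residue logits from the RoBERTa LM head:
# one row per input residue, one column per vocabulary token.
logits = rng.normal(size=(seq_len, len(vocab)))

# The "humanized" sequence is the highest-scoring token at each position.
humanized = "".join(vocab[i] for i in logits.argmax(axis=1))

print(humanized)  # a seq_len-character amino-acid string
```

The point is only that the LM head already maps embeddings back to residue tokens, so no extra decoder is needed if a humanized (rather than exactly reconstructed) sequence is acceptable.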

If you wanted to get the original input sequence back, you would probably need to train a new classification head that predicts a class for each token, and you would need to freeze the encoder so that you only train the new head and not the full network. I'm not sure how well this is currently supported in fairseq.
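In plain PyTorch, the frozen-encoder setup looks roughly like this. The encoder here is a dummy stand-in (the dimensions, names, and architecture are illustrative, not the actual Sapiens/fairseq modules); the pattern of freezing parameters and training only a per-token head is what carries over:

```python
import torch
import torch.nn as nn

# Stand-ins for the real model: a frozen "encoder" producing 128-d residue
# vectors, and a new trainable head predicting one class per token.
# (Dimensions and names are illustrative, not the actual Sapiens/fairseq API.)
EMBED_DIM, VOCAB = 128, 25

encoder = nn.Sequential(nn.Embedding(VOCAB, EMBED_DIM),
                        nn.Linear(EMBED_DIM, EMBED_DIM))
head = nn.Linear(EMBED_DIM, VOCAB)  # per-token classification head

# Freeze the encoder so only the head is trained.
for p in encoder.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(head.parameters(), lr=1e-3)

tokens = torch.randint(0, VOCAB, (1, 10))   # one sequence of 10 residues
with torch.no_grad():
    feats = encoder(tokens)                 # (1, 10, 128) residue vectors
logits = head(feats)                        # (1, 10, VOCAB): one class per token
loss = nn.functional.cross_entropy(logits.view(-1, VOCAB), tokens.view(-1))
loss.backward()                             # gradients flow only into the head
opt.step()
```

Only the head's parameters receive gradients, so the pretrained representation stays fixed while the new head learns the token-level reconstruction task.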

Also keep in mind that the representation is more of a "residue representation" than a "sequence representation", since it was trained on a sequence-to-sequence task and not a sequence-to-class task. So if you want to get just a single vector for the whole sequence, you would need to figure out how to get it from the residue vectors (e.g. mean). Taking the BOS/EOS token representation won't work, because we're not using that token to train the classifier (as was the case in BERT).
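The mean-over-residues idea from above in one line, with made-up numbers (115 residues, 128-d vectors) standing in for real encoder output:

```python
import numpy as np

# Suppose the encoder gave us one 128-d vector per residue of a
# 115-residue sequence (numbers are illustrative, not real model output).
residue_vectors = np.random.default_rng(0).normal(size=(115, 128))

# One simple fixed-size sequence representation: the mean over residues.
seq_vector = residue_vectors.mean(axis=0)

print(seq_vector.shape)  # (128,)
```

Whatever the sequence length, the pooled vector always has the embedding dimension, which is what makes it usable as a fixed-size input to a downstream classifier.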

Thank you very much for your reply!
I already freeze the encoder, and my decoder does exactly that: it predicts a class for each token, although, as I said, it's just a linear block for now. Since the per-residue encoder output dimension is not that large (128, if I remember correctly), I just concatenate the residue vectors to get the full sequence representation. Using that representation with a downstream binary classifier (two target antigens) gives better results than using the mean of the residue vectors.

I will see about training a more sophisticated decoder, and perhaps it will work well enough eventually!

Sounds exciting, keep us posted :)

About concatenating the vectors: if I understand your approach correctly, this might have some caveats. We don't align the sequences, so you will get a variable number of feature dimensions depending on the length of the sequence (and there will be some whitespace token vectors at the end, after the EOS token). So the nth block of the 128 feature vectors corresponds to the nth residue in raw numbering, which might not be what you want.
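One way to at least make the concatenated representation a fixed size is to pad or truncate every sequence to the same residue count before flattening. This only fixes the dimensionality problem, not the alignment problem described above, and MAX_LEN is an arbitrary choice here:

```python
import numpy as np

EMBED_DIM = 128
MAX_LEN = 140  # fixed residue count to pad/truncate to (arbitrary choice)

def concat_features(residue_vectors, max_len=MAX_LEN):
    """Pad with zeros (or truncate) to max_len residues, then flatten,
    so every sequence yields the same number of feature dimensions."""
    n, dim = residue_vectors.shape
    if n >= max_len:
        fixed = residue_vectors[:max_len]
    else:
        pad = np.zeros((max_len - n, dim))
        fixed = np.vstack([residue_vectors, pad])
    return fixed.reshape(-1)

rng = np.random.default_rng(0)
short = concat_features(rng.normal(size=(110, EMBED_DIM)))  # shorter sequence
long_ = concat_features(rng.normal(size=(150, EMBED_DIM)))  # longer sequence
print(short.shape, long_.shape)  # both (17920,) = 140 * 128
```

Feature block n still corresponds to raw residue position n, so sequences of different lengths (or with insertions/deletions) will not have equivalent residues in the same feature slots; fixing that would require aligning or renumbering the sequences first.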