instadeepai / nucleotide-transformer

🧬 Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

Home Page: https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2

Inquiry Regarding Details of Section A.5.4

yangzhao1230 opened this issue

I am particularly intrigued by the experiments outlined in section A.5.4, which focuses on Functional Variant Prioritization. As I attempt to replicate this specific experiment, I have encountered some challenges and would greatly appreciate additional details to aid in my efforts. Specifically, I am interested in the following aspects:

  1. Embedding Extraction:

Could you please clarify from which layer of the Transformer the embeddings are extracted?

  2. Similarity Calculation:

In the calculation of similarity, is it based solely on the embeddings of tokens that have undergone mutations, or does it encompass the similarity of embeddings for the entire sequence?

  3. Binary Similarity Threshold:

What threshold value is employed for binary similarity in the two-class classification? Understanding this threshold is crucial for my replication efforts.

I have observed that the similarity between sequences with severe mutations tends to be exceptionally high (exceeding 0.999). To gain a deeper understanding and enhance the reproducibility of this experiment, I would be grateful for any additional insights or details you could provide.

Sorry for the late reply, @yangzhao1230

Regarding embedding extraction: we extracted embeddings from layers 12, 16, 21, 24, and 32. For the results shown in the figures, we used, for each score separately, the layer that gave the highest performance.
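For anyone replicating this, the per-score layer choice could be sketched roughly as below. This is an illustration under assumptions, not the authors' code: `best_layer`, `scores_by_layer`, and `metric` are hypothetical names, and `metric` stands for whatever performance measure is used (for example AUROC).

```python
def best_layer(scores_by_layer, labels, metric):
    """Return the layer with the highest metric value, plus all per-layer results.

    scores_by_layer: dict mapping a layer number to a list of per-variant scores
                     (hypothetical structure, one list per extracted layer).
    labels:          list of 0/1 labels, aligned with each score list.
    metric:          callable (labels, scores) -> float, higher is better.
    """
    results = {
        layer: metric(labels, scores)
        for layer, scores in scores_by_layer.items()
    }
    # Pick the layer whose scores give the best performance.
    best = max(results, key=results.get)
    return best, results
```

This would be run once per score type (for example once for cosine similarity), so different scores can end up using different layers.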

Regarding similarity calculation: we used the embeddings of the token containing the mutation.
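A minimal sketch of a token-level score of this kind, using cosine similarity as the example, is given below. It rests on several assumptions that are not confirmed in this thread: the variant is a single-nucleotide substitution (so reference and alternate sequences tokenize identically), tokens are non-overlapping k-mers with `k=6`, and one special token is prepended to the sequence. The function names are hypothetical, and the embeddings are assumed to be already extracted from one layer as a list of per-token vectors.

```python
import math


def mutated_token_index(variant_pos, k=6, num_prefix_tokens=1):
    """Map a 0-based nucleotide position to the index of its token.

    Assumes non-overlapping k-mer tokens and `num_prefix_tokens` special
    tokens at the start of the sequence (both are assumptions).
    """
    return variant_pos // k + num_prefix_tokens


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def variant_score(ref_embeddings, alt_embeddings, variant_pos,
                  k=6, num_prefix_tokens=1):
    """Compare reference and alternate embeddings at the mutated token only."""
    idx = mutated_token_index(variant_pos, k, num_prefix_tokens)
    return cosine_similarity(ref_embeddings[idx], alt_embeddings[idx])
```

Restricting the comparison to one token, rather than pooling over the whole sequence, keeps the many unchanged tokens from diluting the difference between the two sequences.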

Regarding the similarity threshold: for our ROC analyses we used the scores as is and didn't apply a cutoff to classify the variants. Let me know if that isn't what you were referring to.
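To illustrate why no cutoff is needed, the area under the ROC curve can be computed directly from raw scores: it equals the probability that a randomly chosen positive variant scores higher than a randomly chosen negative one. The sketch below is a plain pairwise implementation of that definition, not the authors' code; `roc_auc` is a hypothetical helper, and labels are assumed to be 1 for functional variants and 0 otherwise.

```python
def roc_auc(labels, scores):
    """Threshold-free AUROC: fraction of (positive, negative) pairs in which
    the positive has the higher score, with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Because only the ordering of the scores matters, similarities that all sit above 0.999 can still separate the two classes, as long as functional variants tend to rank differently from the rest. If a higher similarity is expected to indicate a weaker effect, the score can be negated (or `1 - similarity` used) before calling the function.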

Hope this helps.