OATML / EVE

Official repository for the paper "Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning". Joint collaboration between the Marks lab and the OATML group.

Home Page: http://evemodel.org/

Question about example data

zhangnan0107 opened this issue

Hi all,

I recently read the EVE paper and have two questions.

The first concerns the PTEN alignment file. Is this file the alignment before preprocessing, or the result after removing inadequate fragments and columns? I ask because all of the sequences in the alignment have the same length, which would be unusual for an initial a2m result from jackhmmer (please correct me if I am wrong).

Second, I am a bit confused by the part of the paper describing the 0.3 bit/residue reference. Did you mean using 0.3 multiplied by the length of the target sequence as the starting value for the jackhmmer parameters -T, --domT, --incT, and --incdomT? If so, there seem to be sequences (e.g. UniRef100_A0A4U5VQ93) that do not satisfy the condition Lcov >= 0.7L (for PTEN this should be 0.7*403=282, but UniRef100_A0A4U5VQ93 has only 204 valid residues). The total number of sequences in the alignment is also less than 10L. I guess this is because the 10L requirement is automatically ignored when the 0.7L threshold is used?
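
To make the arithmetic in this question concrete, here is a minimal Python sketch. It assumes the interpretation asked about above (bitscore threshold = bits per residue times target length) and assumes that "valid residues" means uppercase match-state characters in an a2m row. The function names are hypothetical and are not taken from the EVE code.

```python
def bitscore_threshold(bits_per_residue: float, target_length: int) -> float:
    """Absolute bitscore threshold, assuming it scales with target length."""
    return bits_per_residue * target_length


def aligned_residue_count(aligned_seq: str) -> int:
    """Count uppercase (match-state) residues in one a2m row.

    Gaps ('-', '.') and lowercase insertions are not counted.
    """
    return sum(1 for c in aligned_seq if c.isalpha() and c.isupper())


def meets_coverage(aligned_seq: str, target_length: int, min_fraction: float) -> bool:
    """True if the row covers at least min_fraction of the target length."""
    return aligned_residue_count(aligned_seq) >= min_fraction * target_length


PTEN_LENGTH = 403  # target length quoted in the question

# Roughly 120.9 bits for a 0.3 bit/residue setting.
print(round(bitscore_threshold(0.3, PTEN_LENGTH), 1))

# A toy row with 204 aligned residues, standing in for the sequence
# mentioned in the question (not the real sequence).
toy_row = "A" * 204 + "-" * (PTEN_LENGTH - 204)
print(meets_coverage(toy_row, PTEN_LENGTH, 0.7))
```

With these assumptions, a row with 204 aligned residues falls short of 0.7 * 403, which is the discrepancy the question is about.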

Could you help me with the above questions? Thanks a lot.

Best,
Nan

Dear Nan,

Thank you for taking an interest in our work and sorry for the slow response.

Regarding your first question: yes, you are right about the jackhmmer flags. The PTEN file you are referring to is a preprocessed file. We have now added:

Regarding your second question: the condition used in the jackhmmer search is not that every sequence has Lcov >= 0.7L, but rather that each sequence has Lcov >= 0.5L. The way we obtained our multiple sequence alignments was to run jackhmmer for a range of bitscore thresholds. Then, given multiple options, we pick one based on criteria relating to coverage and the number of sequences in the alignment. You are correct that the number of sequences in the alignment being greater than 10L is a lower priority than Lcov >= 0.7L.
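
As a rough illustration of that selection step, the Python sketch below ranks candidate alignments from different bitscore thresholds, treating the coverage criterion as higher priority than the 10L depth criterion. This is only a sketch under stated assumptions: the `CandidateAlignment` fields, the tie-break on sequence count, and the example numbers are hypothetical, and the exact criteria used for the published alignments may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateAlignment:
    bits_per_residue: float  # bitscore setting used for this jackhmmer run
    lcov: int                # coverage statistic for the alignment (hypothetical field)
    n_seqs: int              # number of sequences in the alignment


def select_alignment(
    candidates: List[CandidateAlignment],
    target_length: int,
    min_cov_fraction: float = 0.7,
    min_seqs_per_residue: int = 10,
) -> CandidateAlignment:
    """Pick one candidate: coverage first, depth second, then most sequences."""
    if not candidates:
        raise ValueError("no candidate alignments given")

    def rank(c: CandidateAlignment):
        cov_ok = c.lcov >= min_cov_fraction * target_length
        depth_ok = c.n_seqs >= min_seqs_per_residue * target_length
        # Tuples compare left to right, so coverage outranks depth.
        return (cov_ok, depth_ok, c.n_seqs)

    return max(candidates, key=rank)


# Made-up candidates for a target of length 403.
candidates = [
    CandidateAlignment(bits_per_residue=0.1, lcov=250, n_seqs=9000),
    CandidateAlignment(bits_per_residue=0.3, lcov=300, n_seqs=1500),
    CandidateAlignment(bits_per_residue=0.5, lcov=310, n_seqs=800),
]
chosen = select_alignment(candidates, target_length=403)
print(chosen.bits_per_residue)
```

In this made-up example, the deepest alignment is passed over because it misses the coverage criterion, and a candidate with fewer than 10L sequences is selected instead.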

Finally, please note we have moved the official repo to the following address: https://github.com/OATML-Markslab/EVE.

Best regards,
The EVE team