Question about example data
zhangnan0107 opened this issue
Hi all,
I recently read the EVE paper and have two questions.
The first concerns the PTEN alignment file. Is this file the raw alignment before preprocessing, or the result after removing inadequate fragments and columns? I ask because all of the sequences in the alignment have the same length, which is unusual for an initial a2m result from jackhmmer (please correct me if I am wrong).
Second, I am a bit confused by the description of the 0.3 bits/residue reference in the paper. Did you mean using 0.3 multiplied by the length of the target sequence as the starting value for the jackhmmer parameters -T, --domT, --incT, and --incdomT? If so, there seem to be sequences (e.g. UniRef100_A0A4U5VQ93) that do not satisfy the condition Lcov >= 0.7L (for PTEN this should be 0.7*403=282, but the number of valid residues in UniRef100_A0A4U5VQ93 is only 204). The total number of sequences in the alignment is also less than 10L. I guess this is because the 10L requirement is automatically ignored when the 0.7L threshold is used?
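For reference, the arithmetic in this question can be written out as a small Python sketch. It assumes the reading proposed above (bits per residue multiplied by the target length gives the absolute score passed to each flag) and uses the PTEN length of 403 quoted in this post. The function names are hypothetical and are not part of EVE or HMMER.

```python
TARGET_LENGTH = 403  # PTEN length, as quoted in the question


def absolute_bitscore(bits_per_residue: float, target_length: int) -> float:
    """Convert a length-normalised threshold into an absolute bit score."""
    return bits_per_residue * target_length


def jackhmmer_threshold_args(bits_per_residue: float, target_length: int) -> list[str]:
    """Build the four score-threshold arguments named in the question.

    Assumption: the same absolute score is passed to every flag.
    """
    score = f"{absolute_bitscore(bits_per_residue, target_length):.1f}"
    args = []
    for flag in ("-T", "--domT", "--incT", "--incdomT"):
        args += [flag, score]
    return args


# Quantities mentioned in the question for a target of length L = 403.
min_covered = 0.7 * TARGET_LENGTH   # coverage requirement, 0.7L
min_sequences = 10 * TARGET_LENGTH  # depth requirement, 10L

print(jackhmmer_threshold_args(0.3, TARGET_LENGTH))
print(int(min_covered), min_sequences)
```

Under these assumptions, 0.3 bits/residue corresponds to an absolute score of about 120.9 for PTEN, the 0.7L requirement rounds down to 282 positions, and the 10L requirement is 4030 sequences.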
Could you help me with the above questions? Thanks a lot.
Best,
Nan
Dear Nan,
Thank you for taking an interest in our work and sorry for the slow response.
Regarding your first question: yes, you are right about the jackhmmer flags. The PTEN file you are referring to is a preprocessed file. We have now added:
- additional MSA examples in the github repo (https://github.com/OATML-Markslab/EVE/tree/master/data/MSA)
- a more detailed description of this process in the README file (https://github.com/OATML-Markslab/EVE/blob/master/README.md#msa-creation)
We have also made all multiple sequence alignments available for download from our website evemodel.org. Please let us know if this does not fully address your question.
Regarding your second question: the condition used in the jackhmmer search is not that every sequence has Lcov* >= 0.7L, but rather that each sequence has Lcov* >= 0.5L. We obtained our multiple sequence alignments by running jackhmmer over a range of bitscore thresholds. Given the resulting options, we then picked one based on criteria relating to coverage and the number of sequences in the alignment. You are correct that having more than 10L sequences in the alignment is a lower priority than Lcov >= 0.7L.
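The two levels of filtering described here can be illustrated with a minimal Python sketch. It is not the EVE pipeline. It assumes that each jackhmmer run is summarised by its alignment-level coverage and sequence count, that coverage >= 0.7L is checked before depth >= 10L, and that remaining ties go to the deeper alignment (the tie-break is purely an assumption). All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateAlignment:
    bits_per_residue: float  # bitscore threshold used for this jackhmmer run
    covered_positions: int   # alignment-level Lcov
    num_sequences: int       # sequences kept in the alignment


def sequence_passes(aligned_residues: int, target_length: int,
                    min_fraction: float = 0.5) -> bool:
    """Per-sequence filter: keep a hit if it covers at least min_fraction * L."""
    return aligned_residues >= min_fraction * target_length


def pick_alignment(candidates: list[CandidateAlignment],
                   target_length: int) -> CandidateAlignment:
    """Choose one alignment, ranking coverage above depth."""
    def priority(c: CandidateAlignment) -> tuple[bool, bool, int]:
        coverage_ok = c.covered_positions >= 0.7 * target_length
        depth_ok = c.num_sequences >= 10 * target_length
        # Tuples compare left to right, so coverage outranks depth.
        # The final element is an assumed tie-break on sequence count.
        return (coverage_ok, depth_ok, c.num_sequences)

    return max(candidates, key=priority)
```

With the numbers from the question, `sequence_passes(204, 403)` is true because 204 exceeds 0.5 * 403 = 201.5, which is consistent with UniRef100_A0A4U5VQ93 appearing in the alignment even though it falls short of 0.7L.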
Finally, please note we have moved the official repo to the following address: https://github.com/OATML-Markslab/EVE.
Best regards,
The EVE team