Confirming results against foldseek
johnlees opened this issue
Can I check whether I should expect to get the same results from the following usage?
Using 5uak.pdb with foldseek 8.ef4e960:
foldseek createdb 5uak.pdb 5uak_DB
foldseek lndb 5uak_DB 5uak_DB_ss_h
foldseek convert2fasta 5uak_DB_ss_h 5uak.fasta
I get:
>5uak.pdb_A DEPHOSPHORYLATED, ATP-FREE HUMAN CYSTIC FIBROSIS TRANSMEMBRANE CONDUCTANCE REGULATOR (CFTR)
PLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLLGRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWELLQASAFCGLGFLIVLALFQAGLGRMMMKYRDQRAGKISERLVITSEMIENIQSVKAYCWEEAMEKMIENLRQTELKLTRKAAYVRYFNSSAFFFSGFFVVFLSVLPYALIKGIILRKIFTTISFCIVLRMAVTRQFPWAVQTWYDSLGAINKIQDFLQKQEYKTLEYNLTTTEVVMENVTAFWEPVLKDINFKIERGQLLAVAGSTGAGKTSLLMVIMGELEPSEGKIKHSGRISFCSQFSWIMPGTIKENIIFGVSYDEYRYRSVIKACQLEEDISKFAEKDNIVLGEGGITLSGGQRARISLARAVYKDADLYLLDSPFGYLDVLTEKEIFESCVCKLMANKTRILVTSKMEHLKKADKILILHEGSSYFYGTFSELQNLQPDFSSKLMTTWNTYLRYITVHKSLIFVLIWCLVIFLAEVAASLVVLWLSTSSYYVFYIYVGVADTLLAMGFFRGLPLVHTLITVSKILHHKMLHSVLQAPMSTLNTLKAGGILNRFSKDIAILDDLLPLTIFDFIQLLLIVIGAIAVVAVLQPYIFVATVPVIVAFIMLRAYFLQTSQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFGRQPYFETLFHKALNLHTANWFLYLSTLRWFQMRIEMIFVIFFIAVTFISILTTGEGEGRVGIILTLAMNIMSTLQWAVNSSIDVDSLMRSVSRVFKFIDMPTEGGQMTVKDLTAKYTEGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLLNTEGEIQIDGVSWDSITLQQWRKAFGVIPQKVFIFSGTFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFPGKLDFVLVDGGCVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPVTYQIIRRTLKQAFADCTVILCEHRIEAMLECQQFLVIEENKVRQYDSIQKLLNERSL
Using mini3di (main branch) as follows:
import mini3di
from Bio.PDB import PDBParser
parser = PDBParser()
encoder = mini3di.Encoder()
struct = parser.get_structure("5uak", "5uak.pdb")
for chain in struct.get_chains():
    states = encoder.encode_chain(chain)
    sequence = encoder.build_sequence(states)
    print(chain.get_id(), sequence)
I get:
A DADDDDPVCVLLVPVVVVVLVVLVVDFDDLVNFDFDHPCDFLVNLQVQLVVLVVVQVPPDPDRDPVVSLCVVVVPVLVVLLVLVLVLLVVLLVLLVLVLVLLVVPDPDDPDDLVSNVVSVVVSLVSLVVSLSSNLVSLLSQLVVLSSSLSNVVLNVVVVVLFAAVLVVVPPCVVVVVVLNVPCNPLRRNLSSLLSCQVSLVVLVVVLCVVPDALVHPLSCVLVVVVVVVLVVLVVLVVVLVVLVVVLVVLLVVLLVVVVLCLVPVVVCLLQVDQPLVLVVNVVSVVVSLVSLVSSVVSVQVSVLCLLQVLCVSLVVRPVVSCVVDNDDQSVSLSVSSSVVSNSCSPRPRNVSNVVSVVVNVVSSVVVVVVSPHDTDDQDAFPDPQQWWFWWQFAADDVVLAGDDGDIDHQQFEEEEEEDPPSCLVVVVCVVSPSDDTPDIDTHHHKAEFEFEPDQADAWWFLVCLLCPPPDDDVVVVVVLCVLLVCVVVQVVVVPGRTDGAFFPGVPDDPLVSLSSSVSSRLNDDIQEREYECSPDPDDVPVSCVSVVVCVVPVSPRHYYYHYDDDVVVLQVGFWYWYDYRNHTPDTGHVVVCCVSPVPVVVVVCDDPCLVVVLQPPADVLVVLVVLLVVVLVVQLVVLVVLVVDSCPPVSVVCSVSLSLSNSQADDDPRRDCNLVRQLVSLVVLLSLLVSLLPQAALQVNVVDDSVVVSCLSSPLSVCSNNVQSVLVNLLVSLVSSVVSLQVVLCVLVVVLVVVVPVLVVCLVVVLVSLVSVLVSLVVCLVVLVVVLVSLVVSCSVCSVSCSSVVCVVVSSVVNSVSVRSNCSSVSVNSSSVSSVVSVNVVSLSVSVSVSLVCLSPDDDPDSVSSSSSNSSSPPSVVSVVVNSVSVVVSSVSSVSVVVSVVSSPHDGFLAKKWFFQKWFDAHVVHDGLDGGETDIDHHQFFEEEAEDPPSDPVVVVCVVLPNGGIDGWMDGVPHGDPPDGSSVSSLQEAEDDPDFRDDQWWQCCLLCSPPDDDPVVLQVLCVLLPVNPVQVPDPDRGGDGCPPRNPPDDSLVSSSSSVSSSLVSPHQEYEYDDPPPPDPPVSSCSNVVRVCVSSVRGHYYYYDPDPVVCQVGQKYWYDDDNYIDIDGHCVRPVVPVPD
R DDDDDPVVVVVVVVVVVPD
Similarly, on the 8crb.pdb in the test suite, running the foldseek commands as above gives:
>8crb.pdb_A CRYO-EM STRUCTURE OF PCRV/FAB(11-E5)
EVQLLESGGGLVQPGGSLRLSCAASGFSFSSYAMSWVRQAPGKGLEWVSAISGSGGITYYGDSAKGRFTISRDNSKNTLYLEMSSLRADDTAVYYCAQERYCDSGSCYERDPVFEYWGQGTRVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEP
>8crb.pdb_B CRYO-EM STRUCTURE OF PCRV/FAB(11-E5)
QSVLTQPPSASGAPGQRVTISCSGSNSNIGTYFVYWYQQLPGTAPKVLIYRNDQRPSGVPDRISGSKSGTSASLAISGLRSEDEADYYCASWDASLRGYVFGPGTKVTVLGQPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADSSPVKAGVETTTPSKQSNNKYAASSYLSLTPEQWKSHRSYSCQVTHEGSTVEKTVAPTECS
>8crb.pdb_C CRYO-EM STRUCTURE OF PCRV/FAB(11-E5)
KRKALLDELKALTAELKVYSVIQSQINAALSAKQGIRIDAGGIDLVDPTLYGYAVGDPRWKDSPEYALLSNLDTFSGKLSIKDFLSGSPKQSGELKGLKDEYPFEKDNNPVGNFATTVSDRSRPLNDKVNEKTTLLNDT
This is quite different from the sequences asserted in the tests.
This is because those are not the right commands to generate the foldseek states; you need to run:
foldseek createdb 5uak.pdb 5uak_DB
foldseek lndb 5uak_DB_h 5uak_DB_ss_h # link the header, not the full db
foldseek convert2fasta 5uak_DB_ss 5uak.fasta # convert the secondary structure, not the header
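For anyone scripting this, here is a minimal Python sketch that wraps the three commands above. The function names `foldseek_3di_commands` and `run_foldseek_3di` are hypothetical helpers, not part of foldseek or mini3di, and the sketch assumes the `<name>_DB` naming used in this thread, with the databases written to the current directory.

```python
import shutil
import subprocess
from pathlib import Path


def foldseek_3di_commands(pdb_path, fasta_path):
    """Build the three foldseek invocations from this thread as argv lists.

    Hypothetical helper: the database name follows the `<stem>_DB`
    convention used in the thread (5uak.pdb -> 5uak_DB).
    """
    db = f"{Path(pdb_path).stem}_DB"
    return [
        ["foldseek", "createdb", str(pdb_path), db],
        # Link the header database, not the full database.
        ["foldseek", "lndb", f"{db}_h", f"{db}_ss_h"],
        # Convert the "_ss" database, not the header.
        ["foldseek", "convert2fasta", f"{db}_ss", str(fasta_path)],
    ]


def run_foldseek_3di(pdb_path, fasta_path):
    """Run the commands in order, stopping at the first failure."""
    if shutil.which("foldseek") is None:
        raise RuntimeError("foldseek was not found on PATH")
    for argv in foldseek_3di_commands(pdb_path, fasta_path):
        subprocess.run(argv, check=True)
```

Calling `run_foldseek_3di("5uak.pdb", "5uak.fasta")` should be equivalent to typing the three commands by hand. Building the argument lists separately makes it easy to print and check them before anything is executed.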
Oh excellent, thank you so much! That's great and does match. Excited to use this 🙂
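In case it helps anyone repeating the check, here is a rough sketch of how the comparison could be automated instead of done by eye. `read_fasta`, `chain_id` and `compare_states` are hypothetical helpers written for this example. The sketch assumes the FASTA record IDs look like `5uak.pdb_A` (chain ID after the last underscore), as in the output above, and that the mini3di strings have already been collected into a dict keyed by chain ID.

```python
def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}.

    The record ID is the first whitespace-delimited token of the header.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def chain_id(record_id):
    """Assumption: IDs look like '5uak.pdb_A'; take the part after the last '_'."""
    return record_id.rsplit("_", 1)[-1]


def compare_states(foldseek_fasta, mini3di_by_chain):
    """Return {chain: bool} for chains present in both inputs."""
    foldseek_by_chain = {
        chain_id(rid): seq for rid, seq in read_fasta(foldseek_fasta).items()
    }
    return {
        chain: foldseek_by_chain[chain] == seq
        for chain, seq in mini3di_by_chain.items()
        if chain in foldseek_by_chain
    }
```

Chains that appear on only one side are skipped rather than reported as mismatches, so a chain missing from the foldseek FASTA would need a separate check.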
I'm happy to make a release to PyPI if you plan on using this externally 👍
Yes, that would be really helpful!
Done 😄