althonos / mini3di

A NumPy port of the foldseek code for encoding protein structures to 3di.

Confirming results against foldseek

johnlees opened this issue:

Can I check that I should expect to get the same results with the following usage?

Using 5uak.pdb with foldseek 8.ef4e960:

foldseek createdb 5uak.pdb 5uak_DB
foldseek lndb 5uak_DB 5uak_DB_ss_h
foldseek convert2fasta 5uak_DB_ss_h 5uak.fasta

I get:

>5uak.pdb_A DEPHOSPHORYLATED, ATP-FREE HUMAN CYSTIC FIBROSIS TRANSMEMBRANE CONDUCTANCE REGULATOR (CFTR)
PLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLLGRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWELLQASAFCGLGFLIVLALFQAGLGRMMMKYRDQRAGKISERLVITSEMIENIQSVKAYCWEEAMEKMIENLRQTELKLTRKAAYVRYFNSSAFFFSGFFVVFLSVLPYALIKGIILRKIFTTISFCIVLRMAVTRQFPWAVQTWYDSLGAINKIQDFLQKQEYKTLEYNLTTTEVVMENVTAFWEPVLKDINFKIERGQLLAVAGSTGAGKTSLLMVIMGELEPSEGKIKHSGRISFCSQFSWIMPGTIKENIIFGVSYDEYRYRSVIKACQLEEDISKFAEKDNIVLGEGGITLSGGQRARISLARAVYKDADLYLLDSPFGYLDVLTEKEIFESCVCKLMANKTRILVTSKMEHLKKADKILILHEGSSYFYGTFSELQNLQPDFSSKLMTTWNTYLRYITVHKSLIFVLIWCLVIFLAEVAASLVVLWLSTSSYYVFYIYVGVADTLLAMGFFRGLPLVHTLITVSKILHHKMLHSVLQAPMSTLNTLKAGGILNRFSKDIAILDDLLPLTIFDFIQLLLIVIGAIAVVAVLQPYIFVATVPVIVAFIMLRAYFLQTSQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFGRQPYFETLFHKALNLHTANWFLYLSTLRWFQMRIEMIFVIFFIAVTFISILTTGEGEGRVGIILTLAMNIMSTLQWAVNSSIDVDSLMRSVSRVFKFIDMPTEGGQMTVKDLTAKYTEGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLLNTEGEIQIDGVSWDSITLQQWRKAFGVIPQKVFIFSGTFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFPGKLDFVLVDGGCVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPVTYQIIRRTLKQAFADCTVILCEHRIEAMLECQQFLVIEENKVRQYDSIQKLLNERSL

Using mini3di (main branch) as follows:

import mini3di
from Bio.PDB import PDBParser
parser = PDBParser()
encoder = mini3di.Encoder()
struct = parser.get_structure("5uak", "5uak.pdb")
for chain in struct.get_chains():
    states = encoder.encode_chain(chain)
    sequence = encoder.build_sequence(states)
    print(chain.get_id(), sequence)

I get:

A DADDDDPVCVLLVPVVVVVLVVLVVDFDDLVNFDFDHPCDFLVNLQVQLVVLVVVQVPPDPDRDPVVSLCVVVVPVLVVLLVLVLVLLVVLLVLLVLVLVLLVVPDPDDPDDLVSNVVSVVVSLVSLVVSLSSNLVSLLSQLVVLSSSLSNVVLNVVVVVLFAAVLVVVPPCVVVVVVLNVPCNPLRRNLSSLLSCQVSLVVLVVVLCVVPDALVHPLSCVLVVVVVVVLVVLVVLVVVLVVLVVVLVVLLVVLLVVVVLCLVPVVVCLLQVDQPLVLVVNVVSVVVSLVSLVSSVVSVQVSVLCLLQVLCVSLVVRPVVSCVVDNDDQSVSLSVSSSVVSNSCSPRPRNVSNVVSVVVNVVSSVVVVVVSPHDTDDQDAFPDPQQWWFWWQFAADDVVLAGDDGDIDHQQFEEEEEEDPPSCLVVVVCVVSPSDDTPDIDTHHHKAEFEFEPDQADAWWFLVCLLCPPPDDDVVVVVVLCVLLVCVVVQVVVVPGRTDGAFFPGVPDDPLVSLSSSVSSRLNDDIQEREYECSPDPDDVPVSCVSVVVCVVPVSPRHYYYHYDDDVVVLQVGFWYWYDYRNHTPDTGHVVVCCVSPVPVVVVVCDDPCLVVVLQPPADVLVVLVVLLVVVLVVQLVVLVVLVVDSCPPVSVVCSVSLSLSNSQADDDPRRDCNLVRQLVSLVVLLSLLVSLLPQAALQVNVVDDSVVVSCLSSPLSVCSNNVQSVLVNLLVSLVSSVVSLQVVLCVLVVVLVVVVPVLVVCLVVVLVSLVSVLVSLVVCLVVLVVVLVSLVVSCSVCSVSCSSVVCVVVSSVVNSVSVRSNCSSVSVNSSSVSSVVSVNVVSLSVSVSVSLVCLSPDDDPDSVSSSSSNSSSPPSVVSVVVNSVSVVVSSVSSVSVVVSVVSSPHDGFLAKKWFFQKWFDAHVVHDGLDGGETDIDHHQFFEEEAEDPPSDPVVVVCVVLPNGGIDGWMDGVPHGDPPDGSSVSSLQEAEDDPDFRDDQWWQCCLLCSPPDDDPVVLQVLCVLLPVNPVQVPDPDRGGDGCPPRNPPDDSLVSSSSSVSSSLVSPHQEYEYDDPPPPDPPVSSCSNVVRVCVSSVRGHYYYYDPDPVVCQVGQKYWYDDDNYIDIDGHCVRPVVPVPD
R DDDDDPVVVVVVVVVVVPD

Similarly, on the 8crb.pdb in the test suite, running the foldseek commands as above gives:

>8crb.pdb_A CRYO-EM STRUCTURE OF PCRV/FAB(11-E5)
EVQLLESGGGLVQPGGSLRLSCAASGFSFSSYAMSWVRQAPGKGLEWVSAISGSGGITYYGDSAKGRFTISRDNSKNTLYLEMSSLRADDTAVYYCAQERYCDSGSCYERDPVFEYWGQGTRVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEP
>8crb.pdb_B CRYO-EM STRUCTURE OF PCRV/FAB(11-E5)
QSVLTQPPSASGAPGQRVTISCSGSNSNIGTYFVYWYQQLPGTAPKVLIYRNDQRPSGVPDRISGSKSGTSASLAISGLRSEDEADYYCASWDASLRGYVFGPGTKVTVLGQPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADSSPVKAGVETTTPSKQSNNKYAASSYLSLTPEQWKSHRSYSCQVTHEGSTVEKTVAPTECS
>8crb.pdb_C CRYO-EM STRUCTURE OF PCRV/FAB(11-E5)
KRKALLDELKALTAELKVYSVIQSQINAALSAKQGIRIDAGGIDLVDPTLYGYAVGDPRWKDSPEYALLSNLDTFSGKLSIKDFLSGSPKQSGELKGLKDEYPFEKDNNPVGNFATTVSDRSRPLNDKVNEKTTLLNDT

This is quite different from the sequences asserted in the tests.

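This is because you are not using the right commands to generate the foldseek states. Your commands link the full database in place of the header file, so the FASTA you get contains the amino-acid sequences rather than the 3Di states. You need to do: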

foldseek createdb 5uak.pdb 5uak_DB
foldseek lndb 5uak_DB_h 5uak_DB_ss_h            # link the header, not the full db
foldseek convert2fasta 5uak_DB_ss 5uak.fasta    # convert the 3Di states (_ss), not the header

See steineggerlab/foldseek#15 (comment).
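Once the FASTA holds the 3Di states, the comparison with mini3di can be scripted instead of checked by eye. The following is a minimal sketch, not part of mini3di or foldseek: `read_fasta`, `chain_id` and `compare` are hypothetical helper names, and the chain lookup assumes record names of the form `<file>_<chain>`, as in `>5uak.pdb_A` above.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into {record name: sequence}.

    The record name is the first word after '>', e.g. '5uak.pdb_A'.
    """
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def chain_id(record_name):
    # Assumption: the chain ID is whatever follows the last underscore.
    return record_name.rsplit("_", 1)[-1]


def compare(foldseek_records, mini3di_by_chain):
    """Return {chain ID: bool} for every chain present in both inputs."""
    result = {}
    for name, sequence in foldseek_records.items():
        cid = chain_id(name)
        if cid in mini3di_by_chain:
            result[cid] = sequence == mini3di_by_chain[cid]
    return result


# Toy data standing in for a real foldseek FASTA and mini3di output.
toy_fasta = io.StringIO(">toy.pdb_A example\nDADD\nDDPV\n>toy.pdb_B example\nVVVV\n")
toy_mini3di = {"A": "DADDDDPV", "B": "LLLL"}
print(compare(read_fasta(toy_fasta), toy_mini3di))
```

With real data, `mini3di_by_chain` would be filled from the loop in the original post, using `chain.get_id()` as the key and the output of `encoder.build_sequence(states)` as the value, and the FASTA would be read with `open("5uak.fasta")`.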

Oh excellent, thank you so much! That's great and does match. Excited to use this 🙂

I'm happy to make a release to PyPI if you plan on using this externally 👍

Yes, that would be really helpful!

Done 😄