Graylab / IgLM

Generative Language Modeling for Antibody Design

Add demo data file

jusjosgra opened this issue

I tried running the CDR redesign example you provide by downloading the FASTA file for 1JPT from the PDB. However, the file format does not appear to be compatible with what IgLM expects. Could you provide an explicit example of what the input files should look like? This would help with accessibility.

Thanks for providing the code; in general, the interface is very usable.

I am able to initiate infilling with the following FASTA file and command.

FASTA file:

>1JPT_1
DIQMTQSPSSLSASVGDRVTITCRASRDIKSYLNWYQQKPGKAPKVLIYYATSLAEGVPSRFSGSGSGTDYTLTISSLQPEDFATYYCLQHGESPWTFGQGTKVEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC
>1JPT_2
EVQLVESGGGLVQPGGSLRLSCAASGFNIKEYYMHWVRQAPGKGLEWVGLIDPEQGNTIYDPKFQDRATISADNSKNTAYLQMNSLRAEDTAVYYCARDTAAYFDYWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTHT

Command:
iglm_infill data/antibodies/rcsb_pdb_1JPT.fasta 1JPT_2 98 106 --chain_token [HEAVY] --species_token [HUMAN] --num_seqs 100
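
To sanity-check which residues the indices 98 and 106 select in record 1JPT_2, here is a minimal Python sketch. It assumes that iglm_infill treats the start index as 0-indexed and the end index as 0-indexed and exclusive; if the tool uses a different convention, the span shifts accordingly. The helper `read_fasta` is written for this illustration only and is not part of IgLM.

```python
# Heavy-chain record 1JPT_2, copied from the FASTA file above.
FASTA_TEXT = """>1JPT_2
EVQLVESGGGLVQPGGSLRLSCAASGFNIKEYYMHWVRQAPGKGLEWVGLIDPEQGNTIYDPKFQDRATISADNSKNTAYLQMNSLRAEDTAVYYCARDTAAYFDYWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTHT
"""


def read_fasta(text):
    """Parse FASTA text into a {record_id: sequence} dict (illustrative helper)."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Record ID is the first whitespace-delimited token after '>'.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


records = read_fasta(FASTA_TEXT)
heavy = records["1JPT_2"]

# Assumption: 0-indexed start, exclusive end, as in Python slicing.
start, end = 98, 106
span = heavy[start:end]
print(f"residues {start}..{end - 1}: {span}")
```

Under that indexing assumption the selected span is DTAAYFDY, which sits between the CAR and WGQG motifs and so looks like CDR H3.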

However, running the command results in the following error (seemingly repeated in an infinite loop):

Input length of input_ids is 221, but `max_length` is set to 150. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Input length of input_ids is 221, but `max_length` is set to 150. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Input length of input_ids is 221, but `max_length` is set to 150. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.

I have tried updating max_length to 300, but this had no effect.

I was able to run CDR infilling successfully by truncating the input sequence. However, this is not really a solution, particularly since I am working with the example PDB entry you provide. Could you advise how I should use IgLM with antibody sequences longer than 150 amino acids?
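
For reference, the truncation workaround can be sketched roughly as below. This is only an illustration under stated assumptions: the function name `trim_to_variable_domain` is hypothetical, and cutting after the first "VTVSS" motif is a heuristic that happens to match the end of the heavy-chain variable domain in this record; it is not a general rule and not something IgLM does itself.

```python
# Heavy chain of 1JPT (record 1JPT_2 from the FASTA file above).
HEAVY = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFNIKEYYMHWVRQAPGKGLEWVGLIDPEQGNTIYDPKFQDRATISADNSKNTAY"
    "LQMNSLRAEDTAVYYCARDTAAYFDYWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNS"
    "GALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTHT"
)


def trim_to_variable_domain(seq, motif="VTVSS"):
    """Cut the sequence right after the first occurrence of `motif`.

    Hypothetical helper: the motif-based cut is an assumption that fits
    this heavy chain, not a general antibody numbering scheme.
    """
    idx = seq.find(motif)
    if idx == -1:
        # Motif not found: return the sequence unchanged rather than guess.
        return seq
    return seq[: idx + len(motif)]


trimmed = trim_to_variable_domain(HEAVY)
print(len(trimmed), trimmed[-10:])

# Write the trimmed record so it can be passed to iglm_infill as before.
with open("1JPT_2_trimmed.fasta", "w") as handle:
    handle.write(f">1JPT_2\n{trimmed}\n")
```

The trimmed record is 117 residues long, and the 98 to 106 span lies before the cut, so the same indices still apply. Whether 150 is a hard limit or just a default is not something this sketch can answer.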