luost26 / diffab

✌🏻 Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures (NeurIPS 2022)

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Issue with non-reproducibility in the energy calculation

alexandrefassio opened this issue · comments

Hi,

As a proof-of-concept, I've been trying to sample the H-CDR3 structure of a given antibody to see if your method would be able to reproduce the crystal structure conformation. For the designed structures I'm especially interested in identifying those conformations where the RMSD is <= 2 Å.

I'm not sure if my protocol is correct, but I've been running it as follows:

  • Step 1: run design_pdb.py with the config file 'strpred.yml'. I've modified it to sample only the H-CDR3.
  • Step 2: run diffab/tools/relax/run.py with the pipeline set to 'pyrosetta' to relax the sampled structures.
  • Step 3: run diffab/tools/eval/run.py to obtain RMSDs and energies.

With this protocol, I was able to identify some conformations where the RMSD was smaller than 2 Å.

However, I noticed that across different runs, the energy for the reference structure was changing, which would make my analysis not comparable and non-reproducible. For instance, the absolute difference between the dG_ref in two runs was 100, which is a lot.

So, I started debugging the code and found out that 'diffab/tools/relax/run.py' relaxes the reference and 'diffab/tools/eval/run.py' uses it as the reference to calculate the energy. That explains why across different runs I got different dG_refs. I then set PyRosetta to use the same seed in 'diffab/tools/relax/run.py' and I was able to get the same values across different calls of 'diffab/tools/eval/run.py'.

Despite that, I'm still getting minor differences for the reference within the same run. See an example below:

image

The difference is small, but it is strange to have different energies for the same complex. I tried to set PyRosetta's seed in 'diffab/tools/eval/run.py', but it didn't work.

Would you know what is going on here?

thank you for this very great tool and for your help.
bests

Hi,
I have the same problem as you, and i wonder how did you set PyRosetta to use the same seed in 'diffab/tools/relax/run.py'?
I look forward to your prompt reply.

Hi @userscy

After debugging the code for a while I figured out two things about this issue.

  1. You need to modify line # 17 from diffab/utils/transforms.merge.py as shown below:

chains = {c: i for i, c in enumerate(sorted(chains))}

The original code doesn't have the sorted(), which causes the sampled structures (outputted PDB files) to have random chain orders. This does make a difference in reproducibility.

Finally, to set the seed in Pyrosetta, you need to add the flag -constant_seed into the init() method in diffab/tools/relax/pyrosetta_relaxer.py. It will look like this:

pyrosetta.init(' '.join([
    '-mute', 'all',
    '-use_input_sc',
    '-ignore_unrecognized_res',
    '-ignore_zero_occupancy', 'false',
    '-load_PDB_components', 'false',
    '-relax:default_repeats', '2',
    '-no_fconfig',
    '-constant_seed'
]))
  1. However, despite achieving reproducible runs with this strategy, I found out that comparison across different CDRs is not possible with the Pyrosetta protocol. That's because the protocol they've implemented in DiffAb minimizes only CDR residues and their immediate neighbors. That means the final minimized REF structure will be different across CDRs. More specifically, if you design sequences for CDR1 and CDR3, for instance, then you cannot compare the dG results between them.

It seems to me that the OpenMM protocol minimizes the whole structure, which makes more sense. So I've been using it instead.

Let me know if this works for you.

Have you achieved any interesting results on this matter as well?

bests