salesforce / progen

Official release of the ProGen models

correlation with Facebook ESM's log_likelihood

avilella opened this issue

Hi all,

I've tested progen2 on an antibody Fv for which we've also calculated Facebook's ESM log_likelihood stability values, and we see very little correlation between the two. I've tested the small (msma), medium (mmed) and OAS (moas) models, and the Pearson correlations are below. This is an AA scan (substituting each amino acid with each of the other 19) over all positions in the heavy and light chains. The Fv is represented as heavy+(GGGGSx4)+light chain.
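For anyone wanting to reproduce the scan, here is a minimal sketch of how the variants could be enumerated. The function names and the mutation label format (e.g. `E1A`) are made up for illustration, and it assumes the linker positions are left untouched, since only the heavy and light chains are scanned.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
LINKER = "GGGGS" * 4                  # (GGGGS)x4 linker between the chains


def build_fv(heavy, light, linker=LINKER):
    """Concatenate heavy chain, linker and light chain into one sequence."""
    return heavy + linker + light


def aa_scan(heavy, light, linker=LINKER):
    """Yield (label, mutant) for every single substitution in either chain.

    Labels use 1-based positions in the full heavy+linker+light construct.
    Linker positions are skipped (an assumption of this sketch).
    """
    fv = build_fv(heavy, light, linker)
    light_start = len(heavy) + len(linker)
    positions = list(range(len(heavy))) + list(range(light_start, len(fv)))
    for pos in positions:
        wild_type = fv[pos]
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue  # only the 19 non-wild-type residues
            yield f"{wild_type}{pos + 1}{aa}", fv[:pos] + aa + fv[pos + 1:]
```

Each yielded sequence can then be scored by both models and the two score columns correlated.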

prg2_ll_sum1 is the first likelihood sum value that appears in the output; prg2_ll_sum3 is the second one (right to left?).

Is this expected? I naively was expecting to see a good correlation between these likelihood values and Facebook's ESM likelihood values. Thx in advance,

msma===
prg2_ll_sum3:count    4064
-------------------   -------
prg2_ll_sum3:min      -840.91
prg2_ll_sum3:median   -836.44
prg2_ll_sum3:max      -830.70
file                                                                          num_cols   num_rows
c5f047165d4b778ddd16807a6840a52a.msma.prg2.scan.csv          5      4,064
prg2_ll_sum3,log_likelihood,0.0185
log_likelihood,prg2_ll_sum1,0.0270
prg2_ll_sum3,llh_delta,0.0048
prg2_ll_sum1,llh_delta,-0.0034
mmed===
prg2_ll_sum3:count    4064
-------------------   --------
prg2_ll_sum3:min      -1119.47
prg2_ll_sum3:median   -1111.60
prg2_ll_sum3:max      -1105.50
file                                                                          num_cols   num_rows
c5f047165d4b778ddd16807a6840a52a.mmed.prg2.scan.csv          5      4,064
prg2_ll_sum3,log_likelihood,-0.0787
prg2_ll_sum1,log_likelihood,-0.0682
prg2_ll_sum3,llh_delta,0.0685
llh_delta,prg2_ll_sum1,0.0616
moas===
prg2_ll_sum3:count    4064
-------------------   --------
prg2_ll_sum3:min      -1541.08
prg2_ll_sum3:median   -1530.23
prg2_ll_sum3:max      -1515.66
file                                                                          num_cols   num_rows
c5f047165d4b778ddd16807a6840a52a.moas.prg2.scan.csv          5      4,064
prg2_ll_sum3,log_likelihood,0.0116
prg2_ll_sum1,log_likelihood,-0.0096
prg2_ll_sum3,llh_delta,0.0134
prg2_ll_sum1,llh_delta,0.0272
commented

@avilella, it's not concerning that progen2 likelihoods are not correlated with ESM-1v, especially for data that is not well represented in UniRef (like antibodies).

FYI, i would simply use the average likelihood (across left-to-right and right-to-left) for your zero-shot variant prediction task
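A rough sketch of that averaging, in case it helps. `score_fn` is a hypothetical stand-in for whatever returns a ProGen2 log-likelihood for a sequence string (it is not a function from this repo), and the right-to-left pass is approximated here by scoring the reversed string. Any terminal tokens the model expects are left to `score_fn`.

```python
from typing import Callable


def bidirectional_score(seq: str, score_fn: Callable[[str], float]) -> float:
    """Average the log-likelihood of a sequence read in both directions."""
    forward = score_fn(seq)         # left-to-right
    backward = score_fn(seq[::-1])  # right-to-left (reversed string)
    return 0.5 * (forward + backward)


def variant_effect(wild_type: str, mutant: str,
                   score_fn: Callable[[str], float]) -> float:
    """Zero-shot score for a variant: mutant minus wild-type average."""
    return (bidirectional_score(mutant, score_fn)
            - bidirectional_score(wild_type, score_fn))
```

Scoring each variant relative to the wild type gives a delta that is easier to compare across models than the raw sums.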

#6 should avoid some of the confusion. @enijkamp

@avilella I was wondering if you have any further updates on this, as I've been trying to do similar things (see here). In my own experiments, using the LM to pick the "most likely mutations" leads to sequences with highly skewed aa_type distributions (lots of A, N, Q residues). To be fair, I'm running this on totally de novo seqs, so it could be that the LM just falls off a cliff in those regions of sequence space...

The weirdest result for me so far is that, if you keep sampling the "most likely" mutations at randomly sampled indices, you eventually end up with sequences that look very non-natural (many repetitions and only a small fraction of aa_types used). So this whole idea of sequence enrichment using Language Models (video) (code) is not really working with this model it seems...

commented

there seem to be two separate concerns mixed in here. it would be better to have separate issue threads

for @avilella: yes, there are many cases where different models do not have correlated likelihoods. from a practical perspective, that's not necessarily bad: uncorrelated models may actually give larger performance gains when ensembled. if you want the very best results, try ensembling ProGen2 with any of the following: EVE, ESM-1v, Tranception
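One simple way to ensemble, sketched under assumptions: z-score each model's per-variant scores so they share a scale, then average them. This is a generic scheme and not necessarily what any of the papers behind these models do. It also assumes higher means better for every model and that the score lists are aligned variant by variant.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a list of scores to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant scores carry no ranking signal
    return [(v - mu) / sigma for v in values]


def ensemble_scores(*model_scores):
    """Average per-variant z-scores across any number of models."""
    if len({len(scores) for scores in model_scores}) != 1:
        raise ValueError("all models must score the same variants")
    normalized = [zscore(list(scores)) for scores in model_scores]
    return [mean(column) for column in zip(*normalized)]
```

Standardizing first matters because raw log-likelihood sums live on very different scales depending on the model.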

commented

for @aiXander , are you using the latest codebase? there was a critical bug in the codebase for the first couple of days after release. it caused a lot of issues

what sampling technique are you using? i'd use a simple MCMC based approach as a strong baseline where the acceptance/rejection is determined by ProGen2. You can find a simple algorithm in our Genhance paper or the Biswas et al Low-N protein engineering paper
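A minimal sketch of such a baseline: plain Metropolis sampling with single-site substitution proposals. This is not the exact algorithm from either paper. `score_fn` is a hypothetical callable returning a log-likelihood (for example from ProGen2), and `temperature` and `n_steps` are illustrative parameters.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def metropolis_sample(start, score_fn, n_steps=1000, temperature=1.0, rng=None):
    """Random-walk Metropolis over sequences; returns the best (seq, score) seen."""
    rng = rng or random.Random()
    current, current_score = start, score_fn(start)
    best, best_score = current, current_score
    for _ in range(n_steps):
        # Symmetric proposal: uniform position, uniform different residue.
        pos = rng.randrange(len(current))
        new_aa = rng.choice([a for a in AMINO_ACIDS if a != current[pos]])
        proposal = current[:pos] + new_aa + current[pos + 1:]
        proposal_score = score_fn(proposal)
        delta = proposal_score - current_score
        # Always accept improvements; accept worse moves with prob exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, proposal_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score
```

A lower temperature makes the chain greedier, while a higher one lets it accept more unfavorable mutations and explore further from the start sequence.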

if you still are getting odd behavior, create a new issue with the WT and your process and we can try to debug what's going on...

This example script shows the explained behavior (just put in base progen dir and run from terminal):

start from a random init seq and then:

  • sample a random residue index
  • evaluate the likelihood of all 20 point mutations at that location
  • sample a single mutation using the normalized likelihoods as probabilities
  • repeat
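The steps above can be sketched roughly as follows. `score_fn` is a hypothetical callable returning the model's log-likelihood for a sequence, and the wild-type residue is kept among the 20 candidates. Normalizing likelihoods is done as a softmax over log-likelihoods, which makes each update essentially a Gibbs-style resampling of one position.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def resample_position(seq, pos, score_fn, rng):
    """Resample one position in proportion to each candidate's likelihood."""
    candidates = [seq[:pos] + aa + seq[pos + 1:] for aa in AMINO_ACIDS]
    log_liks = [score_fn(c) for c in candidates]
    top = max(log_liks)  # subtract the max so exp() cannot overflow
    weights = [math.exp(ll - top) for ll in log_liks]
    return rng.choices(candidates, weights=weights, k=1)[0]


def iterate_mutations(seq, score_fn, n_steps, rng=None):
    """Repeat: pick a random index, then resample the residue there."""
    rng = rng or random.Random()
    for _ in range(n_steps):
        pos = rng.randrange(len(seq))
        seq = resample_position(seq, pos, score_fn, rng)
    return seq
```

Tracking the residue composition of `seq` every few steps is one way to see when the distribution starts to skew.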

For random init seqs this leads to quickly degrading sequence patterns, but for natural sequences it seems to be more stable. So I'm wondering if this shows that Progen cannot generalize its likelihood scores to truly de novo sequences...

Thanks @a-mad, can you point me to the repos/colabs for EVE and Tranception? I'll add them to the mix to see how much they correlate with Progen2 and ESM-1v.

I was naively expecting that the description of ESM-1v's llh, "we find that [...] sequence log-likelihoods from GVP-GNN are effective zero-shot predictors of stability changes of protein complexes even without predicted structures as training data", would also apply in Progen2's case, i.e. that its likelihood would be a "predictor of stability" for the protein.