datamol-io / molfeat

molfeat - the hub for all your molecular featurizers

Home Page: https://molfeat.datamol.io


New featurizer

ethancohen123 opened this issue

Description & Motivation

It seems to be trained on more data than any of the featurizers currently present in molfeat.

Is your featurizer open-source?

  • Yes it's open source

Are you willing to contribute a PR for this featurizer?

  • Yes I'm willing to contribute

Pitch

Hi,
I just saw this featurizer and it seems good: https://huggingface.co/entropy/gpt2_zinc_87m
Any chance it could be added to molfeat? @cwognum @hadim
Thanks

Featurizer description

Featurizer card
# list of authors
authors:
  - author 1
# describe the featurizer
description: ~
# which type of input does the featurizer expect?
inputs: ~
# name of the featurizer
name: ~
# reference of the featurizer (a paper or a link)
reference: ~
# what does the featurizer return as output for molecular representation?
representation: ~
# does the featurizer require 3D information?
require_3D: ~

Hi @ethancohen123 ,

Thank you for the proposal! Since molfeat already supports loading 🤗 Transformers models through our PretrainedHFTransformer class, this should be a relatively easy addition! I went ahead and tried filling in all the details for the Model Card.

{
"name": "GPT2 Zinc 87m", 
"inputs": "smiles", 
"type": "pretrained", 
"version": 0, 
"group": "huggingface", 
"submitter": "Ethan Cohen", 
"description": "This is a GPT2 style autoregressive language model with ~87m parameters, trained on ~480m SMILES strings from the ZINC database. This model is useful for generating drug-like molecules or generating embeddings from SMILES strings", 
"representation": "line-notation", 
"require_3D": false, 
"tags": ["smiles", "huggingface", "transformers", "GPT2", "Zinc"], 
"authors": ..., 
"reference": ..., 
}

What I couldn't figure out, is whether there's any scientific publication associated with this work. Are you aware of any such publication? If not, maybe we can open a discussion on 🤗 to clarify?

Both models in the huggingface repo have been added.

See: https://github.com/datamol-io/molfeat/blob/5f314ee6b3a90bb79834df80eb12755e9f8ce2f1/nb/etl/entropy-transforner-zinc-etl.ipynb and #54

@cwognum I have already filled in the information and uploaded the models. There is no paper, but the models are MIT-licensed and the GitHub repos are indicated.

I will fix CUDA-based featurization for HuggingFace models before the next release, but in the meantime you can already do:

import torch
from molfeat.trans.pretrained import PretrainedHFTransformer
smiles = [
  'Brc1cc2c(NCc3ccccc3)ncnc2s1', 
  'Brc1cc2c(NCc3ccccn3)ncnc2s1',
  'Brc1cc2c(NCc3cccs3)ncnc2s1',
  'Brc1cc2c(NCc3ccncc3)ncnc2s1',
  'Brc1cc2c(Nc3ccccc3)ncnc2s1'
]
# mean pooling over the last hidden layer, returning one embedding per molecule
gpt_transformer = PretrainedHFTransformer("GPT2-Zinc480M-87M", max_len=256, pooling="mean", layer=-1, dtype=torch.float)
gpt_transformer(smiles)

Hi All,
I'm the author of the models here. Really cool to see molfeat incorporating them!

A couple of points:
There are no associated publications; you can just cite the associated GitHub repos (gpt model, roberta model).

For pooling, I would use mean pooling for both models. I saw that this notebook looked at GPT pooling, which (if I'm reading the code correctly) is not correct for these models.

From the GPTPooler code:

"""
Default GPT pooler as implemented in huggingface transformers
The Bart pooling function focusing on the first token ([CLS]) to get a sentence representation.
"""
...
if self.pad_token_id is None:
    sequence_lengths = -1
else:
    sequence_lengths = torch.ne(inputs, self.pad_token_id).sum(-1) - 1
pooled_output = h[torch.arange(batch_size), sequence_lengths]

If I'm reading this correctly, this code is grabbing the final non-padding token in each sentence expecting it to be a CLS token (even though the docstring says CLS should be the first token?). The Zinc GPT model doesn't use a CLS token, and no additional classification/prediction task was used during training, so I would stick with mean pooling over the representations.
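
For reference, here is a minimal sketch of mean pooling over the non-padding tokens in plain PyTorch / 🤗 Transformers; the AutoTokenizer/AutoModel loading and the pad-token fallback are assumptions for illustration, not the exact snippet from the model card:

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "entropy/gpt2_zinc_87m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# assumption: if the tokenizer defines no pad token, reuse the eos token for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

smiles = ["Brc1cc2c(NCc3ccccc3)ncnc2s1", "Brc1cc2c(Nc3ccccc3)ncnc2s1"]
enc = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)  # out.last_hidden_state has shape (batch, seq_len, hidden_dim)

# average only over the real (non-padding) tokens, using the attention mask
mask = enc["attention_mask"].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)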

Hey @kheyer, thanks for getting back to us.

The default pooling is mean pooling everywhere. In the notebook, you can see that this is what is evaluated directly against the snippet in your huggingface repo, and the exact same embeddings are returned.

There were some typos in the documentation of the GPT pooling (they have been fixed): it does not look for the CLS token but for the last non-padding token in the input (likely the EOS token). There is a strong rationale for the EOS token being the best representation of a sequence for decoder-only models, since you intuitively model P(sequence), and this is the approach in most GPT-based sequence-level tasks. Here we are not explicitly taking the EOS token but instead checking the sequence length, exactly to handle cases where EOS or CLS tokens are not used.

If you are finetuning the GPT model and not using linear probing, then you likely want to get the representation of the last non-padding token instead of the mean.
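
To make that concrete, here is a toy sketch of the indexing the GPT pooler does, i.e. selecting the last non-padding token (the EOS token when one is appended) from each sequence; the tensors below are made up for illustration:

import torch

# toy hidden states: batch of 2 sequences, max length 5, hidden size 4
h = torch.randn(2, 5, 4)
# attention mask: 1 for real tokens, 0 for padding
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# index of the last non-padding token in each sequence
last_idx = attention_mask.sum(dim=1) - 1           # tensor([2, 4])
pooled = h[torch.arange(h.size(0)), last_idx]      # shape: (2, 4)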

@mercuryseries can you update the website?