datamol-io / molfeat

molfeat - the hub for all your molecular featurizers

Home Page: https://molfeat.datamol.io


cannot download model

peiyaoli opened this issue · comments

Is there an existing issue for this?

  • I have searched the existing issues and found nothing

Bug description

I tried to search for a model with:
model_card = store.search(name="ChemBERTa-77M-MLM")[0]

but got:
AttributeError: 'ModelInfo' object has no attribute 'model_dump'

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Molfeat version (e.g., 0.1.0):
#- PyTorch Version (e.g., 1.10.0):
#- RDKit version (e.g., 2022.09.5): 
#- scikit-learn version (e.g.,  1.2.1): 
#- OS (e.g., Linux):
#- How you installed Molfeat (`conda`, `pip`, source):

Additional context

No response

Hey @peiyaoli ! Thank you for reporting this bug! Before we investigate, it would be useful to have some additional information on your environment, specifically your Molfeat version, Python version, Pydantic version and how you installed your environment (i.e. pip, conda, source).
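For context on why the Pydantic version matters here: `model_dump` only exists on Pydantic 2.x models, while Pydantic 1.x models expose `.dict()` instead, so this `AttributeError` typically points at a Pydantic 1.x install. A minimal, hypothetical compatibility shim (the function name `dump_model` is illustrative and not part of molfeat) could look like:

```python
def dump_model(model):
    """Serialize a Pydantic-style model to a dict across Pydantic 1.x and 2.x.

    Pydantic >= 2 provides `model_dump()`; Pydantic 1.x provides `dict()`.
    """
    if hasattr(model, "model_dump"):  # Pydantic >= 2
        return model.model_dump()
    return model.dict()  # Pydantic 1.x fallback
```

This is only a sketch of the version split; the actual fix in molfeat may differ (e.g. pinning the Pydantic version instead).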

Something goes wrong during the download, at the hash-sum comparison stage.
The error is raised here: https://github.com/datamol-io/molfeat/blob/97855c6c7df2c46acb698d64eab60b08006c8936/molfeat/store/modelstore.py#L235C2-L235C2

@cwognum do you want to have a look?

I am getting a similar hash-sum error when trying to download the ChemGPT model.

I installed molfeat into its own conda environment using `conda install -c conda-forge molfeat`.

I'm not sure what's happening here.

I could reproduce the bug locally. As the error suggests, it appears the checksum computed when we created the artifact no longer matches the checksum we compute when downloading the artifacts locally. I'm not sure what causes this. Recreating the artifact using the ETL notebooks leads to an entirely new hashsum that doesn't match any of the artifacts in the thrown exception. I will investigate further!

@cwognum, did you manage to find the error? Looking at the code, my first take would be that the order of the files in the shasum changed for some reason...

if dm.fs.is_dir(filepath):
    files = list(dm.fs.glob(os.path.join(filepath, "**", "*")))
else:
    files = [filepath]

file_hash = hashlib.sha256()
for filepath in files:
    with fsspec.open(filepath) as f:
        file_hash.update(f.read())  # type: ignore
file_hash = file_hash.hexdigest()
return file_hash

Can you generate all permutations of the file-list order and check whether you can recover the original shasum?
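The brute-force check suggested above can be sketched roughly like this. This is a simplified stand-in operating on in-memory file contents rather than molfeat's actual fsspec reads, and the helper names (`hash_files`, `find_matching_order`) are illustrative, not molfeat APIs:

```python
import hashlib
import itertools

def hash_files(contents):
    """Hash a sequence of file contents (bytes) in the given order."""
    h = hashlib.sha256()
    for data in contents:
        h.update(data)
    return h.hexdigest()

def find_matching_order(contents, expected_hash):
    """Try every ordering of `contents`; return the first ordering that
    reproduces `expected_hash`, or None if no permutation matches."""
    for perm in itertools.permutations(contents):
        if hash_files(perm) == expected_hash:
            return perm
    return None
```

Note that if no permutation matches, the mismatch cannot be explained by file ordering alone, which is exactly the conclusion reached below.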

That's indeed one of the things I considered. I haven't tried all combinations though, that's a good idea!

@maclandrol I tried the different permutations, but none of them reproduces the actual hash.

OK, I think for now we should just sort the file names to ensure a consistent order, then recompute the hash and update it in the metadata.

I looked into a couple of things, but I'm not sure what started causing the issue, which is a very unsatisfying conclusion.

Some further notes for future reference:

  • The datamol fs module was updated: datamol-io/datamol#210
  • Things seemed to start breaking when fsspec released 2023.9.2, but there is nothing suspicious in the changelogs. Maybe the changes related to caching?

I can imagine that the above changes affect the order of the files, but not the number of files or their contents. Since trying all permutations of file paths doesn't recover the expected hash, I'm not sure what's happening here.

Anyways, the fix is luckily simple! I recreated the featurizer artifacts by simply rerunning the ETL notebooks and all seems to work again. I made a small PR to sort the to-be-hashed files: #86

Let me know if the issue persists or pops up again! If so, we will have to investigate further.

One final note: recreation of the ChemGPT-1.2B and ChemGPT-19M artifacts is still running.

Thank you @cwognum

The recent changes have fixed the issue on my side.

Thanks @cwognum I'll check and let you know asap if it also fixes the issue on my side!

Bug description

Hello! @cwognum

I'm currently using molfeat in a conda env, installed by running `pip install molfeat==0.8.8`. When I try to fetch pretrained HF transformers like GPT2-Zinc480, Roberta-Zinc480M-102M, and MolT-5, I get the following error message:

ModelStoreError: Can't retrieve model MolT5 from the store !

It seems related to caches and where the models are stored locally, what should I do?

How to reproduce the bug

from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer

transformer = PretrainedHFTransformer(kind='MolT5', notation='selfies', dtype=float)
my_smiles_list = ["CCO", "c1ccccc1"]  # any list of SMILES strings
features = transformer(my_smiles_list)

Error messages and logs

ModelStoreError                           Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\molfeat\store\loader.py:100, in PretrainedStoreModel._load_or_raise(cls, name, download_path, store, **kwargs)
     99     modelcard = store.search(name=name)[0]
--> 100     artifact_dir = store.download(modelcard, download_path, **kwargs)
    101 except Exception as e:

File ~\anaconda3\lib\site-packages\molfeat\store\modelstore.py:216, in ModelStore.download(self, modelcard, output_dir, chunk_size, force)
    215     mapper.fs.delete(output_dir, recursive=True)
--> 216     raise ModelStoreError(
    217         f"""The destination artifact at {model_dest_path} has a different sha256sum ({cache_sha256sum}) """
    218         f"""than the Remote artifact sha256sum ({modelcard.sha256sum}). The destination artifact has been removed !"""
    219     )
    221 return output_dir
ModelStoreError: The destination artifact at C:\Users\dd\AppData\Local\molfeat\molfeat\Cache/MolT5/model.save has a different sha256sum (e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855) than the Remote artifact sha256sum (e0537549289bfffc9ba6a5fb17c5b8d031e1b04a17555fd8f6494ebe3ce79395). The destination artifact has been removed !

Environment

#- Molfeat version (e.g., 0.1.0): 0.8.8
#- Python version (e.g., 1.10.0): 3.10.9
#- RDKit version (e.g., 2022.09.5):  2023.03.1
#- scikit-learn version (e.g.,  1.2.1): 1.2.1
#- OS (e.g., Linux): Windows 10 22H2
#- How you installed Molfeat (`conda`, `pip`, source): pip

Hi @dawndarkmusic, thanks for reporting! Could you try upgrading to the latest molfeat version? We might have broken backwards compatibility with #86 by ordering the to-be-hashed files.

Ping @maclandrol - If we end up needing some more robust model / data versioning, I recently came across https://github.com/iterative/dvc which seems pretty powerful!

Hello @cwognum
I've upgraded molfeat to version 0.9.5 but still get the same issue with the same code.
Here are some screenshots from running the code: it shows the download progress bar every time and then raises the error. As far as I remember, PretrainedHFTransformer didn't show the progress bar before. Should I restart the computer and test again? Any help will be greatly appreciated! Thank you


Ping @maclandrol - If we end up needing some more robust model / data versioning, I recently came across https://github.com/iterative/dvc which seems pretty powerful!

Yes, moving away from the custom model store would be a good idea. We are getting too many issues related to GCS.

@dawndarkmusic can you follow the instructions here to delete your cache and try again?

#29 (comment)

In your case, it's better to clear the whole cache directory and then restart your Python runtime.

import datamol as dm
import platformdirs

# delete the cache dir
path_dir = platformdirs.user_cache_dir("molfeat")
mapper = dm.fs.get_mapper(path_dir)
mapper.fs.delete(path_dir, recursive=True)

Hello @maclandrol

I've tried clearing the whole cache directory and restarting, but I still get the same issue.
Here is a screen recording of the situation: the pretrained transformer disappears from the cache folder once the error pops up. Sorry for bothering you and @cwognum.


Thanks @dawndarkmusic, this is very strange. Even after deleting the cache and restarting the interpreter, you're having the same issue?

Some of the functions are cached with functools, so data is kept in memory. Restarting the interpreter should normally purge that data too. @cwognum, can you handle this?

Hello @maclandrol

If deleting and restarting the interpreter means shutting down the whole Jupyter notebook and restarting the kernel, then yes, I'm still having the issue and I can't figure out the reason.
However, after installing molfeat on another PC and running the code again, everything went fine. I guess it's just some trouble on my PC; I'll try to figure it out myself. Sorry for bothering.

@cwognum @maclandrol It seems this issue can be closed.