ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models

Home Page: http://ludwig.ai


Unable to use a tokenizer with auto_transformer

sergsb opened this issue · comments

I want to use this model as an encoder. As you can see from its description, the model can be loaded like:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

I try to load it in Ludwig using:

encoder: auto_transformer
   pretrained_model_name_or_path: ibm/MoLFormer-XL-both-10pct

It results in RuntimeError: Caught exception during model preprocessing: Tokenizer class MolformerTokenizer does not exist or is not currently imported. This is not surprising, because the model does not expose a specific MolformerTokenizer class; its tokenizer is meant to be loaded via AutoTokenizer instead.

However, the documentation says: "If a text feature's encoder specifies a huggingface model, then the tokenizer for that model will be used automatically."

How can I load the tokenizer for this model?

I found out that the problem is trust_remote_code, which is also required when loading the tokenizer.
See also #3632.

Hi @sergsb,

Thanks for sharing your experience.

The Ludwig team is focused on building first-class support for natively supported models on HF. As I understand it, supporting models that require trust_remote_code=True is tenable, but it carries other risks that need to be thought through.

CC: @arnavgarg1

Hi @justinxzhao,

Thanks for the answer. Maybe an option would be to introduce a global config parameter, trust_remote_code, and pass it to HF models and tokenizers?
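A minimal sketch of what that could look like. The build_loader_kwargs helper and the config shape are hypothetical illustrations, not Ludwig's actual internals: the idea is just that one global flag is read once and forwarded to both from_pretrained calls.

```python
# Hypothetical sketch: thread a single global trust_remote_code flag from a
# Ludwig-style config dict into the kwargs shared by the HF model and
# tokenizer loaders. Names here are illustrative only.

def build_loader_kwargs(config: dict) -> dict:
    """Collect kwargs to pass to both AutoModel/AutoTokenizer.from_pretrained."""
    kwargs = {}
    if config.get("trust_remote_code", False):
        kwargs["trust_remote_code"] = True
    return kwargs

config = {"trust_remote_code": True}
kwargs = build_loader_kwargs(config)
# The same kwargs would then be forwarded to both loaders, e.g.:
# AutoModel.from_pretrained(name, **kwargs)
# AutoTokenizer.from_pretrained(name, **kwargs)
```

This way the flag only has to be set once, instead of being plumbed separately into the model and tokenizer code paths.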

@sergsb that seems reasonable to me. I think that's what @arnavgarg1 was going for in #3632, specifically here.