ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models

Home Page: http://ludwig.ai


Unable to use a tokenizer with auto_transformer

sergsb opened this issue · comments

I want to use this model as an encoder. As you can see from its description, the model can be loaded like:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

I try to load it in Ludwig using:

encoder: auto_transformer
   pretrained_model_name_or_path: ibm/MoLFormer-XL-both-10pct

It results in RuntimeError: Caught exception during model preprocessing: Tokenizer class MolformerTokenizer does not exist or is not currently imported. This is not surprising, because the model does not expose a specific MolformerTokenizer class; its tokenizer is meant to be loaded via AutoTokenizer instead.

However, the documentation says: "If a text feature's encoder specifies a huggingface model, then the tokenizer for that model will be used automatically."

How can I load the tokenizer for this model?

I found out that the problem is trust_remote_code, which is also required when loading the tokenizer.
See also #3632.

Hi @sergsb,

Thanks for sharing your experience.

The Ludwig team is focused on building first-class support for natively supported models on HF. As I understand it, supporting models that require trust_remote_code=True is tenable, but it carries other risks that need to be thought through.

CC: @arnavgarg1

Hi @justinxzhao,

Thanks for the answer. Maybe an option would be to introduce a global config parameter, trust_remote_code, and pass it to HF models and tokenizers?
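A minimal sketch of what that could look like. The build_loader_kwargs helper and the config shape are hypothetical illustrations, not Ludwig's actual internals: the idea is just that one global flag is read once and forwarded to both from_pretrained calls.

```python
# Hypothetical sketch: thread a single global trust_remote_code flag from a
# Ludwig-style config dict into the kwargs shared by the HF model and
# tokenizer loaders. Names here are illustrative only.

def build_loader_kwargs(config: dict) -> dict:
    """Collect kwargs to pass to both AutoModel/AutoTokenizer.from_pretrained."""
    kwargs = {}
    if config.get("trust_remote_code", False):
        kwargs["trust_remote_code"] = True
    return kwargs

config = {"trust_remote_code": True}
kwargs = build_loader_kwargs(config)
# The same kwargs would then be forwarded to both loaders, e.g.:
# AutoModel.from_pretrained(name, **kwargs)
# AutoTokenizer.from_pretrained(name, **kwargs)
```

This way the flag only has to be set once, instead of being plumbed separately into the model and tokenizer code paths.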

@sergsb that seems reasonable to me. I think that's what @arnavgarg1 was going for in #3632, specifically here.