Unable to use a tokenizer with auto_transformer
sergsb opened this issue · comments
I want to use this model as an encoder. As the model card describes, it can be loaded with:

```python
model = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)
```
I try to load it in Ludwig using:

```yaml
encoder: auto_transformer
pretrained_model_name_or_path: ibm/MoLFormer-XL-both-10pct
```
This results in:

```
RuntimeError: Caught exception during model preprocessing: Tokenizer class MolformerTokenizer does not exist or is not currently imported.
```

This is not surprising, because this model does not ship a dedicated MolformerTokenizer class in the transformers library; it relies on AutoTokenizer instead.
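To illustrate the mechanism behind the error, here is a minimal sketch (not the actual transformers source): the model's config names a custom tokenizer class that lives in the model repository rather than in the library, so without `trust_remote_code=True` the local class lookup fails. The registry contents and function name below are illustrative only.

```python
# Illustrative sketch of AutoTokenizer-style class resolution.
# LOCAL_TOKENIZERS stands in for the classes bundled with the library;
# a custom class shipped as remote code (e.g. MolformerTokenizer) is
# not in it, so the lookup fails unless remote code is trusted.

LOCAL_TOKENIZERS = {"BertTokenizer": object, "GPT2Tokenizer": object}

def load_tokenizer(class_name: str, trust_remote_code: bool = False):
    if class_name in LOCAL_TOKENIZERS:
        return LOCAL_TOKENIZERS[class_name]
    if trust_remote_code:
        # The real loader would download and import the class
        # from the model repository here.
        return object  # stand-in for the dynamically loaded class
    raise ValueError(
        f"Tokenizer class {class_name} does not exist or is not currently imported."
    )
```

With `trust_remote_code=False` (the default), requesting `MolformerTokenizer` raises the same kind of error seen above; passing `trust_remote_code=True` takes the remote-code path instead.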
However, the documentation says that "If a text feature's encoder specifies a huggingface model, then the tokenizer for that model will be used automatically."
How can I load the tokenizer for this model?
I found out that the problem is with trust_remote_code, which is also mandatory for loading the tokenizer.
See also #3632.
Hi @sergsb,
Thanks for sharing your experience.
The Ludwig team is focused on building first-class support for natively supported models on HF. As I understand it, supporting models that require trust_remote_code=True is tenable, but it carries other risks that need to be thought through.
CC: @arnavgarg1
Hi @justinxzhao,
Thanks for the answer. Maybe an option would be to introduce a global config parameter, trust_remote_code, and pass it through to the HF models and tokenizers?
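As a sketch of what that proposal might look like, here is a hypothetical Ludwig config; the top-level `trust_remote_code` key does not exist today and is shown only to illustrate the idea:

```yaml
# Hypothetical config sketch; `trust_remote_code` is the proposed
# global parameter, not an existing Ludwig option.
input_features:
  - name: smiles
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: ibm/MoLFormer-XL-both-10pct
trust_remote_code: true  # would be forwarded to both the model and the tokenizer
```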
@sergsb that seems reasonable to me. I think that's what @arnavgarg1 was going for in #3632, specifically here.