google / maxtext

A simple, performant and scalable Jax LLM!

[Question] convert HF tokenizer to maxtext tokenizer?

YannDubs opened this issue

llama_or_mistral_ckpt.py provides the code to convert LLaMA/Mistral weights to MaxText ones. Is there a script to do the same for the tokenizer, and more generally for any HF tokenizer?

thanks!

For Mistral, download their tokenizer from https://github.com/mistralai/mistral-src. No conversion is needed.

For the Mistral tokenizer, I downloaded their model using wget https://models.mistralcdn.com/mistral-7b-v0-1/mistral-7B-v0.1.tar. After that, can I put the extracted mistral-7B-v0.1/tokenizer.model directly under maxtext/assets, or is anything else needed?

Thank you very much for your time and help!

@LeoXinhaoLee using tokenizer_path="mistral-7B-v0.1/tokenizer.model" worked for me. Closing as a result!
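
For anyone scripting the steps above, here is a minimal sketch that pulls only tokenizer.model out of an already-downloaded model archive and formats the tokenizer_path override reported to work in this thread. The helper names extract_tokenizer and tokenizer_override are hypothetical, not part of MaxText, and the sketch assumes the archive contains a regular file named tokenizer.model. How the override is passed to a MaxText entry point is not shown, since the thread does not spell it out.

```python
import tarfile
from pathlib import Path


def extract_tokenizer(tar_path, dest_dir, file_name="tokenizer.model"):
    """Extract only the tokenizer file from a model archive and return its path.

    Hypothetical helper: assumes the archive holds a regular file whose
    base name is `file_name` (e.g. mistral-7B-v0.1/tokenizer.model).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            # Match on the base name so the top-level folder name does not matter.
            if member.isfile() and Path(member.name).name == file_name:
                target = dest / file_name
                # Copy the bytes ourselves instead of trusting paths inside the archive.
                target.write_bytes(tar.extractfile(member).read())
                return target
    raise FileNotFoundError(f"{file_name} not found in {tar_path}")


def tokenizer_override(tokenizer_file):
    """Format the key=value setting used in this thread (tokenizer_path=...)."""
    return f"tokenizer_path={Path(tokenizer_file).as_posix()}"
```

Calling extract_tokenizer("mistral-7B-v0.1.tar", "mistral-7B-v0.1") and then tokenizer_override on the returned path should give a string of the form tokenizer_path=mistral-7B-v0.1/tokenizer.model, matching the value quoted above.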