google / maxtext

A simple, performant and scalable Jax LLM!

Converting checkpoints

peregilk opened this issue · comments

commented

Are there any scripts available for converting trained Gemma/Llama/Mistral MaxText checkpoints to HuggingFace?

Hi @peregilk, there isn't one yet, but we will add one very soon! Thanks for your patience.

commented

@A9isha Thanks a lot for the answer. Really looking forward to this.

commented

Sorry for bothering you again with this @A9isha. Do you have a rough estimate of when the HF conversion will be ready?

Hi @peregilk ,

Sorry for the delay. We have a PR in the works: #581

If you are up for a bit of an experiment, would you like to give it a shot and let us know if you hit any issues with this script?

commented

Awesome. I'll give it a shot tomorrow and report back.

commented

Hi @A9isha,
I have now given it a try. I had some minor issues before I ran into a major one. I'll report the small ones as well, since they are mainly related to documentation; maybe updating the docs will save others from the same problems.

I have a checkpoint saved in:
gs://mybucket/north_mistral_warm_norwegian/checkpoints/150000

This is continued training of a Mistral-7b model on a Norwegian dataset. By default it has saved checkpoints every 10k steps; I am targeting the last one.

Your comments refer to running MaxText/llama_or_mistral_ckpt.py first. I assume this is only needed when converting from the Meta checkpoints, and not in my case.

I am starting by creating and cloning an HF repo (where I plan to place the finished files) and a temp directory called /home/user/checkpoint/test-mistral-warm-nortoken.

I made two minor changes from the documentation here:

  • mistral -> mistral-7b (doc says to choose between mistral and llama)
  • Added "/items" to the end of the checkpoint-path

I am not really sure what the purpose of run_name is, but set it to "test".

My final command looks like this:
python MaxText/llama_or_mistral_orbax_to_huggingface.py MaxText/configs/base.yml base_output_directory=/home/user/checkpoint/test-mistral-warm-nortoken load_parameters_path=gs://mybucket/north_mistral_warm_norwegian/checkpoints/150000/items run_name=test model_name=mistral-7b hf_model_path=/home/user/test-mistral-warm-nortoken

This now runs for a couple of minutes. I see some warnings that might indicate errors:

Found 0 checkpoint steps in /home/user/checkpoint/test-mistral-warm-nortoken/test/checkpoints

and:

I0411 10:23:46.518555 140658746231872 checkpointer.py:168] Restoring item from gs://maxlog-eu/north_mistral_warm_norwegian/checkpoints/150000/items.
W0411 10:23:51.243256 140658746231872 transform_utils.py:229] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0411 10:23:51.243912 140658746231872 transform_utils.py:286] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/decoder_norm/scale, params/params/decoder/layers/mlp/wi_0/kernel, params/params/decoder/layers/mlp/wi_1/kernel, params/params/decoder/layers/mlp/wo/kernel, params/params/decoder/layers/post_self_attention_layer_norm/scale, params/params/decoder/layers/pre_self_attention_layer_norm/scale, params/params/decoder/layers/self_attention/key/kernel, params/params/decoder/layers/self_attention/out/kernel, params/params/decoder/layers/self_attention/query/kernel, params/params/decoder/layers/self_attention/value/kernel, params/params/decoder/logits_dense/kernel, params/params/token_embedder/embedding
I0411 10:23:51.244194 140658746231872 checkpointer.py:171] Finished restoring checkpoint from gs://maxlog-eu/north_mistral_warm_norwegian/checkpoints/150000/items.

A while after that, however, the conversion crashes with this message:

In input checkpoint Number of model params=7.242 billion
Traceback (most recent call last):
  File "/home/user/maxtext/MaxText/llama_or_mistral_orbax_to_huggingface.py", line 215, in <module>
    app.run(main)
  File "/home/user/.t5x/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/user/.t5x/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/user/maxtext/MaxText/llama_or_mistral_orbax_to_huggingface.py", line 211, in main
    convert_orbax_hf(hf_model_path, pyconfig.config)
  File "/home/user/maxtext/MaxText/llama_or_mistral_orbax_to_huggingface.py", line 198, in convert_orbax_hf
    new_hf_model_params = convert_state_to_hf(training_state, config.model_name)
  File "/home/user/maxtext/MaxText/llama_or_mistral_orbax_to_huggingface.py", line 119, in convert_state_to_hf
    hf_model_params["model.embed_tokens.weight"] = torch.tensor(
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
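
For anyone hitting the same TypeError: a quick way to see which leaves are the problem is to dump the dtypes of the restored tree just before the torch.tensor calls. A rough, untested sketch (assuming the state object exposes a params pytree, as the key names in the log above suggest):

import jax

def dump_leaf_dtypes(training_state):
  # Print the path and dtype of every leaf; object-dtype arrays are the ones
  # torch.tensor refuses to convert.
  leaves, _ = jax.tree_util.tree_flatten_with_path(training_state.params)
  for path, leaf in leaves:
    print(jax.tree_util.keystr(path), getattr(leaf, "dtype", type(leaf)))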

@peregilk

Right, I think this is caused by a recent breaking change in the way we generate MaxText's Orbax checkpoints.

#568

Could you please regenerate your MaxText checkpoint with the latest code (i.e., including PR #568) and try the script llama_or_mistral_orbax_to_huggingface.py again?

commented

@A9isha I did as you said: deleted MaxText, recloned, and reinstalled the requirements. Then I tried training with exactly the same commands, with a new run name.

I keep getting this error, both when initialising Gemma and Mistral:

Traceback (most recent call last):
  File "/home/perk/maxtext/MaxText/train.py", line 524, in <module>
    app.run(main)
  File "/home/perk/.local/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/perk/.local/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/perk/maxtext/MaxText/train.py", line 506, in main
    pyconfig.initialize(argv)
  File "/home/perk/maxtext/MaxText/pyconfig.py", line 391, in initialize
    _config = _HyperParameters(argv, **kwargs)
  File "/home/perk/maxtext/MaxText/pyconfig.py", line 205, in __init__
    _HyperParameters.user_init(raw_keys)
  File "/home/perk/maxtext/MaxText/pyconfig.py", line 238, in user_init
    calculate_global_batch_sizes(raw_keys)
  File "/home/perk/maxtext/MaxText/pyconfig.py", line 333, in calculate_global_batch_sizes
    expansion_factor_real_data = raw_keys['expansion_factor_real_data']
KeyError: 'expansion_factor_real_data'

Is this related? Or should I report as separate issue?

expansion_factor_real_data was added in this PR #187

But it has a default value in base.yml https://github.com/google/maxtext/blob/main/MaxText/configs/base.yml#L167
Could you check whether your current (re)cloned repo has this PR's updates, e.g. in base.yml?
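
A quick way to check, for example:

grep -n expansion_factor_real_data MaxText/configs/base.yml

If grep prints nothing, the config predates PR #187's update; the same check applies to any custom .yml that was copied from an older base.yml.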

commented

My bad. I tried replicating the experiment with my custom .yml and did not realise that there were updates in base.yml. The model is now training, and at the first checkpoint I'll be able to test the export again. I will report my results here. Thanks @A9isha

commented

A quick update, @A9isha. I tried converting the step-0 checkpoint that was generated at the start of training. That ran without any warnings or issues and seems to have produced PyTorch model files! Thanks! I will push to HF and test.
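
In case it is useful to others, this is roughly how I am sanity-checking the exported files locally before pushing (assuming a standard transformers install and that the script wrote a full config.json next to the weights):

from transformers import AutoModelForCausalLM

# Load the converted checkpoint from the local output directory and compare the
# parameter count against the ~7.24B that the conversion script reported.
model = AutoModelForCausalLM.from_pretrained("/home/user/test-mistral-warm-nortoken")
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")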

commented

@A9isha What about the tokenizers here? Is there a path to convert SentencePiece .model files to Hugging Face?

Update: I think this solves that issue, but I have not had time to test it thoroughly yet: https://github.com/NbAiLab/tokenizer-benchmark/blob/main/sentencepiececonverter/convert.sh
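
Another route that may work is to wrap the SentencePiece .model file directly in one of transformers' slow tokenizer classes and save it in HF format; a rough, untested sketch (the choice of LlamaTokenizer and the paths are assumptions, requires the sentencepiece package):

from transformers import LlamaTokenizer

# Wrap the SentencePiece model in a slow tokenizer and write it out in
# Hugging Face format (tokenizer.model, tokenizer_config.json, ...).
tok = LlamaTokenizer(vocab_file="tokenizer.model")
tok.save_pretrained("/home/user/test-mistral-warm-nortoken")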

commented

Hi @A9isha. I am trying to recreate my experiments here so that I am able to convert my models to HF. My first models were trained 3 weeks ago. If I understand correctly, there are also some updates to the conversion script here, so to restart the models I also need to run MaxText/llama_or_mistral_ckpt.py again.

My main sanity check here is if I am able to do a warm restart of the Mistral-7b model using the same tokenizer and a Norwegian c4-corpus from tfds. I am trying to use the exact same settings as earlier, though I see there are some changes to base.yml.

However, the result really puzzles me:
[image: loss curves from the warm-restart run]

The graphs should be self-explanatory.

I am training on v5e-128 with these parameters (a rough command sketch follows below):
per_device_batch_size=4
ici_fsdp_transpose_parallelism=16
remat_policy=minimal
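
For completeness, the run command is roughly of this shape (abridged; bucket, run name, checkpoint path, and dataset flags are placeholders for my setup):

python MaxText/train.py MaxText/configs/base.yml run_name=north_mistral_warm_norwegian base_output_directory=gs://mybucket model_name=mistral-7b load_parameters_path=gs://<bucket>/<path-to-converted-mistral-7b-checkpoint>/items per_device_batch_size=4 ici_fsdp_transpose_parallelism=16 remat_policy=minimal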

This might not be related to the checkpointing at all. Tell me if you want me to open a separate issue for it.

Hi @A9isha, does MaxText support the other way round now? That is, converting HF's Llama or Mistral weights to MaxText checkpoints. Thanks

@peregilk Apologies for the delayed response, I was OOO for some time.

My main sanity check here is if I am able to do a warm restart of the Mistral-7b model using the same tokenizer and a Norwegian c4-corpus from tfds
Yes, the loss curve seems terrible, but your idea is correct: you should definitely be able to use your tokenizer and dataset for finetuning the converted checkpoint.

If I understand correctly, there are also some updates to the conversion script here
The changes were not made to the checkpoint conversion script but to the way the checkpoints are written out; those changes would not have any effect on the loss, they are just cosmetic.

Let me know if you are able to investigate this further.

Hi @A9isha, does MaxText support the other way round now? That is, converting HF's Llama or Mistral weights to MaxText checkpoints. Thanks

We have the script llama_or_mistral_ckpt.py to convert the original PyTorch Llama2 checkpoint that Meta provides into a MaxText checkpoint.

You can see the usage for Llama2-7b here, for example.
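
Roughly, the invocation for Llama2-7b looks like this (flag names from memory; please double-check them against the script itself):

python3 MaxText/llama_or_mistral_ckpt.py --base-model-path <path-to-meta-llama2-7b-weights> --maxtext-model-path gs://<your-bucket>/llama2-7b-maxtext --model-size llama2-7b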

Thanks for the pointer @A9isha! I'm still wondering if there's a direct script for converting HF's Llama2-style weights to MaxText weights, since I might want to use another version of Llama2 trained by others and hosted on Hugging Face. Thanks!

I see; unfortunately, no, there isn't such a conversion script at the moment. It should be a modification of llama_or_mistral_ckpt. If you are interested, please feel free to send a PR.
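
The natural starting point for such a modification would be reading the weights from the Hub instead of Meta's .pth shards, e.g. something like this untested sketch (the model id is just an example):

import torch
from transformers import AutoModelForCausalLM

# Pull a Llama2-style model from the Hugging Face Hub and grab its state dict;
# the remaining work is remapping these keys and shapes onto the MaxText
# parameter tree that llama_or_mistral_ckpt.py builds from Meta's checkpoint.
hf_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
state_dict = hf_model.state_dict()
print(list(state_dict.keys())[:5])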

Thanks @A9isha, I'm working on it and will try to open a PR for it soon :)

commented

@A9isha Sorry for taking some time to reply. I had to give up on this: I am mainly doing experiments on dataset composition and needed something that worked and could be converted to HuggingFace, and I was unable to accomplish that with MaxText.

@peregilk I am very sorry to hear that. Does the new feature supporting HuggingFace datasets in MaxText help you with the data composition effort?
Please feel free to reopen/create a new issue with more findings.
Good luck! :)

commented

Yes. The added HuggingFace dataset support actually makes it super easy to try to replicate some of the experiments I am currently running on Levanter. It would be great to check this out; maybe some of the issues I was struggling with are fixed now. MaxText definitely seemed to be faster.
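
For reference, switching to the HF pipeline is just a handful of config flags on top of the usual run command, roughly like this (key names from my config and from memory; double-check them against base.yml):

python MaxText/train.py MaxText/configs/base.yml run_name=<run-name> base_output_directory=gs://<bucket> dataset_type=hf hf_path=<hf-dataset-id> tokenizer_path=<hf-tokenizer-id-or-path>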

commented

@A9isha I have now converted my experiments from Levanter to MaxText, and will continue using it. I am super impressed by the speed and stability of MaxText. Though it takes some steps, I am able to convert the results back into HuggingFace, and have verified that the results look good. Thanks a lot for implementing this!!

I am also using HF Datasets for training here, and it works super smoothly (thanks @aireenmei!).

With checkpoint conversion and dataset loading fixed, I have solved my main issues with using MaxText for my research (which is mostly about the effects of dataset composition). However, it would also be great to run experiments on the newest versions of the current models; I have opened issue #683 about that. Do you know the status here, @A9isha?

@A9isha is out of office. I believe Llama3 support is being worked on. cc @khatwanimohit