ludwig-ai / ludwig

Describe the bug

There seems to be a mismatch between how model.save() and LudwigModel.upload_to_hf_hub() handle model paths. A model saved into a directory with a custom name using model.save cannot be uploaded to HF Hub using the same directory name.

To Reproduce
Steps to reproduce the behavior:

Fine-tune a LLM (I used the Zephyr fine-tuning example from Predibase)
Save the fine-tuned model into a directory using model.save("finetuned-model")
Try to upload the model to HF Hub using LudwigModel.upload_to_hf_hub(MY_HF_MODEL_NAME, "finetuned-model")
See error:

File .../ludwig-finetune-llm/venv/lib/python3.11/site-packages/ludwig/utils/upload_utils.py:101, in BaseModelUpload._validate_upload_parameters(repo_id, model_path, repo_type, private, commit_message, commit_description)
     99 trained_model_artifacts_path = os.path.join(model_path, "model", "model_weights")
    100 if not os.path.exists(trained_model_artifacts_path):
--> 101     raise Exception(
    102         f"Model artifacts not found at {trained_model_artifacts_path}. "
    103         f"It is possible that model at '{model_path}' hasn't been trained yet, or something went"
    104         "wrong during training where the model's weights were not saved."
    105     )

Exception: Model artifacts not found at finetuned-model/model/model_weights. It is possible that model at 'finetuned-model' hasn't been trained yet, or something wentwrong during training where the model's weights were not saved.

Expected behavior

Expected the upload to succeed, since I gave the same path name, "finetuned-model", as a parameter both to model.save and to LudwigModel.upload_to_hf_hub.

Screenshots

n/a

Environment (please complete the following information):

OS: Linux
Version: AlmaLinux release 8.7 (Stone Smilodon)
Python version 3.11.5
Ludwig version 0.9.2

Additional context

After model.save, this is the file and directory structure created under the finetuned-model directory:

model_hyperparameters.json
training_set_metadata.json
model_weights/adapter_config.json
model_weights/adapter_model.safetensors
model_weights/README.md

But according to the error message, upload_to_hf_hub is checking for the existence of finetuned-model/model/model_weights. It doesn't exist (there is no intermediate directory called model) so this fails with the above error. Below is the relevant code. Note that model is always added to the path on line 99.

ludwig/ludwig/utils/upload_utils.py

Lines 98 to 105 in 51f38c5

    
           # Make sure the model is actually trained 
        
           trained_model_artifacts_path = os.path.join(model_path, "model", "model_weights") 
        
           if not os.path.exists(trained_model_artifacts_path): 
        
               raise Exception( 
        
                   f"Model artifacts not found at {trained_model_artifacts_path}. " 
        
                   f"It is possible that model at '{model_path}' hasn't been trained yet, or something went" 
        
                   "wrong during training where the model's weights were not saved." 
        
               )

FWIF, I was able to work around the problem by creating the directory finetuned-model/model and moving all files below that directory.

Thanks for reporting the issue @osma and I'm glad you were able to find a workaround. @alexsherstinsky Will be taking a look at it over the next few days

@sanjaydasgupta Has graciously agreed to work on this issue.

#take

After poking around a bit in the code-base here is what I learned:

Changing the implementation of Ludwig.api.LudwigModel.save() so that it first suffixes a literal /model to the provided save_path (an intuitive but naive solution) breaks 7 or 8 tests. Inspection of those tests reveals that save() has a twin -- predictably named load(). Also, save() and load() are designed to play well together as is.

Therefore, changing save() so that it works more intuitively with upload_to_hf_hub() will entail changes to load() and possibly other things -- probably be a lot of work. One alternative to save that effort is to improve the documentation of save() to clearly state that the save_path provided must end in a model. That same save_path can then be passed to upload_to_hf_hub() successfully.

@alexsherstinsky please let me know what you think.

Thanks for your research and explanation @sanjaydasgupta !

From a naïve user perspective, what surprised me is that upload_to_hf_hub() requires a literal model directory. So instead of changing the implementation of save() and load(), which undoubtedly would cause many problems, I would suggest instead looking at whether this requirement in upload_to_hf_hub() could be lifted. The simplest possible fix would be to remove "model" from line 99 of upload_utils.py, which I quoted in my OP.

@sanjaydasgupta and @osma Thank you very much for reporting this issue and the detailed analysis.

@sanjaydasgupta Is it possible to investigate the implementation from the first principles? Let us figure out what is happening first, and then discuss potential fixes with backward-compatibility in mind to avoid breaking changes.

We know that the default results relative directory is /results/api_experiment_run/model/model_weights where the next model will be saved into /results/api_experiment_run_1/model/model_weights and then next into /results/api_experiment_run_2/model/model_weights, and so on. This is what we pass to upload_to_hf_hub(). The reason for that is that HuggingFace expects only such files as:

adapter_config.json
adapter_model.safetensors
README.md

because this is all that is needed for loading the adapter. The configuration file, which you added in your previous PR is needed for Ludwig fine-tuning reproducibility concerns, but it is not needed for loading adapters.

So it looks like the custom name finetuned-model replaced the generic api_experiment_run, api_experiment_run_1, api_experiment_run_2, etc. directory names, with the rest of the structure (/model and model_weights) preserved. Thus, loading to HuggingFace would be done using:

LudwigModel.upload_to_hf_hub(MY_HF_MODEL_ID, "/results/finetuned-model/model/model_weights")

(Hence, I would prefer using the name finetuned_model as traditionally with the underscore for clarity when saving and loading the model -- because this is the path on the local filesystem, not an ID in a cloud storage provider.)

If we try it and it works, then it means that we have to perhaps update the documentation in order to avoid confusion.

Please let me know what you think. Thank you!

Hi @alexsherstinsky and @osma, thank you very much for your inputs. Allow me to add some more information about these two functions:

Both, LudwigModel.save() and LudwigModel.upload_to_hf_hub() take a path parameter as input. But the meaning of that parameter is different. Compare the published documentation for the two functions:

save()'s parameter save_path is defined as: path to the directory where the model is going to be saved. Both a JSON file containing the model architecture hyperparameters and checkpoints files containing model weights will be saved.
upload_to_hf_hub()'s parameter model_path is defined as: The path of the saved model. This is the top level directory where the models weights as well as other associated training artifacts are saved.

Despite the somewhat similar beginning of those two descriptions, they are different. Referring to the screenshot (see below) of a typical Ludwig run, upload_to_hf_hub() needs its model_path parameter to point to the api_experiment_run directory. But save() creates the contents of the model directory (as seen in the screenshot) in the directory pointed to by save_path.

This difference has existed since before PR-3965.

We can change the implementation of upload_to_hf_hub() to examine the contents of model_path, determine what it has been passed, and then do the right thing. That will also protect any existing code that uses it. upload_to_hf_hub() does not have any in-project clients, so it is easier and safer to change it's implementation rather than save()'s.

@sanjaydasgupta So it looks like the implementation is correct, it is just that possibly the documentation may need to be improved. The purpose of these functions is indeed different -- they are not symmetric. The save() method is Ludwig focused -- it saves the various artifacts that Ludwig creates, including the hyperparameters (the full config) and the actual model weights. However, the HuggingFace upload only concerns itself with the latter (and we recently added the config). So if we add any protection/checks, we should do it in a way that constitutes a non-breaking change for all users thus far. Thank you very much!

Thanks for your research and explanation @sanjaydasgupta !

From a naïve user perspective, what surprised me is that upload_to_hf_hub() requires a literal model directory. So instead of changing the implementation of save() and load(), which undoubtedly would cause many problems, I would suggest instead looking at whether this requirement in upload_to_hf_hub() could be lifted. The simplest possible fix would be to remove "model" from line 99 of upload_utils.py, which I quoted in my OP.

LudwigModel.upload_to_hf_hub() is a thin wrapper that just invokes ludwig.utils.upload_utils.HuggingFaceHub.upload(). All implementation, including parameter checks, occurs after that point. Unfortunately there appear to be 10 or more other execution paths that run through HuggingFaceHub.upload(), so removing the requirement that the directory be named model will likely have a large impact.

Hi @osma, I am trying to get a better understanding of your use-case.

It seems you had already fine-tuned an LLM, and then wanted to upload the fine-tuned model to HF. Normally, in this situation, calling upload_to_hf_hub() with a model_path pointing to the api_experiment_run directory is sufficient (it is not necessary to explicitly call save() before calling upload_to_hf_hub()). Am I missing something about your use-case?

@sanjaydasgupta and @alexsherstinsky met to discuss this and agreed that there is a design issue that needs to be studied deeper, potentially affecting "ludwig/utils/upload_utils.py" and modules that use it.

Thanks for the detailed discussion @alexsherstinsky.

@sanjaydasgupta

It seems you had already fine-tuned an LLM, and then wanted to upload the fine-tuned model to HF. Normally, in this situation, calling upload_to_hf_hub() with a model_path pointing to the api_experiment_run directory is sufficient (it is not necessary to explicitly call save() before calling upload_to_hf_hub()). Am I missing something about your use-case?

Yes, you are right. I was just learning to use Ludwig, trying to replace a brittle set of Python fine-tuning code (based on transformers, trl, bitsandbytes, peft...) with a higher level approach. So my mindset was that I first want to fine-tune a model, then save it (so it won't get lost after spending a long time fine-tuning it), then run some inference tests to make sure that the model does what I want to. I was looking at tutorials like the Zephyr example from Predibase (which is half Ludwig tutorial, half marketing of the Predibase API service, which seems to be a common pattern nowadays).

After a few iterations of testing, I was happy with the fine-tuned model, so I wanted to publish it on HF Hub. I looked at Ludwig API specs and found the promising upload_to_hf_hub method, so I tried to use it, and ended up in the situation described in the OP.

Nowhere in my thought process did I really think about the api_experiment_run directory. To me, that was some internal housekeeping stuff that Ludwig handles, not interesting unless I want to dig deeper into the details of the fine-tuning process (and my goal was rather to get away from lower level stuff into a higher abstraction level). I'm used to other tools and APIs for machine learning such as scikit-learn, Keras, torch, trl etc. and maybe my expectations were influenced by how those work; for example model.train() and model.save() are pretty common operations in most if not all of these.

I hope that explains my thinking. I'm just a beginner, trying to get an understanding of tools like Ludwig. Maybe reading some other documentation would have helped me avoid the problem. But I think you may have some deeper conceptual issues that could be solved to make your tool easier to learn and apply.

@osma Your criticism is absolutely valid. Last night (Pacific Time), @sanjaydasgupta and I had a call to brainstorm how this can be improved. Your filing this issue and continued comments have been very helpful, and so your participation going forward is welcome. One idea that @sanjaydasgupta is currently evaluating is: What if we just modify _validate_upload_parameters() in ludwig/utils/upload_utils.py, such that we do not hard code the path in line 102 (trained_model_artifacts_path = os.path.join(model_path, "model", "model_weights")) but instead only use model_path -- and then enhance the validation so as to make sure that the files we need from the subdirectories model and model_weights are there. In fact, we already have the validation for the model_weights directory contents; we would just need to add the validation for the directory one above (i.e., model) for things that we care about, such as the hyperparameter configuration, which @sanjaydasgupta added last week. Then any model_path would work as long as its contents pass this validation; in this case, modifying tests would be acceptable. What do you both think? Thank you.

PR #3977 addresses this issue; hence, closing. Thank you @osma for reporting it and to @sanjaydasgupta for fixing it!

	# Make sure the model is actually trained
	trained_model_artifacts_path = os.path.join(model_path, "model", "model_weights")
	if not os.path.exists(trained_model_artifacts_path):
	raise Exception(
	f"Model artifacts not found at {trained_model_artifacts_path}. "
	f"It is possible that model at '{model_path}' hasn't been trained yet, or something went"
	"wrong during training where the model's weights were not saved."
	)

upload_to_hf_hub model path mismatch with model.save