Can't register a new version of a model
Eiley2 opened this issue
Hi there,
I'm currently trying to implement this stack at my workplace and I've hit an issue; I'd like to understand whether I'm doing something wrong or whether it's a configuration error.

Since we only have two workspaces, one for prod and the other for QA and development, with a shared Unity Catalog, my idea was to name the catalog "ml-ops", use the model's name (in this case "prometheus") as the schema, and then register the models within each schema per environment: prod-prometheus-model, staging-prometheus-model, and dev-prometheus-model.
To do this, I made the following modifications to these files:
`ml-artifacts-asset.yml`:

```yaml
resources:
  registered_models:
    model:
      name: ${bundle.target}-${var.model_name}
      catalog_name: ml-ops
      schema_name: prometheus-model
      <<: *grants
      depends_on:
        - resources.jobs.model_training_job.id
        - resources.jobs.batch_inference_job.id
```
`databricks.yml`:

```yaml
bundle:
  name: ${bundle.target}-${var.model_name}

variables:
  experiment_name:
    description: Experiment name for model training.
    default: /Users/${workspace.current_user.userName}/${bundle.target}-prometheus-experiment
  model_name:
    description: Model name for model training.
    default: prometheus-model
```
`model-workflow-asset.yml`:

```yaml
resources:
  jobs:
    model_training_job:
      name: ${bundle.target}-${var.model_name}-model-training-job
      job_clusters:
        - job_cluster_key: model_training_job_cluster
          <<: *new_cluster
      tasks:
        - task_key: Train
          job_cluster_key: model_training_job_cluster
          notebook_task:
            notebook_path: ../training/notebooks/Train.py
            base_parameters:
              env: ${bundle.target}
              # TODO: Update training_data_path
              training_data_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled
              experiment_name: ${var.experiment_name}
              # git source information of the current ML asset deployment;
              # it will be persisted as part of the workflow run
              model_name: ml-ops.${var.model_name}.${bundle.target}-${var.model_name}
              git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit}
```
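For reference, with `-t staging` and the variable defaults above, I believe the substitutions resolve roughly like this (a sketch, assuming `model_name` keeps its default of `prometheus-model`):

```yaml
# Resolved values for `databricks bundle deploy -t staging` (sketch):
name: staging-prometheus-model   # ${bundle.target}-${var.model_name}
catalog_name: ml-ops
schema_name: prometheus-model
# Three-level Unity Catalog name passed to the notebook:
#   ml-ops.prometheus-model.staging-prometheus-model
```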
However, after the first deployment, when the CI/CD pipeline runs:

```
databricks bundle deploy -t staging
```

I get the following error:

```
Updating deployment state...
Error: terraform apply: exit status 1

Error: cannot create registered model: Function or Model 'ml-ops.prometheus-model.staging-prometheus-model' already exists

  with databricks_registered_model.model,
  on bundle.tf.json line 188, in resource.databricks_registered_model.model:
 188: }
```
Other than that, everything runs perfectly and I can serve the model without any trouble. Also, I'm using the demo model; I haven't implemented our own yet.

I'm not sure if I'm doing something wrong. Any guidance would be appreciated.
If it's worth anything, this is the `requirements.txt` I'm using:

```
mlflow==2.7.1
numpy>=1.23.0
pandas>=1.4.3
scikit-learn>=1.1.1
matplotlib>=3.5.2
Jinja2==3.0.3
pyspark~=3.3.0
pytz~=2022.2.1
```
@Eiley2 thanks for opening this issue! A few things:
- The error says that the model already exists. Would you mind confirming whether it does, and if so, deleting the model if that's safe? This error prevents accidentally overwriting models that already exist when using `bundle deploy`.
- Above you said the schema name is `"prometheus"`, but in the code I see the schema name is `"prometheus-model"`. Is that intentional?
- I wouldn't rename the `bundle: name:` in `databricks.yml` to something variable-dependent, since that is the name of the entire bundle as a whole. By default, we use the project name for the bundle name and would recommend doing that as well to prevent unintended behavior down the line.
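Concretely, that last point would look something like this in `databricks.yml` (a sketch; using `prometheus` as the project name is an assumption based on this thread):

```yaml
bundle:
  # Keep the bundle name static (the project name), not target-dependent:
  name: prometheus  # hypothetical project name

variables:
  model_name:
    description: Model name for model training.
    default: prometheus-model
```

The per-target prefix can still live in resource names (like `name: ${bundle.target}-${var.model_name}` on the registered model), while the bundle name itself stays stable across deploys.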
Thanks for your answer @arpitjasa-db!
- Yeah, it does. When I delete it and deploy for the first time with `databricks bundle deploy -t dev/staging/prod`, it works perfectly, but when GitHub Actions deploys it again to refresh the code, it throws the error saying it already exists.
- Yeah, sorry, the schema is correct, as you can see in the picture.
- Gotcha, I'll rename it to prometheus.
Oh, you're not using the CI/CD workflows for deployment? Either way, this should be working. What's supposed to happen is that when you run `databricks bundle deploy`, it deploys the resources and marks them as having come from this bundle using a state file, so subsequent deploys check that state and only deploy the necessary resources, overwriting as needed.

What seems to be happening is that after deploying, the CLI doesn't recognize that this resource was created from this bundle and instead thinks it was created elsewhere, which is why it fails with the error I mentioned above, for safety.

Are you running the command from the same directory each time? If so, would you mind opening the `.bundle/` subdirectory that was created in that directory and listing out all the contents of its subdirectories?
Your suggestion got me thinking, so I went ahead and deleted the `.databricks` folder with everything in it, and that did the trick!

It looks like, while deploying from the CI/CD workflows and trying to debug, I somehow mixed my own credentials in my terminal with the service account's credentials when I tried to deploy. After deleting the folder, the bundle was created from scratch, the CLI recognized that the model came from this bundle, and the error went away. Thanks for your help!
Awesome, glad to hear it!