databricks / databricks-asset-bundles-dais2023

Upload jar library without syncing other files/folders

MrJSmeets opened this issue · comments

Hi,

I would like to upload an existing JAR as a dependent library to a job/workflow without having to sync any other files/folders.
Currently, all files/folders are always synchronized, but I don't want that; I only need the JAR in the target/scala-2.12 folder.

sync:
  include:
    - target/scala-2.12/*.jar

Folder structure:

.
├── README.md
├── build.sbt
├── databricks.yml
├── src
│   └── main
│       ├── resources
│       │   └── ...
│       └── scala
│           └── ...
└── target
    ├── global-logging
    └── scala-2.12
        └── xxxxxxxxx-assembly-x.x.x.jar

With dbx, this was possible by using file references.
What is the recommended way to do this via DAB, without syncing other files/folders?

I expected this to be possible via artifacts, but that seems to be (for now?) only intended for Python wheels.

By default, DABs exclude files and folders from syncing based on the .gitignore file if you're using Git.
If you're not using Git, or don't want to include certain files in .gitignore, you can use the sync.exclude property.

sync:
  exclude:
    - src/**/*
    - databricks.yml
    - build.sbt
    - target/global-logging/*

Thanks @andrewnester, then it seems that uploading a JAR via this synchronisation method is not really the right way for Scala projects. I will instead upload my JAR to ADLS/S3 and put the databricks.yml file in a subfolder so I don't have to clutter my job definitions with this list of excludes.

Hopefully something similar to the file references at dbx will be available in the future. That made it very useful to upload a JAR with the job definition during local development.
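
For anyone reading along, a minimal sketch of that cloud-storage approach, assuming the assembly JAR has already been published to S3 by a separate build step (the bucket, main class, and cluster settings below are placeholders, not taken from this thread):

resources:
  jobs:
    my_scala_job:
      name: My Scala Job
      tasks:
        - task_key: main
          spark_jar_task:
            main_class_name: com.example.Main   # placeholder main class
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge             # placeholder node type
            num_workers: 1
          libraries:
            # JAR uploaded to cloud storage outside of the bundle deploy
            - jar: s3://my-bucket/libs/xxxxxxxxx-assembly-x.x.x.jar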

Have there been any updates on this feature? We are also struggling to manage deployments of JAR files as part of the DAB deployment. It doesn't seem to work via include, because there is no support for JAR files in artifacts and it complains about not having a relevant artifact specification.

@mike-smith-bb DABs already support building and automatically uploading JARs, so the configuration can look something like this:

artifacts:
  my_java_project:
    path: ./path/to/project
    build: "sbt package"
    type: jar
    files:
      - source: ./path/to/project/targets/*.jar

Note that you have to explicitly specify the files source section to point to where the built JARs are located.

Also, please make sure you're using the latest CLI version (0.217.1 as of now).

If you still experience any issues, feel free to open an issue in the CLI repo here: https://github.com/databricks/cli/issues

Thanks, @andrewnester. Your suggestion, I think, assumes that we are building the artifact as part of the DAB deployment. What if the JAR file is built by a different process and we simply want to include it in the job cluster that gets created, storing it in the DAB structure? Is this supported?

@mike-smith-bb yes, indeed.

Then just using the sync include section should work. Does it work for you?

sync:
  include:
    - target/scala-2.12/**/*.jar

Paths can be defined in .gitignore-like syntax, so it should be flexible enough to match only what you need.

@andrewnester — This doesn't seem to work with jar files. Even if I sync the file like you showed, I can't add those jar files as dependencies.

If I do

sync:
  include:
    - resources/lib/*
...
resources:
  jobs:
    my_job:
      name: My Job
      tasks:
        - task_key: mytask
          notebook_task:
            notebook_path: ../src/mymodule/myfile.py
          job_cluster_key: job_cluster
          libraries:
            - jar: /Workspace/${workspace.file_path}/resources/lib/my_custom.jar

I get this error:

[screenshot of the error]

I'm assuming it's because of this:

[screenshot]

Do we have any options available to add a jar dependency from source like we used to do with dbx?

We found the same and came to the same conclusion. Seems like we need a pre-deploy step that can inject the jar/dependencies into a volume or cloud storage and also manage the location of the dependency through the sync and dependency configs. Interested in other approaches here.
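
As a sketch of that shape, assuming a pre-deploy step has already copied the JAR into a hypothetical Unity Catalog Volume (the catalog/schema/volume names are placeholders), the job definition would then reference the Volume path directly instead of a synced workspace file:

resources:
  jobs:
    my_job:
      name: My Job
      tasks:
        - task_key: mytask
          notebook_task:
            notebook_path: ../src/mymodule/myfile.py
          job_cluster_key: job_cluster   # assumes a job cluster defined elsewhere in the bundle
          libraries:
            # JAR copied into the Volume by a separate pre-deploy step
            - jar: /Volumes/main/default/libs/my_custom.jar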

Ok, I will do that in the meantime and see how that goes.

I'm having the same issue. I have a Databricks Volume for JAR libraries. My current workaround is just using the AWS CLI to upload the files before deploying the bundle. However, what if it's an internal/managed Volume? I think the Databricks CLI could include an option to upload a file to a Volume.

Since version 0.224.0, DABs support uploading JARs to UC Volumes; you can find an example here: https://github.com/databricks/bundle-examples/blob/main/knowledge_base/spark_jar_task/databricks.yml

You can omit the artifacts section entirely if you don't want the JAR to be rebuilt automatically as part of the deploy, and just deploy the one that is referenced from the libraries field.
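
Roughly, that pattern looks like the sketch below, with a hypothetical Volume path and placeholder names substituted in (this is not the exact example file); the locally referenced JAR gets uploaded to artifact_path during deploy:

workspace:
  # local libraries referenced below are uploaded here on deploy
  artifact_path: /Volumes/main/default/libs   # placeholder Volume path

resources:
  jobs:
    spark_jar_job:
      name: Spark JAR job
      tasks:
        - task_key: main
          spark_jar_task:
            main_class_name: com.example.Main   # placeholder main class
          job_cluster_key: job_cluster   # assumes a job cluster defined elsewhere
          libraries:
            # pre-built JAR referenced by local path; no artifacts section needed
            - jar: ./target/scala-2.12/xxxxxxxxx-assembly-x.x.x.jar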

Thanks for sharing @andrewnester.

What happens if I want to upload the JAR and deploy my workflows using the same bundle? If I set the artifact_path to the UC Volume, then the whole bundle will be deployed there, no? Though perhaps that wouldn't be a bad thing...

@jmatias no, not really. artifact_path is only the path that local libraries are uploaded to; DABs don't yet support deploying the whole bundle to Volumes (that would be the file_path config).
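
Put differently, the two settings are independent; a sketch of how they can sit side by side, with placeholder paths (and file_path staying a workspace path):

workspace:
  # where the bundle's synced files live (workspace paths only)
  file_path: ${workspace.root_path}/files
  # where locally referenced libraries such as JARs are uploaded
  artifact_path: /Volumes/main/default/libs   # placeholder Volume path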