PowerShell Tools for Deploying Databricks Solutions in Azure. These cmdlets help you build continuous delivery pipelines and bring better source control to your scripts.
The CI/CD story in Databricks is complicated. The workspace is designed very much for collaborative working inside it, which probably works well for people doing data science and ad hoc queries. But for data engineers who want build and deployment processes, this is usually not good enough.
These tools are designed to help.
These tools should allow you to develop using your preferred method, whether that is notebooks in the Databricks workspace, or Python or Scala/Java code written in your local IDE.
You can also use these tools to promote code between environments as part of your build and deploy pipelines.
The tools are now being extended to include more management functions, such as creating, starting and stopping your clusters.
Supports PowerShell Core 6.1+
See the Wiki for command help.
The module is available from the PowerShell Gallery: https://www.powershellgallery.com/packages/azure.databricks.cicd.tools
Install-Module -Name azure.databricks.cicd.tools
or
Save-Module -Name azure.databricks.cicd.tools -Path \psmodules
Followed by:
Import-Module -Name azure.databricks.cicd.tools
To upgrade from a previous version:
Update-Module -Name azure.databricks.cicd.tools
Deploys a secret value to Databricks; this can be a storage account key, a password, etc. The secret must be created within a scope, which will be created for you if it does not exist.
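A minimal sketch of deploying a secret from a release pipeline. The cmdlet name (Set-DatabricksSecret) and parameter names shown here are illustrative assumptions; check the Wiki for the exact signature in your installed version.

```powershell
# Illustrative only - verify the cmdlet name and parameters against the Wiki.
# Creates the scope if it does not already exist, then writes the secret into it.
Set-DatabricksSecret -BearerToken $BearerToken -Region "westeurope" `
    -ScopeName "DataLake" `
    -SecretName "StorageAccountKey" `
    -SecretValue $StorageKey
```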
The following commands exist:
- Get-DatabricksClusters - Returns a list of all clusters in your workspace
- New-DatabricksCluster - Creates/Updates a cluster
- Start-DatabricksCluster
- Stop-DatabricksCluster
- Update-DatabricksClusterResize - Modify the number of scale workers
- Remove-DatabricksCluster - Deletes your cluster
- Get-DatabricksNodeTypes - Returns a list of valid node types (such as Standard_DS3_v2)
- Get-DatabricksSparkVersions - Returns a list of valid Spark versions
Please see the scripts for details of the parameters. Examples are available in the Tests folder.
These have been designed with CI/CD in mind - i.e. they should all be idempotent.
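A sketch of a typical create-and-start sequence in a pipeline. The parameter names are based on the command list above and are assumptions; confirm them with Get-Help New-DatabricksCluster. The Spark version and node type values are just examples.

```powershell
# Illustrative only - confirm parameter names with Get-Help New-DatabricksCluster.
# New-DatabricksCluster is idempotent, so re-running a deployment updates the cluster.
New-DatabricksCluster -BearerToken $BearerToken -Region "westeurope" `
    -ClusterName "ETLCluster" `
    -SparkVersion "5.3.x-scala2.11" `
    -NodeType "Standard_DS3_v2" `
    -MinNumberOfWorkers 1 -MaxNumberOfWorkers 3

# Start the cluster once it has been created/updated.
Start-DatabricksCluster -BearerToken $BearerToken -Region "westeurope" -ClusterName "ETLCluster"
```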
- Add-DatabricksDBFSFile - Upload a file or folder to DBFS
- Remove-DatabricksDBFSItem - Delete a file or folder
- Get-DatabricksDBFSFolder - List folder contents
Add-DatabricksDBFSFile can be used as part of a CI/CD pipeline to upload your source code or dependent libraries to DBFS. You can also use it to deploy initialisation scripts for your clusters.
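A hedged sketch of uploading build artefacts to DBFS. The parameter names below are assumptions made for illustration; see the Wiki for the exact signature of Add-DatabricksDBFSFile.

```powershell
# Illustrative only - parameter names are an assumption; see the Wiki for the exact signature.
# Upload a local folder of library wheels to a DBFS folder used by your clusters.
Add-DatabricksDBFSFile -BearerToken $BearerToken -Region "westeurope" `
    -LocalRootFolder "./libs" -FilePattern "*.whl" `
    -TargetLocation "/libraries/myproject"
```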
Pulls down a folder of scripts from your Databricks workspace so that you can commit the files to your Git repo. It is recommended that you set LocalOutputPath to a folder inside your Git repo.
Parameters
-BearerToken: Your API token (see Bearer tokens below)
-Region: The Azure Region that hosts your workspace - get this from the start of the URL for your workspace
-ExportPath: The folder inside Databricks you would like to clone, e.g. /Shared/MyETL. Must start with /
-LocalOutputPath: The local folder to clone the files to. Ideally inside a repo. Can be fully qualified or relative.
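A sketch of exporting workspace notebooks into a repo. The cmdlet name used here (Export-DatabricksFolder) is an assumption that matches the parameters above; verify it with Get-Command -Module azure.databricks.cicd.tools.

```powershell
# Assumes the exporting cmdlet is Export-DatabricksFolder - verify with
# Get-Command -Module azure.databricks.cicd.tools
Export-DatabricksFolder -BearerToken $BearerToken -Region "westeurope" `
    -ExportPath "/Shared/MyETL" `
    -LocalOutputPath "./notebooks/MyETL"
```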
Deploy a folder of scripts from a local folder (Git repo) to a specific folder in your Databricks workspace.
Parameters
-BearerToken: Your API token (see Bearer tokens below)
-Region: The Azure Region that hosts your workspace - get this from the start of the URL for your workspace
-LocalPath: The local folder containing the scripts to deploy. Subfolders will also be deployed.
-DatabricksPath: The folder inside Databricks you would like to deploy into, e.g. /Shared/MyETL. Must start with /
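A sketch of the deployment direction, for example in a release pipeline. The cmdlet name used here (Import-DatabricksFolder) is an assumption that matches the parameters above; verify it with Get-Command -Module azure.databricks.cicd.tools.

```powershell
# Assumes the deploying cmdlet is Import-DatabricksFolder - verify with
# Get-Command -Module azure.databricks.cicd.tools
Import-DatabricksFolder -BearerToken $BearerToken -Region "westeurope" `
    -LocalPath "./notebooks/MyETL" `
    -DatabricksPath "/Shared/MyETL"
```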
- Add-DatabricksNotebookJob - Schedule a job based on a Notebook.
- Add-DatabricksPythonJob - Schedule a job based on a Python script (stored in DBFS).
- Remove-DatabricksJob
Note: There is currently no support for Jar jobs or Spark Submit in this module - it may come in the future (please express an interest in Issues if you would like this). Python jobs do not work in Azure Databricks (it is missing as an option in the Jobs UI). Generally in Azure we would recommend using ADF to execute jobs rather than using Databricks jobs.
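A hedged sketch of scheduling a notebook job on a new job cluster. The parameter names and the cron-style schedule parameter are assumptions; confirm them with Get-Help Add-DatabricksNotebookJob.

```powershell
# Illustrative only - parameter names are an assumption; check Get-Help Add-DatabricksNotebookJob.
# Schedules a notebook to run nightly on a job cluster.
Add-DatabricksNotebookJob -BearerToken $BearerToken -Region "westeurope" `
    -JobName "NightlyETL" `
    -SparkVersion "5.3.x-scala2.11" -NodeType "Standard_DS3_v2" `
    -MinNumberOfWorkers 1 -MaxNumberOfWorkers 2 `
    -NotebookPath "/Shared/MyETL/LoadSales" `
    -ScheduleCronExpression "0 0 1 * * ?"
```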
- Add-DatabricksLibrary
- Get-DatabricksLibraries
See the Wiki for help on the commands. You can also see more examples in the Tests folder.
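A hedged sketch of attaching a library to an existing cluster. The parameter names and accepted values below are assumptions for illustration; see the Wiki page for Add-DatabricksLibrary.

```powershell
# Illustrative only - parameter names and accepted values are assumptions;
# see the Wiki page for Add-DatabricksLibrary.
# Attach a PyPI package to an existing cluster.
Add-DatabricksLibrary -BearerToken $BearerToken -Region "westeurope" `
    -LibraryType "pypi" -LibrarySettings "simplejson" `
    -ClusterId $ClusterId
```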
All of the API calls require a Bearer token to authenticate you. To create a token, log in to your workspace and click the person icon in the top right corner. From there go into "User Settings" and click "Generate New Token". Copy the token into your scripts.
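Rather than hard-coding the token, you can feed it in from a pipeline secret variable. The environment variable name below is just an illustration.

```powershell
# Keep the token out of source control - read it from an environment variable
# (for example one populated by a pipeline secret variable; the name here is illustrative).
$BearerToken = $env:DATABRICKS_BEARER_TOKEN

Get-DatabricksClusters -BearerToken $BearerToken -Region "westeurope"
```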
Deployment tasks exist here: https://marketplace.visualstudio.com/items?itemName=DataThirstLtd.databricksDeployScriptsTasks
Note that not all cmdlets are available as tasks. Instead you may want to import the module and create PowerShell scripts that use these.