
Ansible Databricks

An Ansible Galaxy role to manage Databricks resources and configuration, making it easy to keep mission-critical items under source control. It wraps the Databricks CLI and attempts to make operations on most configurable components idempotent.

Prerequisites

  • Databricks organization account set up in AWS or Azure
  • Databricks user account within your organization
  • Ansible >= 2.6
  • A Databricks personal access token (used by the CLI)

Using in your Ansible playbook

  • Install in your Ansible repo: ansible-galaxy install colemanja91.ansible-databricks (or pin it in a requirements.yml; see the sketch after the example playbook)
  • Example playbook:
---
- hosts:
    - localhost
  vars_files:
    - "my/secret/file.yml"
    - "my/ansible/variables.yml"
  roles:
    - { role: colemanja91.ansible-databricks }
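If you track role dependencies in a requirements file instead, a minimal requirements.yml sketch (the file name is the usual ansible-galaxy convention, not something this role mandates):

# requirements.yml
- src: colemanja91.ansible-databricks

Then install with ansible-galaxy install -r requirements.yml.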

Tasks

CLI installation and setup

  • By default, attempts to install the Databricks CLI via pip
  • Sets up the CLI configuration file
  • Expects either an Ansible variable databricks_token or the environment variable DATABRICKS_TOKEN to be defined
    • Each Ansible user should define the environment variable at the system level, to ensure they are using their own account and have the proper permissions (see the sketch after this list)
    • The Ansible variable should be used only with a shared Databricks account (not recommended)
  • Runs automatically on every execution of the role
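For the recommended environment-variable approach, a minimal shell sketch (the token value is a placeholder, and databricks.yml stands in for your playbook):

# In your shell profile (e.g. ~/.bashrc), so runs use your own account
export DATABRICKS_TOKEN="dapiXXXXXXXXXXXXXXXX"

# The CLI setup task picks the token up on any role execution
ansible-playbook databricks.yml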

DBFS mounts

ansible-playbook databricks.yml -t dbfs

  • Mounts S3 buckets to DBFS paths
  • The variable databricks_dbfs is used to configure this task:
databricks_dbfs:
  - s3_path: "s3a://my-s3-bucket-name"
    dbfs_mount: "/mnt/my-dbfs-mount"
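Since databricks_dbfs is a list, several buckets can be mounted in one run; for example (bucket names and mount points here are illustrative):

databricks_dbfs:
  - s3_path: "s3a://raw-events-bucket"
    dbfs_mount: "/mnt/raw-events"
  - s3_path: "s3a://curated-data-bucket"
    dbfs_mount: "/mnt/curated-data"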

Databricks Secrets

  • Manages secret scopes and the key/value secrets within them
  • The variable databricks_secrets is used to configure this task:
databricks_secrets:
  - scope: "my_secret_scope"
    key: "my_secret_name"
    value: "{{ my_secret_variable }}"
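Because secret values flow through Ansible variables, keep them in a vault-encrypted vars file such as the my/secret/file.yml from the example playbook above; a sketch using standard ansible-vault commands:

# Encrypt the vars file that defines my_secret_variable
ansible-vault encrypt my/secret/file.yml

# Provide the vault password at run time
ansible-playbook databricks.yml --ask-vault-pass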

Libraries

  • NOTE: Currently only libraries used on Databricks Jobs are supported
  • Support for interactive cluster libraries is TBD
  • Uploads the target file from the local file system to a given DBFS path
  • The variable databricks_libraries is used to configure this task:
databricks_libraries:
  - src: "../path/to/my/jar.jar"
    dbfs: "dbfs:/target/path/to/my/jar.jar"
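The dbfs: path set here is what a job's libraries entry references (see the Jobs example below). To confirm an upload landed, the Databricks CLI can list the target directory (a quick manual check, not part of the role):

databricks fs ls dbfs:/target/path/to/my/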

Jobs

  • The variable databricks_jobs is used to configure this task; its fields follow the job settings of the Databricks Jobs API:
databricks_jobs:
  - name: "my_job"
    notebook_task:
      notebook_path: "/User/Jeremy/my_notebook"
    new_cluster:
      autoscale:
        min_workers: 2
        max_workers: 4
      spark_version: "4.3.x-scala2.11"
      node_type_id: "r4.2xlarge"
      aws_attributes:
        first_on_demand: 0
        availability: ON_DEMAND
        zone_id: "{{ aws_zone }}"
        instance_profile_arn: "{{ aws_instance_profile_arn }}"
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 100
      custom_tags:
        - key: environment
          value: "production"
      spark_env_vars:
        - key: "ENVIRONMENT"
          value: "production"
      enable_elastic_disk: true
    libraries:
      - jar: "dbfs:/target/path/to/my/jar.jar"
    email_notifications:
      on_start: []
      on_success: []
      on_failure:
        - example@example.com
    max_concurrent_runs: 1
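To confirm a job was created or updated, the Databricks CLI installed by the role can list the jobs in your workspace (a quick manual check):

databricks jobs list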

About

Manage Databricks configurations via Ansible

License: Apache License 2.0