Deploy to Databricks

Deploys a streaming or batch job to Databricks. If an older version of the job is already running, this task shuts down the old job before deploying the new version.

It is most often used in combination with the Deploy artifacts to Azure Blob task.

Deployment

Add the following task to deployment.yaml:

- task: deploy_to_databricks
  jobs:
  - main_name: "main.py"
    config_file: databricks.json.j2
    lang: python
    name: foo
    run_stream_job_immediately: False
    is_batch: False
    arguments:
    - eventhubs.consumer_group: "my-consumer-group"

If used together, this task should come after the upload_to_blob task in deployment.yaml, as shown in the sketch below.
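
A minimal sketch of that ordering in deployment.yaml; the artifact-upload task's name follows the reference above, and its fields are omitted here (see the Deploy artifacts to Azure Blob documentation for its exact name and fields):

- task: upload_to_blob
  # fields for the artifact upload omitted; see the Deploy artifacts to Azure Blob page
- task: deploy_to_databricks
  jobs:
  - main_name: "main.py"
    lang: python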

| field | description | value |
|---|---|---|
| jobs | A list of job configurations | Must contain at least one job |
| jobs[].main_name | When lang is python, the path to the Python main file; when lang is scala, the fully qualified class name | For Python: main/main.py, for Scala: com.databricks.ComputeModels |
| jobs[].config_file | The path to a Jinja-templated JSON Databricks job config | Defaults to databricks.json.j2 |
| jobs[].lang (optional) | The language identifier of your project | One of python, scala; defaults to python |
| jobs[].name (optional) | A postfix to identify your job on Databricks | A postfix of foo names your job application-name_foo-version. Defaults to no postfix, in which case all jobs (if you have multiple) get the same name |
| jobs[].run_stream_job_immediately (optional) | Whether or not to run a streaming job immediately after deployment | True or False; defaults to True |
| jobs[].is_batch (optional) | Designate the job as an unscheduled batch job | True or False; defaults to False |
| jobs[].arguments (optional) | Key-value pairs to be passed into your project | Defaults to no arguments |
| jobs[].use_original_python_filename (optional) | If you uploaded multiple unique Python files using use_original_python_filename in the publish_artifact job, set this flag here as well. Only affects Python files | True or False |

The behaviour of the use_original_python_filename flag:

| main_name | use_original_python_filename: True | use_original_python_filename: False |
|---|---|---|
| script.py | project-main-1.0.0-script.py | project-main-1.0.0.py |
| script.py | project-main-SNAPSHOT-script.py | project-main-SNAPSHOT.py |
| script.py | project-main-my_branch-script.py | project-main-my_branch.py |
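
As a sketch of how the fields from the table above combine, the entry below deploys a Scala batch job. The job name postfix and the argument are illustrative, and the class name is the example value from the table:

- task: deploy_to_databricks
  jobs:
  - main_name: "com.databricks.ComputeModels"
    config_file: databricks.json.j2
    lang: scala
    name: nightly_batch
    is_batch: True
    arguments:
    - environment: "dev"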

The JSON file can use any of the keys supported by Databricks job settings. During deployment, the presence of the key schedule in the JSON file determines whether the job is streaming or batch: when schedule is present, or is_batch has been set to True, it is considered a batch job; otherwise it is a streaming job. A streaming job is kicked off immediately upon deployment (unless run_stream_job_immediately is set to False).
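
For example, adding a schedule block in the standard Databricks jobs format to the job config turns the deployment into a scheduled batch job; the cron expression and timezone below are purely illustrative:

"schedule": {
  "quartz_cron_expression": "0 0 3 * * ?",
  "timezone_id": "UTC"
}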

An example of databricks.json.pyspark.j2

{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    { 
      "egg": "{{ egg_file }}"
    }
  ],
  "spark_python_task": {
    "python_file": "{{ python_file }}",
    "parameters":  {{ parameters | tojson }} 
  }
}

An explanation of the Jinja-templated values. These values are resolved automatically during deployment.

| field | description |
|---|---|
| application_name | your-git-repo-version (e.g. flights-prediction-SNAPSHOT) |
| log_destination | your-git-repo (e.g. flights-prediction) |
| egg_file | The location of the egg file uploaded by the upload_to_blob task |
| python_file | The location of the Python main file uploaded by the upload_to_blob task |

An example of databricks.json.scalaspark.j2

{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    { 
      "jar": "{{ jar_file }}"
    }
  ],
  "spark_jar_task": {
    "main_class_name": "{{ class_name }}",
    "parameters":  {{ parameters | tojson }} 
  }
}

An explanation of the Jinja-templated values. These values are resolved automatically during deployment.

| field | description |
|---|---|
| application_name | your-git-repo-version (e.g. flights-prediction-SNAPSHOT) |
| log_destination | your-git-repo (e.g. flights-prediction) |
| jar_file | The location of the jar file uploaded by the upload_to_blob task |
| class_name | The class in the jar that should be run |

Takeoff config

Make sure takeoff_config.yaml contains the following azure_keyvault_keys:

  azure_storage_account:
    account_name: "azure-shared-blob-username"
    account_key: "azure-shared-blob-password"

and these takeoff_common keys:

  artifacts_shared_blob_container_name: libraries
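
Put together, the relevant part of takeoff_config.yaml would look roughly like this (a minimal sketch; a real file typically contains other keys as well):

azure_keyvault_keys:
  azure_storage_account:
    account_name: "azure-shared-blob-username"
    account_key: "azure-shared-blob-password"

takeoff_common:
  artifacts_shared_blob_container_name: libraries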

Removal of old jobs

The deploy_to_databricks step will try to remove any existing jobs whose name matches the one it is deploying. There are a few things to note here:

  1. If you change the job name (e.g. add/remove/update the name field in the deployment config) Takeoff will not recognise the existing job as being the same. It will therefore not remove it.
  2. When running Takeoff on a git tag, the version, and therefore the job name, changes as well. For example, if version 1.0.0 was running and you deploy 1.1.0, Takeoff will look for a job named after 1.1.0 to remove, won’t find it, and will leave version 1.0.0 running alongside your new 1.1.0 job (see the example below).
  3. Takeoff will not remove your Databricks job if/when you close your branch (either by removing the branch or by merging a Pull/Merge Request).
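
To make the second point concrete, using the application-name_foo-version naming scheme from the table above, the Databricks jobs overview after deploying a new tag could look like this (names are illustrative):

flights-prediction_foo-1.0.0   # previous release; name does not match, so it is left running
flights-prediction_foo-1.1.0   # newly deployed job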