Deploy to Databricks

Deploys a streaming or batch job to Databricks. If an older version of the job is already running, this task shuts down the old job before deploying the new version.

It is most often used in combination with the Deploy artifacts to Azure Blob task.

Deployment

Add the following task to deployment.yaml:

- task: deploy_to_databricks
  jobs:
  - main_name: "main.py"
    config_file: databricks.json.j2
    lang: python
    name: foo
    run_stream_job_immediately: False
    is_batch: False
    arguments:
    - eventhubs.consumer_group: "my-consumer-group"

If used together, this task should come after the upload_to_blob task in deployment.yaml, as shown in the sketch below.
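
A minimal sketch of that ordering in deployment.yaml; the artifact-upload task's name follows the reference above, and its fields are omitted here (see the Deploy artifacts to Azure Blob documentation for its exact name and fields):

- task: upload_to_blob
  # fields for the artifact upload omitted; see the Deploy artifacts to Azure Blob page
- task: deploy_to_databricks
  jobs:
  - main_name: "main.py"
    lang: python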

| field | description | value |
|---|---|---|
| jobs | A list of job configurations | Must contain at least one job |
| jobs[].main_name | When lang is python, the path to the Python main file; when lang is scala, the fully qualified class name | For Python: main/main.py, for Scala: com.databricks.ComputeModels |
| jobs[].config_file | The path to a Jinja-templated JSON Databricks job config | Defaults to databricks.json.j2 |
| jobs[].lang (optional) | The language identifier of your project | One of python, scala; defaults to python |
| jobs[].name (optional) | A postfix to identify your job on Databricks | A postfix of foo names your job application-name_foo-version. Defaults to no postfix, in which case all jobs (if you have multiple) get the same name |
| jobs[].run_stream_job_immediately (optional) | Whether or not to run a streaming job immediately after deployment | True or False; defaults to True |
| jobs[].is_batch (optional) | Designate the job as an unscheduled batch job | True or False; defaults to False |
| jobs[].arguments (optional) | Key-value pairs to be passed into your project | Defaults to no arguments |
| jobs[].use_original_python_filename (optional) | If you uploaded multiple unique Python files using use_original_python_filename in the publish_artifact job, set this flag here as well. Only affects Python files | True or False |

The behaviour of the use_original_python_filename flag:

| main_name | use_original_python_filename: True | use_original_python_filename: False |
|---|---|---|
| script.py | project-main-1.0.0-script.py | project-main-1.0.0.py |
| script.py | project-main-SNAPSHOT-script.py | project-main-SNAPSHOT.py |
| script.py | project-main-my_branch-script.py | project-main-my_branch.py |
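
As a sketch of how the fields from the table above combine, the entry below deploys a Scala batch job. The job name postfix and the argument are illustrative, and the class name is the example value from the table:

- task: deploy_to_databricks
  jobs:
  - main_name: "com.databricks.ComputeModels"
    config_file: databricks.json.j2
    lang: scala
    name: nightly_batch
    is_batch: True
    arguments:
    - environment: "dev"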

The JSON file can use any of the keys supported by Databricks job settings. During deployment, the presence of the key schedule in the JSON file determines whether the job is streaming or batch: when schedule is present, or is_batch has been set to True, it is considered a batch job; otherwise it is a streaming job. A streaming job is kicked off immediately upon deployment (unless run_stream_job_immediately is set to False).
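
For example, adding a schedule block in the standard Databricks jobs format to the job config turns the deployment into a scheduled batch job; the cron expression and timezone below are purely illustrative:

"schedule": {
  "quartz_cron_expression": "0 0 3 * * ?",
  "timezone_id": "UTC"
}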

An example of databricks.json.pyspark.j2

{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    { 
      "egg": "{{ egg_file }}"
    }
  ],
  "spark_python_task": {
    "python_file": "{{ python_file }}",
    "parameters":  {{ parameters | tojson }} 
  }
}

An explanation of the Jinja-templated values. These values are resolved automatically during deployment.

| field | description |
|---|---|
| application_name | your-git-repo-version (e.g. flights-prediction-SNAPSHOT) |
| log_destination | your-git-repo (e.g. flights-prediction) |
| egg_file | The location of the egg file uploaded by the upload_to_blob task |
| python_file | The location of the Python main file uploaded by the upload_to_blob task |

An example of databricks.json.scalaspark.j2

{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    { 
      "jar": "{{ jar_file }}"
    }
  ],
  "spark_jar_task": {
    "main_class_name": "{{ class_name }}",
    "parameters":  {{ parameters | tojson }} 
  }
}

An explanation of the Jinja-templated values. These values are resolved automatically during deployment.

| field | description |
|---|---|
| application_name | your-git-repo-version (e.g. flights-prediction-SNAPSHOT) |
| log_destination | your-git-repo (e.g. flights-prediction) |
| jar_file | The location of the jar file uploaded by the upload_to_blob task |
| class_name | The class in the jar that should be run |

Takeoff config

Make sure takeoff_config.yaml contains the following azure_keyvault_keys:

  azure_storage_account:
    account_name: "azure-shared-blob-username"
    account_key: "azure-shared-blob-password"

and these takeoff_common keys:

  artifacts_shared_blob_container_name: libraries
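
Put together, the relevant part of takeoff_config.yaml would look roughly like this (a minimal sketch; a real file typically contains other keys as well):

azure_keyvault_keys:
  azure_storage_account:
    account_name: "azure-shared-blob-username"
    account_key: "azure-shared-blob-password"

takeoff_common:
  artifacts_shared_blob_container_name: libraries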

Removal of old jobs

The deploy_to_databricks step will try to remove any existing jobs whose name matches the one it is deploying. There are a few things to note here:

  1. If you change the job name (e.g. add/remove/update the name field in the deployment config) Takeoff will not recognise the existing job as being the same. It will therefore not remove it.
  2. When running Takeoff on a git tag, the version, and therefore the job name, changes as well. For example, if version 1.0.0 was running and you deploy 1.1.0, Takeoff will look for a job named after 1.1.0 to remove, won’t find it, and will leave version 1.0.0 running alongside your new 1.1.0 job (see the example below).
  3. Takeoff will not remove your Databricks job if/when you close your branch (either by removing the branch or by merging a Pull/Merge Request).
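
To make the second point concrete, using the application-name_foo-version naming scheme from the table above, the Databricks jobs overview after deploying a new tag could look like this (names are illustrative):

flights-prediction_foo-1.0.0   # previous release; name does not match, so it is left running
flights-prediction_foo-1.1.0   # newly deployed job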