# Deploy to Databricks

Deploys a streaming or batch job to Databricks. If an old version of the job is already running, this task shuts it down before deploying the new version. It is most often used in combination with the _Deploy artifacts to Azure Blob_ task.
## Deployment

Add the following task to `deployment.yaml`:

```yaml
- task: deploy_to_databricks
  jobs:
    - main_name: "main.py"
      config_file: databricks.json.j2
      lang: python
      name: foo
      run_stream_job_immediately: False
      is_batch: False
      arguments:
        - eventhubs.consumer_group: "my-consumer-group"
```

If used together with the `upload_to_blob` task, this task should come after it.
| field | description | value |
|---|---|---|
| `jobs` | A list of job configurations | Must have at least one job |
| `jobs[].main_name` | When `lang` is `python`, the path to the Python main file; when `lang` is `scala`, a class name | Python: `main/main.py`; Scala: `com.databricks.ComputeModels` |
| `jobs[].config_file` | The path to a Jinja-templated Databricks job config (JSON) | Defaults to `databricks.json.j2` |
| `jobs[].lang` (optional) | The language identifier of your project | One of `python`, `scala`; defaults to `python` |
| `jobs[].name` (optional) | A postfix to identify your job on Databricks | A postfix of `foo` names your job `application-name_foo-version`. Defaults to no postfix, which gives all jobs (if you have multiple) the same name. |
| `jobs[].run_stream_job_immediately` (optional) | Whether or not to run a stream job immediately | `True` or `False`; defaults to `True` |
| `jobs[].is_batch` (optional) | Designate the job as an unscheduled batch job | `True` or `False`; defaults to `False` |
| `jobs[].arguments` (optional) | Key-value pairs to be passed into your project | Defaults to no arguments |
| `jobs[].use_original_python_filename` (optional) | If you uploaded multiple unique Python files using `use_original_python_filename` in the `publish_artifact` task, use this flag here too. It only affects Python files. | `True` or `False` |
The behaviour of the `use_original_python_filename` flag:

| main_name | `True` | `False` |
|---|---|---|
| `script.py` | `project-main-1.0.0-script.py` | `project-main-1.0.0.py` |
| `script.py` | `project-main-SNAPSHOT-script.py` | `project-main-SNAPSHOT.py` |
| `script.py` | `project-main-my_branch-script.py` | `project-main-my_branch.py` |
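The naming scheme in the table above can be sketched as a small helper. This is an illustration of the pattern only, not Takeoff's actual code; the function name and signature are hypothetical.

```python
def uploaded_filename(project: str, version: str, main_name: str,
                      use_original_python_filename: bool) -> str:
    """Illustrative sketch: name of the uploaded Python file, per the
    behaviour table for the use_original_python_filename flag."""
    # Keep only the file name and strip the .py extension.
    stem = main_name.rsplit("/", 1)[-1]
    if stem.endswith(".py"):
        stem = stem[:-3]
    if use_original_python_filename:
        # The original file name is kept as a suffix, so multiple
        # unique Python files do not overwrite each other.
        return f"{project}-main-{version}-{stem}.py"
    return f"{project}-main-{version}.py"

uploaded_filename("project", "1.0.0", "script.py", True)       # 'project-main-1.0.0-script.py'
uploaded_filename("project", "SNAPSHOT", "script.py", False)   # 'project-main-SNAPSHOT.py'
```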
The JSON file can use any of the supported Databricks job settings. During deployment, the presence of the key `schedule` in the JSON file determines whether the job is streaming or batch: when `schedule` is present, or `is_batch` has been set to `True`, it is considered a batch job; otherwise it is a streaming job. A streaming job is kicked off immediately upon deployment.
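The decision rule above can be sketched as follows. The function name is illustrative, not Takeoff's actual API; only the logic (a `schedule` key or `is_batch: True` means batch) comes from the documentation.

```python
import json

def job_is_batch(job_config: dict, is_batch_flag: bool = False) -> bool:
    """A job is a batch job when the rendered job config contains a
    'schedule' key, or when is_batch was set to True in deployment.yaml.
    Otherwise it is a streaming job and is started immediately."""
    return is_batch_flag or "schedule" in job_config

scheduled = json.loads('{"name": "my-app", "schedule": {"quartz_cron_expression": "0 0 * * * ?"}}')
job_is_batch(scheduled)                       # True: 'schedule' key present
job_is_batch({"name": "my-app"})              # False: streaming job
job_is_batch({"name": "my-app"}, is_batch_flag=True)  # True: forced via is_batch
```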
An example of `databricks.json.pyspark.j2`:

```json
{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    {
      "egg": "{{ egg_file }}"
    }
  ],
  "spark_python_task": {
    "python_file": "{{ python_file }}",
    "parameters": {{ parameters | tojson }}
  }
}
```
An explanation of the Jinja-templated values; these are resolved automatically during deployment.

| field | description |
|---|---|
| `application_name` | `your-git-repo-version` (e.g. `flights-prediction-SNAPSHOT`) |
| `log_destination` | `your-git-repo` (e.g. `flights-prediction`) |
| `egg_file` | The location of the egg file uploaded by the `upload_to_blob` task |
| `python_file` | The location of the Python main file uploaded by the `upload_to_blob` task |
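To show how these values flow into the template, here is a minimal stand-in for the rendering step. Takeoff itself uses Jinja2; this regex-based substitute only illustrates the placeholder resolution, and the values are taken from the table above.

```python
import json
import re

def render(template: str, values: dict) -> str:
    """Toy renderer for the two placeholder forms used in the job config:
    {{ name }} (plain substitution) and {{ name | tojson }} (JSON-encoded)."""
    return re.sub(
        r"\{\{\s*(\w+)(\s*\|\s*tojson)?\s*\}\}",
        lambda m: json.dumps(values[m.group(1)]) if m.group(2) else str(values[m.group(1)]),
        template,
    )

template = '{"name": "{{ application_name }}", "parameters": {{ parameters | tojson }}}'
rendered = render(template, {
    "application_name": "flights-prediction-SNAPSHOT",
    "parameters": ["--eventhubs.consumer_group", "my-consumer-group"],
})
json.loads(rendered)["name"]  # 'flights-prediction-SNAPSHOT'
```

Note that `parameters` goes through the `tojson` filter so a Python list becomes a valid JSON array in the rendered config.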
An example of `databricks.json.scalaspark.j2`:

```json
{
  "name": "{{ application_name }}",
  "new_cluster": {
    "spark_version": "4.3.x-scala2.11",
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.sql.warehouse.dir": "dbfs:/mnt/sdh/data/raw/managedtables",
      "spark.databricks.delta.preview.enabled": "true",
      "spark.sql.hive.metastore.jars": "builtin",
      "spark.sql.execution.arrow.enabled": "true",
      "spark.sql.hive.metastore.version": "1.2.1"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/sdh/logs/{{ log_destination }}"
      }
    }
  },
  "max_retries": 5,
  "libraries": [
    {
      "jar": "{{ jar_file }}"
    }
  ],
  "spark_jar_task": {
    "main_class_name": "{{ class_name }}",
    "parameters": {{ parameters | tojson }}
  }
}
```
An explanation of the Jinja-templated values; these are resolved automatically during deployment.

| field | description |
|---|---|
| `application_name` | `your-git-repo-version` (e.g. `flights-prediction-SNAPSHOT`) |
| `log_destination` | `your-git-repo` (e.g. `flights-prediction`) |
| `jar_file` | The location of the jar file uploaded by the `upload_to_blob` task |
| `class_name` | The class in the jar that should be run |
## Takeoff config

Make sure `takeoff_config.yaml` contains the following `azure_keyvault_keys`:

```yaml
azure_storage_account:
  account_name: "azure-shared-blob-username"
  account_key: "azure-shared-blob-password"
```

and these `takeoff_common` keys:

```yaml
artifacts_shared_blob_container_name: libraries
```
## Removal of old jobs

The `deploy_to_databricks` step will try to remove any existing jobs whose name matches that of the new one it is deploying. There are a few things to note here:

- If you change the job name (e.g. add, remove, or update the `name` field in the deployment config), Takeoff will not recognise the existing job as being the same, and will therefore not remove it.
- When running Takeoff on a git tag, you are also changing the name of the job, since the version is part of it. For example, if you had version 1.0.0 running and you now deploy 1.1.0, Takeoff will look for a job named after 1.1.0 to kill, won't find one, and will leave version 1.0.0 running alongside your new 1.1.0 job.
- Takeoff will not remove your Databricks job if/when you close your branch (either by deleting the branch or by merging a Pull/Merge Request).
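The pitfalls above follow from the matching rule: only an exact name match is removed. A hedged sketch of that rule (the function and job names are illustrative, not Takeoff's actual API):

```python
def jobs_to_remove(existing_job_names: list, new_job_name: str) -> list:
    """Illustrative sketch: only jobs whose name matches the new one
    are removed. Because the version is part of the name, a version
    bump (e.g. a new git tag) leaves the old job running."""
    return [name for name in existing_job_names if name == new_job_name]

existing = ["flights-prediction-1.0.0"]
jobs_to_remove(existing, "flights-prediction-1.1.0")  # [] -> 1.0.0 keeps running
jobs_to_remove(existing, "flights-prediction-1.0.0")  # ['flights-prediction-1.0.0']
```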