User Guide¶
This is the user guide for Kedro-Databricks. It contains detailed instructions on how to use the plugin, including how to customize it for your specific needs.
Configuration¶
You can change how kedro-databricks works by configuring it as a tool in your pyproject.toml. For example:
# pyproject.toml
[tool.kedro-databricks]
init_schema = "default"
init_catalog = "workspace"
default_env = "dev"
The available configuration options are:
conf_source = Field(default='conf')
Path of a directory where project configuration is stored.
default_env = Field(default='dev')
Default target environment for kedro-databricks commands.
init_catalog = Field(default='workspace')
Default catalog for kedro databricks init.
init_schema = Field(default='default')
Default schema for kedro databricks init.
regex_prefix = Field(default='re:')
Prefix to use for discovering regex workflow or task overrides. Must end in ':'.
workflow_default_key = Field(default='default')
Default key to use for overrides in kedro databricks bundle.
workflow_generator = Field(default='node')
Default generator to use for generating Databricks Asset Bundle resources.
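As a rough illustration of how these options and their defaults fit together, the sketch below models them as a plain dataclass and overlays values from an already-parsed [tool.kedro-databricks] table. The names PluginConfig and load_config are hypothetical and are not part of the plugin's API, and rejecting unknown keys is a choice made for this sketch. Only the option names, the defaults, and the rule that regex_prefix must end in ':' come from the list above.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class PluginConfig:
    """Hypothetical stand-in for the plugin settings; defaults mirror the list above."""

    conf_source: str = "conf"
    default_env: str = "dev"
    init_catalog: str = "workspace"
    init_schema: str = "default"
    regex_prefix: str = "re:"
    workflow_default_key: str = "default"
    workflow_generator: str = "node"

    def __post_init__(self) -> None:
        # Documented constraint: the regex prefix must end in ':'.
        if not self.regex_prefix.endswith(":"):
            raise ValueError("regex_prefix must end in ':'")


def load_config(tool_table: dict[str, Any]) -> PluginConfig:
    """Overlay user-supplied values on the defaults (assumption: unknown keys are rejected)."""
    known = {f.name for f in fields(PluginConfig)}
    unknown = set(tool_table) - known
    if unknown:
        raise KeyError(f"unknown options: {sorted(unknown)}")
    return PluginConfig(**tool_table)


# Values taken from the pyproject.toml example in this section.
config = load_config(
    {"init_schema": "default", "init_catalog": "workspace", "default_env": "dev"}
)
```

Options that are not set in pyproject.toml, such as workflow_generator here, keep their defaults.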
Resource Generation¶
You can choose how resources are generated using -g/--resource-generator:
- node (default): creates a job task for each Kedro node with dependencies.
- pipeline: creates a single task that runs the entire pipeline.
You can also provide a fully-qualified dotted path to a custom generator class
that subclasses kedro_databricks.resource_generator.AbstractResourceGenerator.
Examples:
# Generate per-node tasks (default behavior)
kedro databricks bundle -g node
# Generate a single-task job for the whole pipeline
kedro databricks bundle -g pipeline
# Bundle only one pipeline by name
kedro databricks bundle -g pipeline -p my_pipeline
# Pass runtime parameters to tasks
kedro databricks bundle -g node -r "param1=val1,param2=val2"
Tip: The same -g/--resource-generator, -p/--pipeline, and -r/--params options are also available when using kedro databricks deploy --bundle.
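To make the difference between the two built-in generators concrete, here is a minimal sketch of the two task layouts. It is not the plugin's implementation: the helper names and the toy dependency mapping are hypothetical, and the dictionaries only borrow the task_key and depends_on field names from the Databricks Jobs API.

```python
from typing import Any


def node_tasks(dependencies: dict[str, list[str]]) -> list[dict[str, Any]]:
    """One task per node, each listing the nodes it depends on (the `node` layout)."""
    tasks = []
    for node, upstream in sorted(dependencies.items()):
        task: dict[str, Any] = {"task_key": node}
        if upstream:
            task["depends_on"] = [{"task_key": dep} for dep in sorted(upstream)]
        tasks.append(task)
    return tasks


def pipeline_task(pipeline_name: str) -> list[dict[str, Any]]:
    """A single task that runs the whole pipeline (the `pipeline` layout)."""
    return [{"task_key": pipeline_name}]


# Toy pipeline: split -> train -> evaluate (hypothetical node names).
deps = {"split": [], "train": ["split"], "evaluate": ["train"]}
per_node = node_tasks(deps)
single = pipeline_task("my_pipeline")
```

The node layout exposes each step to Databricks as a separate task, while the pipeline layout produces one task for the whole run.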
Creating a custom resource generator¶
You can implement your own generator by subclassing
kedro_databricks.resource_generator.AbstractResourceGenerator and
returning a Databricks job payload from _create_job_dict. For example,
the skeleton below marks where you would build per-node tasks and attach a custom cluster:
# my_project/generators/custom.py
from __future__ import annotations

from typing import Any

from kedro.pipeline import Pipeline
from kedro_databricks.resource_generator.abstract_resource_generator import (
    AbstractResourceGenerator,
)


class CustomGenerator(AbstractResourceGenerator):
    """Example generator with a predefined job cluster per task."""

    def _create_job_dict(
        self, name: str, pipeline: Pipeline, pipeline_name: str
    ) -> dict[str, Any]:
        # Your custom logic
        return {"name": name, "tasks": []}
Use your custom generator by passing its dotted path with -g/--resource-generator:
kedro databricks bundle -g my_project.generators.custom.CustomGenerator
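A fully-qualified dotted path is conventionally resolved by importing the module part and looking up the class name on it. The sketch below shows that general technique with importlib. The function name resolve_dotted_path is hypothetical, and the plugin's own loading code may differ.

```python
import importlib


def resolve_dotted_path(dotted_path: str) -> type:
    """Import `package.module.ClassName` and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    obj = getattr(module, class_name)
    if not isinstance(obj, type):
        raise TypeError(f"{dotted_path!r} does not refer to a class")
    return obj


# Demonstrated with a standard-library class so the snippet runs anywhere.
cls = resolve_dotted_path("collections.OrderedDict")
```

For this to work with a custom generator, the module must be importable from the environment in which the kedro command runs.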
Override Behavior¶
Override files are read from conf/<env>/databricks.yml and merged into generated resources.
- resources.jobs.<job-name> applies to a specific generated job.
- resources.jobs.<default-key> applies to all jobs (default key is default, configurable through --default-key or project config).
- Regex keys are supported using the configured regex_prefix (default re:).
Example with job-level regex override:
Task-level regex overrides are also supported under resources.jobs.<job>.tasks:
resources:
  jobs:
    my_project_default:
      tasks:
        - task_key: default
          timeout_seconds: 7200
        - task_key: re:^train_.*
          max_retries: 2
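The matching rules for task overrides can be pictured with the short sketch below. It is an illustration under stated assumptions, not the plugin's merge code: the function name apply_task_overrides is hypothetical, overrides are applied in list order, and re.search is assumed for regex keys (the plugin may use a different matching function or precedence).

```python
import re
from typing import Any

REGEX_PREFIX = "re:"     # documented default for regex_prefix
DEFAULT_KEY = "default"  # documented default for workflow_default_key


def apply_task_overrides(
    tasks: list[dict[str, Any]], overrides: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Return copies of `tasks` with every matching override merged in."""
    merged = []
    for task in tasks:
        result = dict(task)
        for override in overrides:
            key = override["task_key"]
            extra = {k: v for k, v in override.items() if k != "task_key"}
            if key == DEFAULT_KEY:
                matches = True  # the default key applies to every task
            elif key.startswith(REGEX_PREFIX):
                pattern = key[len(REGEX_PREFIX):]
                matches = re.search(pattern, task["task_key"]) is not None
            else:
                matches = key == task["task_key"]  # exact task name
            if matches:
                result.update(extra)
        merged.append(result)
    return merged


# Hypothetical generated tasks, combined with the overrides from the YAML above.
tasks = [{"task_key": "train_model"}, {"task_key": "preprocess"}]
overrides = [
    {"task_key": "default", "timeout_seconds": 7200},
    {"task_key": "re:^train_.*", "max_retries": 2},
]
merged = apply_task_overrides(tasks, overrides)
```

Under these assumptions, train_model receives both the timeout and the retry setting, while preprocess receives only the timeout.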
Resource Type Support¶
You can define any Databricks resource type under resources.
- jobs use specialized merge logic (for tasks, environments, clusters, notifications, health rules, and parameters).
- Other resource types use shallow key-based merging and are written as-is into generated bundle files.
This makes it possible to combine generated jobs with manually maintained resources such as volumes, while still using advanced merge behavior for jobs.
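What "shallow key-based merging" means in practice can be shown with a small sketch. The function and the sample volume fields below are hypothetical and only illustrate the general technique: top-level keys are replaced wholesale, and nested values are not merged.

```python
from typing import Any


def shallow_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Top-level keys from `override` replace those in `base`; nothing is merged deeper."""
    return {**base, **override}


# Hypothetical resource definition; the field names are for illustration only.
base = {"catalog_name": "workspace", "grants": [{"principal": "a"}]}
override = {"grants": [{"principal": "b"}], "comment": "raw data"}
merged = shallow_merge(base, override)
```

Here the grants list from the override replaces the original list entirely, which is the main practical difference from the specialized merging applied to jobs.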
Capability Matrix¶
The table below maps common capabilities to where they are documented and where you can find concrete examples.
| Capability | Main command/options | Primary docs | Example directory |
|---|---|---|---|
| Initialize bundle scaffold | kedro databricks init | Getting Started | n/a |
| Reinitialize existing project | kedro databricks init --overwrite | Getting Started | n/a |
| Generate node-level jobs | kedro databricks bundle -g node | User Guide | examples/individual_task/ |
| Generate pipeline-level jobs | kedro databricks bundle -g pipeline | User Guide | examples/individual_workflows/ |
| Pass runtime params to tasks | kedro databricks bundle -r | User Guide | examples/individual_task_with_parameters/ |
| Apply defaults and targeted overrides | --default-key + conf/<env>/databricks.yml | User Guide | examples/using_default_overrides/ |
| Apply regex overrides (jobs/tasks) | re: keys, --regex-prefix | User Guide | examples/regex_overrides/, examples/task_regex_overrides/ |
| Use advanced Jobs API fields | override config | User Guide | examples/with_all_jobs_api_2.2_fields/ |
| Add custom libraries | override config | User Guide | examples/with_custom_libraries/ |
| Configure health rules | override config | User Guide | examples/with_health_rules/ |
| Configure job parameters | override config | User Guide | examples/with_job_parameters/ |
| Configure webhook notifications | override config | User Guide | examples/webhook_notifications/, examples/task_level_webhook_notifications/ |
| Deploy and upload local data | kedro databricks deploy | Getting Started | n/a |
| Run jobs on Databricks | kedro databricks run [pipeline] | Getting Started | n/a |
| Forward raw Databricks CLI args | -- ... | Getting Started | n/a |
| Destroy deployed resources | kedro databricks destroy | Getting Started | n/a |