1. Benefits of Using Multiple Tasks in Databricks Jobs

What are Multi-Task Jobs?

Jobs in Databricks can consist of multiple tasks that run in a specified order, with dependencies between them. This creates a workflow pipeline.

Key Benefits

  1. Modularity

    • Break complex workflows into smaller, manageable tasks (e.g., ingest → transform → analyze).
    • Easier debugging and maintenance.
  2. Parallel Execution

    • Independent tasks can run in parallel (e.g., processing different datasets simultaneously).
  3. Conditional Execution

    • Tasks can depend on the success/failure of previous tasks.
  4. Reusability

    • The same task can be reused across multiple jobs.
  5. Resource Optimization

    • Assign different clusters to different tasks based on workload needs.

Example Workflow

Task 1 (Ingest) → Task 2 (Clean) → Task 3 (Aggregate)
  • If Task 1 fails, downstream tasks (Task 2, Task 3) are skipped.
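
As a sketch, such a workflow can be declared in a single Jobs API (2.1) job-create payload. The job name, task keys, and notebook paths below are illustrative placeholders, and cluster settings are omitted for brevity; the two ingest tasks have no dependencies and can run in parallel, while the aggregate task waits for both:

  {
    "name": "example_pipeline",
    "tasks": [
      {
        "task_key": "ingest_orders",
        "notebook_task": {"notebook_path": "/pipelines/ingest_orders"}
      },
      {
        "task_key": "ingest_customers",
        "notebook_task": {"notebook_path": "/pipelines/ingest_customers"}
      },
      {
        "task_key": "aggregate",
        "depends_on": [
          {"task_key": "ingest_orders"},
          {"task_key": "ingest_customers"}
        ],
        "notebook_task": {"notebook_path": "/pipelines/aggregate"}
      }
    ]
  }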

2. Setting Up a Predecessor Task in Jobs

What is a Predecessor Task?

A task that must complete before another task (successor) can run.

How to Set Up

  1. In Databricks Jobs UI:

    • Create a new job with multiple tasks.
    • In the task settings, select “Depends on” and choose the predecessor task.
  2. Using Jobs API:

    {
      "task_key": "transform_data",
      "depends_on": [{"task_key": "ingest_data"}]
    }
    

Example Scenario

  • Task 1 (ingest_data): Loads raw data.
  • Task 2 (transform_data): Cleans and processes data (depends on ingest_data).

3. When to Use Predecessor Tasks

Common Scenarios

  1. Data Dependency

    • A task requires output from a previous task (e.g., raw data must be ingested before transformation).
  2. Error Handling

    • If an early task fails, downstream tasks should not execute (e.g., avoid processing incomplete data).
  3. Cost Optimization

    • Skip expensive computations if upstream validation fails.

Example

validate_input → (if valid) → process_data → generate_report
  • If validate_input fails, the pipeline stops early.
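
A minimal sketch of this chain as a Jobs API tasks array (task keys and notebook paths are illustrative). Because a task runs only if the tasks it depends on succeed, a failure in validate_input skips process_data and generate_report by default:

  {
    "tasks": [
      {
        "task_key": "validate_input",
        "notebook_task": {"notebook_path": "/checks/validate_input"}
      },
      {
        "task_key": "process_data",
        "depends_on": [{"task_key": "validate_input"}],
        "notebook_task": {"notebook_path": "/etl/process_data"}
      },
      {
        "task_key": "generate_report",
        "depends_on": [{"task_key": "process_data"}],
        "notebook_task": {"notebook_path": "/reports/generate_report"}
      }
    ]
  }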

4. Reviewing a Task’s Execution History

Why Review Execution History?

  • Debug failures.
  • Monitor performance (duration, resource usage).
  • Audit job runs.

How to Access

  1. Databricks UI:

    • Navigate to “Jobs” → Select job → “Runs” tab.
    • Click on a run to see task history.
  2. Key Details Available:

    • Start/end time.
    • Status (Success, Failed, Skipped).
    • Logs (stdout, stderr).
    • Cluster metrics (CPU, memory).
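
The same details can also be pulled programmatically from the Jobs API (GET /api/2.1/jobs/runs/get?run_id=...). Below is a trimmed, illustrative excerpt of such a response; the IDs, timestamps (epoch milliseconds), and messages are made up:

  {
    "job_id": 123,
    "run_id": 456,
    "state": {
      "life_cycle_state": "TERMINATED",
      "result_state": "FAILED",
      "state_message": "Task transform_data failed"
    },
    "start_time": 1696125600000,
    "end_time": 1696126200000,
    "tasks": [
      {"task_key": "ingest_data", "state": {"result_state": "SUCCESS"}},
      {"task_key": "transform_data", "state": {"result_state": "FAILED"}}
    ]
  }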

Example Debugging Flow

  1. Find failed run → Check logs.
  2. Identify error (e.g., FileNotFound).
  3. Fix issue (e.g., correct input path).

5. CRON Scheduling for Jobs

What is CRON?

A time-based job scheduler that originated on Unix systems. Databricks schedules jobs with cron expressions written in Quartz syntax, which adds a leading seconds field to the five classic Unix fields.

Syntax

┌───────────── second (0 - 59)
│ ┌───────────── minute (0 - 59)
│ │ ┌───────────── hour (0 - 23)
│ │ │ ┌───────────── day of month (1 - 31)
│ │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
│ │ │ │ │ ┌───────────── day of week (1 - 7 or SUN-SAT)
│ │ │ │ │ │
* * * * * *

Exactly one of the day-of-month and day-of-week fields must be ? (“no specific value”).

Examples

Schedule                    CRON Expression
Daily at 2 AM               0 0 2 * * ?
Every Monday at midnight    0 0 0 ? * MON
Every 15 minutes            0 0/15 * * * ?

How to Set Up

  1. In Job settings → “Schedule” → “Cron Schedule”.
  2. Enter the expression (e.g., 0 0 0 * * ? for a daily midnight run).
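
Via the Jobs API, the schedule is a block on the job definition; the expression and timezone below are illustrative:

  {
    "schedule": {
      "quartz_cron_expression": "0 0 0 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    }
  }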

6. Debugging a Failed Task

Steps to Debug

  1. Check Run Logs:

    • Navigate to the failed run → “Logs” tab.
    • Look for errors (e.g., Exception: File not found).
  2. Reproduce Locally:

    • Run the notebook interactively with the same inputs.
  3. Common Issues:

    • Missing data/files.
    • Permission errors.
    • Syntax errors in code.

Example Fix

  • Error: AnalysisException: Table not found.
  • Solution: Correct table name or ensure table exists.
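
For notebook tasks, the failure message shown in the UI can also be fetched with the Jobs API (GET /api/2.1/jobs/runs/get-output for the failed task run). A trimmed, illustrative sketch of such a response for the error above (the table name is made up):

  {
    "error": "AnalysisException: Table or view not found: sales_raw",
    "error_trace": "..."
  }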

7. Setting Up a Retry Policy

Why Retry?

  • Handle transient failures (e.g., network issues).
  • Avoid manual intervention.

Configuration Options

  1. Number of Retries: Max attempts (default: 0).
  2. Retry Delay: Wait time between retries (e.g., 5 mins).

How to Set Up

  1. UI:

    • In task settings → “Retry Policy” → Set max retries and delay.
  2. API:

    {
      "retry_on_timeout": true,
      "max_retries": 3,
      "min_retry_interval_millis": 300000
    }
    

Example

  • Task fails due to temporary API outage → Retries 3x with 5-minute gaps.
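
Retries are configured per task; a sketch of where the fields from the API example above sit inside a task definition (the task key and notebook path are placeholders):

  {
    "task_key": "transform_data",
    "notebook_task": {"notebook_path": "/etl/transform_data"},
    "max_retries": 3,
    "min_retry_interval_millis": 300000,
    "retry_on_timeout": true
  }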

8. Creating Alerts for Failed Tasks

Why Alert?

  • Get notified immediately when a job fails.
  • Reduce downtime.

Alert Options

  1. Email Notifications:
    • Send alerts to individuals or groups.
  2. Webhooks:
    • Integrate with Slack, PagerDuty, etc. (see the payload sketch at the end of this section).

How to Set Up

  1. UI:

    • Navigate to Jobs → Select job → “Notifications”.
    • Add email/webhook.
  2. API:

    {
      "email_notifications": {
        "on_failure": ["user@example.com"]
      }
    }
    

Example

  • Job fails → Email sent to team@company.com.
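
Building on the API example above, webhook destinations (e.g., a Slack or PagerDuty integration registered as a notification destination in the workspace) are referenced by ID in a similar block; the ID below is a placeholder:

  {
    "webhook_notifications": {
      "on_failure": [{"id": "<notification-destination-id>"}]
    }
  }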

9. Email Alerts for Failed Tasks

How It Works

  • Databricks sends an email to specified addresses when:
    • A task fails.
    • The entire job fails.

Configuration

  1. UI:

    • Job settings → “Notifications” → Add email.
  2. Limitations:

    • Only supports email (for advanced integrations, use webhooks).

Example Email Content

Subject: Job Failed - "daily_etl" (Run ID: 123)
Details: Task "transform_data" failed at 2023-10-01 02:00.
Error: FileNotFoundError: No such file: /data/input.csv

Summary Table: Key Concepts

Topic               Key Takeaway
Multi-Task Jobs     Break workflows into modular, parallelizable tasks with dependencies.
Predecessor Tasks   Ensure tasks run in order (e.g., ingest → transform).
CRON Scheduling     Use Quartz expressions like 0 0 0 * * ? for daily midnight runs.
Retry Policies      Configure retries (e.g., 3 attempts) for transient failures.
Alerts              Notify via email/webhook when jobs fail.