Data Engineer Associate
Production Pipelines
1. Benefits of Using Multiple Tasks in Databricks Jobs
What are Multi-Task Jobs?
Jobs in Databricks can consist of multiple tasks that run in a specified order, with dependencies between them. This creates a workflow pipeline.
Key Benefits
- Modularity: Break complex workflows into smaller, manageable tasks (e.g., ingest → transform → analyze), which makes debugging and maintenance easier.
- Parallel Execution: Independent tasks can run in parallel (e.g., processing different datasets simultaneously).
- Conditional Execution: Tasks can depend on the success/failure of previous tasks.
- Reusability: The same task can be reused across multiple jobs.
- Resource Optimization: Assign different clusters to different tasks based on workload needs.
Example Workflow
- If Task 1 fails, downstream tasks (Task 2, Task 3) are skipped.
2. Setting Up a Predecessor Task in Jobs
What is a Predecessor Task?
A task that must complete before another task (successor) can run.
How to Set Up
- In the Databricks Jobs UI:
  - Create a new job with multiple tasks.
  - In the task settings, select "Depends on" and choose the predecessor task.
- Using the Jobs API: declare the dependency in the task definition.
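The Jobs API route can be sketched as follows. This is a minimal Jobs API 2.1 payload (`POST /api/2.1/jobs/create`) in which `transform_data` names `ingest_data` as its predecessor via `depends_on`; the notebook paths are assumptions for illustration.

```python
# Sketch of a Jobs API 2.1 create payload: transform_data declares
# ingest_data as its predecessor through the depends_on field.
job_spec = {
    "name": "ingest_and_transform",
    "tasks": [
        {
            "task_key": "ingest_data",
            "notebook_task": {"notebook_path": "/pipelines/ingest"},  # assumed path
        },
        {
            "task_key": "transform_data",
            "depends_on": [{"task_key": "ingest_data"}],  # predecessor link
            "notebook_task": {"notebook_path": "/pipelines/transform"},  # assumed path
        },
    ],
}
```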
Example Scenario
- Task 1 (ingest_data): Loads raw data.
- Task 2 (transform_data): Cleans and processes data (depends on ingest_data).
3. When to Use Predecessor Tasks
Common Scenarios
- Data Dependency: A task requires output from a previous task (e.g., raw data must be ingested before transformation).
- Error Handling: If an early task fails, downstream tasks should not execute (e.g., avoid processing incomplete data).
- Cost Optimization: Skip expensive computations if upstream validation fails.
Example
- If validate_input fails, the pipeline stops early.
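This stop-early behavior can be sketched in a Jobs API 2.1 task list. With `run_if` set to `ALL_SUCCESS` (the default), a failure in `validate_input` causes the downstream task to be skipped; the task names and notebook paths here are assumptions.

```python
# Sketch: process_data runs only if all of its dependencies succeed.
# A failure in validate_input therefore skips process_data entirely.
tasks = [
    {
        "task_key": "validate_input",
        "notebook_task": {"notebook_path": "/pipelines/validate"},  # assumed path
    },
    {
        "task_key": "process_data",
        "depends_on": [{"task_key": "validate_input"}],
        "run_if": "ALL_SUCCESS",  # default: skip when a dependency fails
        "notebook_task": {"notebook_path": "/pipelines/process"},  # assumed path
    },
]
```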
4. Reviewing a Task's Execution History
Why Review Execution History?
- Debug failures.
- Monitor performance (duration, resource usage).
- Audit job runs.
How to Access
- Databricks UI:
  - Navigate to "Jobs" → Select job → "Runs" tab.
  - Click on a run to see task history.
- Key Details Available:
  - Start/end time.
  - Status (Success, Failed, Skipped).
  - Logs (stdout, stderr).
  - Cluster metrics (CPU, memory).
Example Debugging Flow
- Find failed run → Check logs.
- Identify error (e.g., FileNotFound).
- Fix issue (e.g., correct input path).
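Run history is also available programmatically via the Jobs API (`GET /api/2.1/jobs/runs/list`), which returns run metadata shaped roughly like the sample below. A small helper can pick out failed runs for triage; the sample payload here is illustrative, not real API output.

```python
def failed_runs(runs):
    """Return the runs whose result_state is FAILED."""
    return [
        r for r in runs
        if r.get("state", {}).get("result_state") == "FAILED"
    ]

# Illustrative sample of the run entries returned by runs/list.
sample = [
    {"run_id": 101, "state": {"result_state": "SUCCESS"}},
    {"run_id": 102, "state": {"result_state": "FAILED"}},
]
```

`failed_runs(sample)` keeps only run 102, which is the run whose logs you would open first.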
5. CRON Scheduling for Jobs
What is CRON?
A time-based job scheduler in Unix systems. Databricks supports CRON expressions for scheduling jobs.
Syntax
A standard expression has five fields, left to right: minute, hour, day-of-month, month, day-of-week.
Examples
| Schedule | CRON Expression |
|---|---|
| Daily at 2 AM | `0 2 * * *` |
| Every Monday | `0 0 * * 1` |
| Every 15 mins | `*/15 * * * *` |
How to Set Up
- In Job settings → "Schedule" → "Cron Schedule".
- Enter expression (e.g., 0 0 * * * for daily midnight runs).
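When scheduling through the Jobs API instead of the UI, the schedule block uses Quartz CRON syntax, which adds a leading seconds field: daily midnight becomes `0 0 0 * * ?` rather than the five-field `0 0 * * *`. A minimal sketch of the schedule object:

```python
# Sketch: schedule block in a Jobs API 2.1 job settings payload.
# Quartz CRON has six (or seven) fields, starting with seconds.
schedule = {
    "quartz_cron_expression": "0 0 0 * * ?",  # daily at midnight
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}
```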
6. Debugging a Failed Task
Steps to Debug
- Check Run Logs:
  - Navigate to the failed run → "Logs" tab.
  - Look for errors (e.g., Exception: File not found).
- Reproduce Locally:
  - Run the notebook interactively with the same inputs.
- Common Issues:
  - Missing data/files.
  - Permission errors.
  - Syntax errors in code.
Example Fix
- Error: AnalysisException: Table not found.
- Solution: Correct table name or ensure table exists.
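The common issues above lend themselves to a simple log-triage helper: scan the run's logs for known failure signatures and suggest a fix. The patterns and hint messages here are assumptions, not official Databricks error codes.

```python
import re

# Map of assumed error signatures to suggested fixes, mirroring the
# "Common Issues" list: missing data, permissions, syntax errors.
PATTERNS = {
    r"Table not found|FileNotFound": "Check input paths and table names.",
    r"PERMISSION_DENIED|AccessDenied": "Check permissions on the data source.",
    r"SyntaxError": "Fix the syntax error in the notebook code.",
}

def triage(log_text):
    """Return suggested fixes for known error signatures found in a log."""
    return [hint for pat, hint in PATTERNS.items() if re.search(pat, log_text)]
```

For the example above, `triage("AnalysisException: Table not found")` points straight at the table-name fix.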
7. Setting Up a Retry Policy
Why Retry?
- Handle transient failures (e.g., network issues).
- Avoid manual intervention.
Configuration Options
- Number of Retries: Max attempts (default: 0).
- Retry Delay: Wait time between retries (e.g., 5 mins).
How to Set Up
- UI:
  - In task settings → "Retry Policy" → Set max retries and delay.
- API: set the retry fields on the task definition.
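In a Jobs API 2.1 payload, retries are configured per task with `max_retries`, `min_retry_interval_millis`, and `retry_on_timeout`. A sketch matching the example below (3 attempts, 5-minute gaps); the task name is an assumption.

```python
# Sketch: retry settings on a single task in a Jobs API 2.1 payload.
task = {
    "task_key": "call_external_api",  # assumed task name
    "max_retries": 3,                  # retry up to 3 times
    "min_retry_interval_millis": 5 * 60 * 1000,  # wait 5 minutes between attempts
    "retry_on_timeout": True,
}
```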
Example
- Task fails due to temporary API outage β Retries 3x with 5-minute gaps.
8. Creating Alerts for Failed Tasks
Why Alert?
- Get notified immediately when a job fails.
- Reduce downtime.
Alert Options
- Email Notifications:
- Send alerts to individuals or groups.
- Webhooks:
- Integrate with Slack, PagerDuty, etc.
How to Set Up
- UI:
  - Navigate to Jobs → Select job → "Alerts" tab.
  - Add email/webhook.
- API: set notification settings on the job.
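Via the Jobs API, alerts are configured with the `email_notifications` and `webhook_notifications` fields on the job. The webhook ID below is a placeholder assumption: webhook destinations are created separately (e.g., for Slack or PagerDuty) and referenced by ID.

```python
# Sketch: notification settings in a Jobs API 2.1 job payload.
notifications = {
    "email_notifications": {
        "on_failure": ["team@company.com"],  # address from the example below
    },
    "webhook_notifications": {
        "on_failure": [{"id": "example-webhook-id"}],  # placeholder destination ID
    },
}
```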
Example
- Job fails → Email sent to team@company.com.
9. Email Alerts for Failed Tasks
How It Works
- Databricks sends an email to specified addresses when:
- A task fails.
- The entire job fails.
Configuration
- UI:
  - Job settings → "Notifications" → Add email.
- Limitations:
  - Only supports email (for advanced integrations, use webhooks).
Example Email Content
The notification email typically includes the job name, the run ID, the failure state, and a link to the run page in the workspace.
Summary Table: Key Concepts
| Topic | Key Takeaway |
|---|---|
| Multi-Task Jobs | Break workflows into modular, parallelizable tasks with dependencies. |
| Predecessor Tasks | Ensure tasks run in order (e.g., ingest → transform). |
| CRON Scheduling | Use expressions like `0 0 * * *` for daily runs. |
| Retry Policies | Configure retries (e.g., 3 attempts) for transient failures. |
| Alerts | Notify via email/webhook when jobs fail. |