Data Engineer Associate
Production Pipelines
1. Benefits of Using Multiple Tasks in Databricks Jobs
What are Multi-Task Jobs?
Jobs in Databricks can consist of multiple tasks that run in a specified order, with dependencies between them. This creates a workflow pipeline.
Key Benefits
- Modularity: Break complex workflows into smaller, manageable tasks (e.g., ingest → transform → analyze), which makes debugging and maintenance easier.
- Parallel Execution: Independent tasks can run in parallel (e.g., processing different datasets simultaneously).
- Conditional Execution: Tasks can depend on the success/failure of previous tasks.
- Reusability: The same task can be reused across multiple jobs.
- Resource Optimization: Assign different clusters to different tasks based on workload needs.
Example Workflow
- If Task 1 fails, downstream tasks (Task 2, Task 3) are skipped.
2. Setting Up a Predecessor Task in Jobs
What is a Predecessor Task?
A task that must complete before another task (successor) can run.
How to Set Up
- In the Databricks Jobs UI:
  - Create a new job with multiple tasks.
  - In the task settings, select "Depends on" and choose the predecessor task.
- Using the Jobs API: declare the dependency in the task definition.
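The Jobs API route can be sketched as follows. This is a minimal Jobs API 2.1 payload (`POST /api/2.1/jobs/create`) in which `transform_data` names `ingest_data` as its predecessor via `depends_on`; the notebook paths are assumptions for illustration.

```python
# Sketch of a Jobs API 2.1 create payload: transform_data declares
# ingest_data as its predecessor through the depends_on field.
job_spec = {
    "name": "ingest_and_transform",
    "tasks": [
        {
            "task_key": "ingest_data",
            "notebook_task": {"notebook_path": "/pipelines/ingest"},  # assumed path
        },
        {
            "task_key": "transform_data",
            "depends_on": [{"task_key": "ingest_data"}],  # predecessor link
            "notebook_task": {"notebook_path": "/pipelines/transform"},  # assumed path
        },
    ],
}
```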
Example Scenario
- Task 1 (ingest_data): Loads raw data.
- Task 2 (transform_data): Cleans and processes data (depends on ingest_data).
3. When to Use Predecessor Tasks
Common Scenarios
- Data Dependency: A task requires output from a previous task (e.g., raw data must be ingested before transformation).
- Error Handling: If an early task fails, downstream tasks should not execute (e.g., avoid processing incomplete data).
- Cost Optimization: Skip expensive computations if upstream validation fails.
Example
- If validate_input fails, the pipeline stops early.
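This stop-early behavior can be sketched in a Jobs API 2.1 task list. With `run_if` set to `ALL_SUCCESS` (the default), a failure in `validate_input` causes the downstream task to be skipped; the task names and notebook paths here are assumptions.

```python
# Sketch: process_data runs only if all of its dependencies succeed.
# A failure in validate_input therefore skips process_data entirely.
tasks = [
    {
        "task_key": "validate_input",
        "notebook_task": {"notebook_path": "/pipelines/validate"},  # assumed path
    },
    {
        "task_key": "process_data",
        "depends_on": [{"task_key": "validate_input"}],
        "run_if": "ALL_SUCCESS",  # default: skip when a dependency fails
        "notebook_task": {"notebook_path": "/pipelines/process"},  # assumed path
    },
]
```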
4. Reviewing a Task's Execution History
Why Review Execution History?
- Debug failures.
- Monitor performance (duration, resource usage).
- Audit job runs.
How to Access
- Databricks UI:
  - Navigate to "Jobs" → Select job → "Runs" tab.
  - Click on a run to see task history.
- Key Details Available:
  - Start/end time.
  - Status (Success, Failed, Skipped).
  - Logs (stdout, stderr).
  - Cluster metrics (CPU, memory).
Example Debugging Flow
- Find failed run → Check logs.
- Identify error (e.g., FileNotFound).
- Fix issue (e.g., correct input path).
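Run history is also available programmatically via the Jobs API (`GET /api/2.1/jobs/runs/list`), which returns run metadata shaped roughly like the sample below. A small helper can pick out failed runs for triage; the sample payload here is illustrative, not real API output.

```python
def failed_runs(runs):
    """Return the runs whose result_state is FAILED."""
    return [
        r for r in runs
        if r.get("state", {}).get("result_state") == "FAILED"
    ]

# Illustrative sample of the run entries returned by runs/list.
sample = [
    {"run_id": 101, "state": {"result_state": "SUCCESS"}},
    {"run_id": 102, "state": {"result_state": "FAILED"}},
]
```

`failed_runs(sample)` keeps only run 102, which is the run whose logs you would open first.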
5. CRON Scheduling for Jobs
What is CRON?
A time-based job scheduler in Unix systems. Databricks supports CRON expressions for scheduling jobs.
Syntax
A standard expression has five fields, left to right: minute, hour, day-of-month, month, day-of-week.
Examples
| Schedule | CRON Expression |
|---|---|
| Daily at 2 AM | `0 2 * * *` |
| Every Monday | `0 0 * * 1` |
| Every 15 mins | `*/15 * * * *` |
How to Set Up
- In Job settings → "Schedule" → "Cron Schedule".
- Enter expression (e.g., 0 0 * * * for daily midnight runs).
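When scheduling through the Jobs API instead of the UI, the schedule block uses Quartz CRON syntax, which adds a leading seconds field: daily midnight becomes `0 0 0 * * ?` rather than the five-field `0 0 * * *`. A minimal sketch of the schedule object:

```python
# Sketch: schedule block in a Jobs API 2.1 job settings payload.
# Quartz CRON has six (or seven) fields, starting with seconds.
schedule = {
    "quartz_cron_expression": "0 0 0 * * ?",  # daily at midnight
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}
```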
6. Debugging a Failed Task
Steps to Debug
- Check Run Logs:
  - Navigate to the failed run → "Logs" tab.
  - Look for errors (e.g., Exception: File not found).
- Reproduce Locally:
  - Run the notebook interactively with the same inputs.
- Common Issues:
  - Missing data/files.
  - Permission errors.
  - Syntax errors in code.
Example Fix
- Error: AnalysisException: Table not found.
- Solution: Correct table name or ensure table exists.
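The common issues above lend themselves to a simple log-triage helper: scan the run's logs for known failure signatures and suggest a fix. The patterns and hint messages here are assumptions, not official Databricks error codes.

```python
import re

# Map of assumed error signatures to suggested fixes, mirroring the
# "Common Issues" list: missing data, permissions, syntax errors.
PATTERNS = {
    r"Table not found|FileNotFound": "Check input paths and table names.",
    r"PERMISSION_DENIED|AccessDenied": "Check permissions on the data source.",
    r"SyntaxError": "Fix the syntax error in the notebook code.",
}

def triage(log_text):
    """Return suggested fixes for known error signatures found in a log."""
    return [hint for pat, hint in PATTERNS.items() if re.search(pat, log_text)]
```

For the example above, `triage("AnalysisException: Table not found")` points straight at the table-name fix.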
7. Setting Up a Retry Policy
Why Retry?
- Handle transient failures (e.g., network issues).
- Avoid manual intervention.
Configuration Options
- Number of Retries: Max attempts (default: 0).
- Retry Delay: Wait time between retries (e.g., 5 mins).
How to Set Up
- UI:
  - In task settings → "Retry Policy" → Set max retries and delay.
- API: set the retry fields on the task definition.
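In a Jobs API 2.1 payload, retries are configured per task with `max_retries`, `min_retry_interval_millis`, and `retry_on_timeout`. A sketch matching the example below (3 attempts, 5-minute gaps); the task name is an assumption.

```python
# Sketch: retry settings on a single task in a Jobs API 2.1 payload.
task = {
    "task_key": "call_external_api",  # assumed task name
    "max_retries": 3,                  # retry up to 3 times
    "min_retry_interval_millis": 5 * 60 * 1000,  # wait 5 minutes between attempts
    "retry_on_timeout": True,
}
```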
Example
- Task fails due to temporary API outage β Retries 3x with 5-minute gaps.
8. Creating Alerts for Failed Tasks
Why Alert?
- Get notified immediately when a job fails.
- Reduce downtime.
Alert Options
- Email Notifications:
- Send alerts to individuals or groups.
- Webhooks:
- Integrate with Slack, PagerDuty, etc.
How to Set Up
- UI:
  - Navigate to Jobs → Select job → "Alerts" tab.
  - Add email/webhook.
- API: set notification settings on the job.
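Via the Jobs API, alerts are configured with the `email_notifications` and `webhook_notifications` fields on the job. The webhook ID below is a placeholder assumption: webhook destinations are created separately (e.g., for Slack or PagerDuty) and referenced by ID.

```python
# Sketch: notification settings in a Jobs API 2.1 job payload.
notifications = {
    "email_notifications": {
        "on_failure": ["team@company.com"],  # address from the example below
    },
    "webhook_notifications": {
        "on_failure": [{"id": "example-webhook-id"}],  # placeholder destination ID
    },
}
```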
Example
- Job fails → Email sent to team@company.com.
9. Email Alerts for Failed Tasks
How It Works
- Databricks sends an email to specified addresses when:
- A task fails.
- The entire job fails.
Configuration
- UI:
  - Job settings → "Notifications" → Add email.
- Limitations:
  - Only supports email (for advanced integrations, use webhooks).
Example Email Content
The notification email typically includes the job name, the run ID, the failure state, and a link to the run page in the workspace.
Summary Table: Key Concepts
| Topic | Key Takeaway |
|---|---|
| Multi-Task Jobs | Break workflows into modular, parallelizable tasks with dependencies. |
| Predecessor Tasks | Ensure tasks run in order (e.g., ingest → transform). |
| CRON Scheduling | Use expressions like `0 0 * * *` for daily runs. |
| Retry Policies | Configure retries (e.g., 3 attempts) for transient failures. |
| Alerts | Notify via email/webhook when jobs fail. |