The `when` command in Spark is used to apply conditional logic to DataFrame columns. It is often used in conjunction with `otherwise` to handle cases where the condition is not met. This is similar to the `IF-ELSE` or `CASE-WHEN` logic in SQL.
1. Syntax
PySpark:
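A minimal sketch of the usual call pattern; `condition`, `value`, and `default_value` are placeholders, not runnable names:

```python
from pyspark.sql import functions as F

# when() returns a Column expression; otherwise() supplies the fallback value
new_col = F.when(condition, value).otherwise(default_value)
```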
2. Parameters

- `condition`: A boolean expression that determines when the `value` should be applied.
- `value`: The value to assign if the condition is `True`.
- `otherwise(default_value)`: The value to assign if the condition is `False`.
3. Return Type
- Returns a new column with values based on the conditional logic.
4. Examples
Example 1: Simple Conditional Logic
PySpark:
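A minimal sketch of what this might look like; the `age` column and the cutoff are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 17)], ["name", "age"])

# Assign "adult" when age >= 18, and "minor" otherwise
df = df.withColumn("category", F.when(F.col("age") >= 18, "adult").otherwise("minor"))
df.show()
```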
Example 2: Multiple Conditions

PySpark:
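One possible version, chaining several `when` clauses; the first condition that matches wins, and `otherwise` covers the rest (the `score` column and grade cutoffs are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 95), ("Bob", 72), ("Cara", 48)], ["name", "score"])

# Conditions are checked in order; otherwise() catches everything left over
df = df.withColumn(
    "grade",
    F.when(F.col("score") >= 90, "A")
     .when(F.col("score") >= 60, "B")
     .otherwise("F"),
)
df.show()
```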
Example 3: Nested Conditions

PySpark:
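A sketch with one `when` nested inside another's value branch; the `country`/`age` rule here is purely illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", "US", 25), ("Bob", "US", 19), ("Chloe", "FR", 19)],
    ["name", "country", "age"],
)

# The inner when() is only evaluated for rows matching the outer condition
df = df.withColumn(
    "drinking_age",
    F.when(
        F.col("country") == "US",
        F.when(F.col("age") >= 21, "allowed").otherwise("not_allowed"),
    ).otherwise("check_local_rules"),
)
df.show()
```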
Example 4: Using `when` with Other Functions

PySpark:
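A possible version combining `when` with `concat` and `lit`; the `vip` flag and the label format are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", True), ("Bob", False)], ["name", "vip"])

# Build a display label with concat()/lit() only for VIP rows
df = df.withColumn(
    "label",
    F.when(F.col("vip"), F.concat(F.lit("VIP: "), F.col("name"))).otherwise(F.col("name")),
)
df.show()
```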
Example 5: Handling Null Values
PySpark:
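One way to write this, assuming a nullable `salary` column and a default of 0:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 5000), ("Bob", None)], ["name", "salary"])

# Replace missing salaries with a default value
df = df.withColumn(
    "salary_filled",
    F.when(F.col("salary").isNull(), 0).otherwise(F.col("salary")),
)
df.show()
```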
Example 6: Combining Multiple Conditions

PySpark:
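A sketch joining two boolean tests with `&`; note that PySpark requires parentheses around each condition (the eligibility rule is made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 34, 45000), ("Bob", 17, 0)],
    ["name", "age", "income"],
)

# Both conditions must hold; use | for "or" and ~ for "not"
df = df.withColumn(
    "eligible",
    F.when((F.col("age") >= 18) & (F.col("income") > 30000), "yes").otherwise("no"),
)
df.show()
```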
5. Common Use Cases

- Creating categorical variables for machine learning models.
- Applying business rules to data (e.g., discounts, statuses).
- Handling missing or invalid data by assigning default values.
6. Performance Considerations
- Avoid overly complex nested conditions, as they can impact performance.
- Use `when` in combination with other functions (e.g., `concat`, `lit`) for advanced transformations.
7. Key Takeaways
- Purpose: The `when` command is used to apply conditional logic to DataFrame columns, similar to `IF-ELSE` or `CASE-WHEN` in SQL.
- It can handle multiple conditions and nested logic.
- Always use `otherwise` to handle cases where none of the conditions are met.
- In Spark SQL, similar logic can be achieved using `CASE-WHEN` statements; see the sketch after this list.
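For comparison, a sketch of the adult/minor rule from Example 1 expressed as Spark SQL `CASE WHEN` (the `people` view name is invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 17)], ["name", "age"])
df.createOrReplaceTempView("people")

# Same conditional logic as Example 1, written in SQL
spark.sql("""
    SELECT name,
           CASE WHEN age >= 18 THEN 'adult' ELSE 'minor' END AS category
    FROM people
""").show()
```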