Spark: `when` function
The `when` function in Spark applies conditional logic to DataFrame columns. It is often used in conjunction with `otherwise` to handle cases where the condition is not met. This mirrors `IF-ELSE` or `CASE WHEN` logic in SQL.
1. Syntax
PySpark:
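The original snippet is missing from this copy; a sketch of the general call shape, where `df`, `condition`, `value`, and `default_value` are placeholders:

```python
from pyspark.sql.functions import when

# when() returns a Column; chain extra .when(...) calls for more branches
# and close with .otherwise(...). Without otherwise(), unmatched rows
# become NULL.
df.withColumn("new_col", when(condition, value).otherwise(default_value))
```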
Spark SQL:
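In Spark SQL the same branching is written as a `CASE WHEN` expression; a placeholder sketch run through `spark.sql()`:

```python
spark.sql("""
    SELECT *,
           CASE WHEN condition THEN value ELSE default_value END AS new_col
    FROM some_table
""")
```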
2. Parameters
- `condition`: a boolean `Column` expression that determines whether the value should be applied.
- `value`: the value to assign if the condition is `True`.
- `otherwise(default_value)`: the value to assign if no condition is met; if `otherwise` is omitted, unmatched rows are set to `NULL`.
3. Return Type
- Returns a new `Column` whose values are based on the conditional logic, typically assigned with `withColumn` or `select`.
4. Examples
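The original snippets did not survive in this copy, so the examples below are reconstructed sketches. They all assume this SparkSession and sample DataFrame (the names and data are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col, lit, concat

spark = SparkSession.builder.appName("when-examples").getOrCreate()

# Small sample DataFrame used throughout the examples.
df = spark.createDataFrame(
    [("Alice", 30), ("Bob", 15), ("Carol", 65)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")  # for the Spark SQL variants
```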
Example 1: Simple Conditional Logic
PySpark:
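A sketch that labels each person as an adult or a minor:

```python
result = df.withColumn(
    "category",
    when(col("age") >= 18, "adult").otherwise("minor"),
)
result.show()
```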
Spark SQL:
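The same logic as a `CASE WHEN`, run through `spark.sql()`:

```python
spark.sql("""
    SELECT name, age,
           CASE WHEN age >= 18 THEN 'adult' ELSE 'minor' END AS category
    FROM people
""").show()
```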
Output:
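With the sample data above, both variants print:

```
+-----+---+--------+
| name|age|category|
+-----+---+--------+
|Alice| 30|   adult|
|  Bob| 15|   minor|
|Carol| 65|   adult|
+-----+---+--------+
```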
Example 2: Multiple Conditions
PySpark:
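Chained `when` clauses are checked in order and the first match wins (a sketch):

```python
result = df.withColumn(
    "age_group",
    when(col("age") < 18, "minor")
    .when(col("age") < 65, "adult")
    .otherwise("senior"),
)
result.show()
```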
Spark SQL:
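The equivalent multi-branch `CASE WHEN`:

```python
spark.sql("""
    SELECT name, age,
           CASE WHEN age < 18 THEN 'minor'
                WHEN age < 65 THEN 'adult'
                ELSE 'senior' END AS age_group
    FROM people
""").show()
```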
Output:
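Both variants print:

```
+-----+---+---------+
| name|age|age_group|
+-----+---+---------+
|Alice| 30|    adult|
|  Bob| 15|    minor|
|Carol| 65|   senior|
+-----+---+---------+
```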
Example 3: Nested Conditions
PySpark:
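A `when` expression can itself be the value of another `when`, nesting the branching (a sketch that reproduces Example 2's grouping):

```python
result = df.withColumn(
    "age_group",
    when(
        col("age") >= 18,
        when(col("age") >= 65, "senior").otherwise("adult"),
    ).otherwise("minor"),
)
result.show()
```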
Spark SQL:
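The nested `CASE WHEN` form:

```python
spark.sql("""
    SELECT name, age,
           CASE WHEN age >= 18
                THEN CASE WHEN age >= 65 THEN 'senior' ELSE 'adult' END
                ELSE 'minor' END AS age_group
    FROM people
""").show()
```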
Output:
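Same result as Example 2:

```
+-----+---+---------+
| name|age|age_group|
+-----+---+---------+
|Alice| 30|    adult|
|  Bob| 15|    minor|
|Carol| 65|   senior|
+-----+---+---------+
```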
Example 4: Using `when` with Other Functions
PySpark:
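Here `when` is combined with `concat` and `lit` (the functions named in the performance notes below) to build a string column; a sketch:

```python
result = df.withColumn(
    "label",
    concat(
        col("name"),
        when(col("age") >= 18, lit(" (adult)")).otherwise(lit(" (minor)")),
    ),
)
result.show()
```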
Spark SQL:
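The same transformation with `CONCAT` and `CASE WHEN`:

```python
spark.sql("""
    SELECT name, age,
           CONCAT(name,
                  CASE WHEN age >= 18 THEN ' (adult)'
                       ELSE ' (minor)' END) AS label
    FROM people
""").show()
```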
Output:
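Both variants print:

```
+-----+---+-------------+
| name|age|        label|
+-----+---+-------------+
|Alice| 30|Alice (adult)|
|  Bob| 15|  Bob (minor)|
|Carol| 65|Carol (adult)|
+-----+---+-------------+
```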
Example 5: Handling Null Values
PySpark:
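A null condition evaluates to `NULL`, which is not `True`, so such rows fall through to `otherwise`. A sketch that tests for nulls explicitly with `isNull()`, using a hypothetical DataFrame that contains a missing age:

```python
df_null = spark.createDataFrame(
    [("Dave", None), ("Eve", 40)],
    ["name", "age"],
)
df_null.createOrReplaceTempView("people_null")

result = df_null.withColumn(
    "age_filled",
    when(col("age").isNull(), lit(-1)).otherwise(col("age")),
)
result.show()
```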
Spark SQL:
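The SQL form uses `IS NULL`:

```python
spark.sql("""
    SELECT name, age,
           CASE WHEN age IS NULL THEN -1 ELSE age END AS age_filled
    FROM people_null
""").show()
```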
Output:
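Both print the following (Spark 3.4+ renders missing values as `NULL`; older versions print `null`):

```
+----+----+----------+
|name| age|age_filled|
+----+----+----------+
|Dave|NULL|        -1|
| Eve|  40|        40|
+----+----+----------+
```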
Example 6: Combining Multiple Conditions
PySpark:
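Individual comparisons can be combined with `&` (and), `|` (or), and `~` (not); each sub-condition needs its own parentheses because of Python operator precedence. A sketch:

```python
result = df.withColumn(
    "status",
    when((col("age") >= 18) & (col("age") < 65), "working age")
    .when(col("age") >= 65, "retired")
    .otherwise("student"),
)
result.show()
```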
Spark SQL:
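The SQL version combines conditions with `AND`:

```python
spark.sql("""
    SELECT name, age,
           CASE WHEN age >= 18 AND age < 65 THEN 'working age'
                WHEN age >= 65 THEN 'retired'
                ELSE 'student' END AS status
    FROM people
""").show()
```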
Output:
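Both variants print:

```
+-----+---+-----------+
| name|age|     status|
+-----+---+-----------+
|Alice| 30|working age|
|  Bob| 15|    student|
|Carol| 65|    retired|
+-----+---+-----------+
```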
5. Common Use Cases
- Creating categorical variables for machine learning models.
- Applying business rules to data (e.g., discounts, statuses).
- Handling missing or invalid data by assigning default values.
6. Performance Considerations
- Avoid overly complex nested conditions, as they can impact performance.
- Use `when` in combination with other functions (e.g., `concat`, `lit`) for advanced transformations.
7. Key Takeaways
- Purpose: the `when` function applies conditional logic to DataFrame columns, similar to `IF-ELSE` or `CASE WHEN` in SQL.
- It can handle multiple conditions and nested logic.
- Always use `otherwise` to handle cases where none of the conditions are met.
- In Spark SQL, the same logic is expressed with `CASE WHEN` statements.