The `when` command in Spark is used to apply conditional logic to DataFrame columns. It is often used in conjunction with `otherwise` to handle cases where the condition is not met. This is similar to IF-ELSE or CASE-WHEN logic in SQL.
1. Syntax
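PySpark: a minimal sketch of the general shape; the `age` column and the string values are illustrative placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# General shape: F.when(condition, value).otherwise(default_value),
# where `condition` is a boolean Column expression
expr = F.when(F.col("age") >= 18, "adult").otherwise("minor")
```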
2. Parameters
- `condition`: A boolean expression that determines when the `value` should be applied.
- `value`: The value to assign if the condition is True.
- `otherwise(default_value)`: The value to assign if the condition is False.
3. Return Type
- Returns a `Column` expression whose values are determined by the conditional logic; it is typically assigned to a DataFrame column with `withColumn`.
4. Examples
Example 1: Simple Conditional Logic
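PySpark: a minimal sketch, assuming a toy DataFrame with illustrative `name` and `age` columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 25), ("Bob", 15)], ["name", "age"])

# Label each row based on a single condition
df = df.withColumn(
    "category",
    F.when(F.col("age") >= 18, "adult").otherwise("minor"),
)
df.show()
```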
Example 2: Multiple Conditions
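PySpark: chained `when` calls are evaluated top to bottom, and the first matching condition supplies the value. A sketch with a hypothetical `score` column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 92), ("Bob", 78), ("Cara", 60)], ["name", "score"])

# The first condition that evaluates to True wins
df = df.withColumn(
    "grade",
    F.when(F.col("score") >= 90, "A")
    .when(F.col("score") >= 75, "B")
    .otherwise("C"),
)
df.show()
```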
Example 3: Nested Conditions
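PySpark: one `when` expression can be nested as the value of another. A sketch with hypothetical `age` and `income` columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(25, 60000), (30, 30000), (15, 0)], ["age", "income"])

# The outer branch splits on age; the inner branch refines adults by income
df = df.withColumn(
    "segment",
    F.when(
        F.col("age") >= 18,
        F.when(F.col("income") >= 50000, "adult-high").otherwise("adult-low"),
    ).otherwise("minor"),
)
df.show()
```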
Example 4: Using `when` with Other Functions
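PySpark: branch values can be column expressions built from other functions such as `concat` and `lit`, not just constants. A sketch with hypothetical `sku` and `qty` columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A100", 250), ("B200", 40)], ["sku", "qty"])

# Prefix high-volume items by concatenating a literal with another column
df = df.withColumn(
    "label",
    F.when(F.col("qty") > 100, F.concat(F.lit("bulk-"), F.col("sku")))
    .otherwise(F.col("sku")),
)
df.show()
```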
Example 5: Handling Null Values
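PySpark: `isNull()` produces a boolean condition, so nulls can be mapped to a default value. A sketch with a hypothetical `city` column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", "Paris"), ("Bob", None)], ["name", "city"])

# Replace missing cities with a default; keep everything else unchanged
df = df.withColumn(
    "city",
    F.when(F.col("city").isNull(), "unknown").otherwise(F.col("city")),
)
df.show()
```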
Example 6: Combining Multiple Conditions
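PySpark: individual conditions combine with `&` (and), `|` (or), and `~` (not), and each sub-condition needs its own parentheses. A sketch with hypothetical `age` and `country` columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(25, "US"), (17, "US"), (30, "FR")], ["age", "country"])

# Both conditions must hold for a row to be eligible
df = df.withColumn(
    "eligible",
    F.when((F.col("age") >= 18) & (F.col("country") == "US"), True).otherwise(False),
)
df.show()
```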
5. Common Use Cases
- Creating categorical variables for machine learning models.
- Applying business rules to data (e.g., discounts, statuses).
- Handling missing or invalid data by assigning default values.
6. Performance Considerations
- Avoid overly complex nested conditions, as they can impact performance.
- Use `when` in combination with other functions (e.g., `concat`, `lit`) for advanced transformations.
7. Key Takeaways
- Purpose: The `when` command is used to apply conditional logic to DataFrame columns, similar to IF-ELSE or CASE-WHEN in SQL.
- It can handle multiple conditions and nested logic.
- Always use `otherwise` to handle cases where none of the conditions are met; without it, unmatched rows are set to null.
- In Spark SQL, similar logic can be achieved using CASE-WHEN statements, as sketched below.
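A sketch of the equivalent Spark SQL, assuming a hypothetical `people` view with `name` and `age` columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 25), ("Bob", 15)], ["name", "age"])
df.createOrReplaceTempView("people")

# CASE-WHEN mirrors when/otherwise: WHEN <-> when(), ELSE <-> otherwise()
spark.sql("""
    SELECT name,
           CASE WHEN age >= 18 THEN 'adult' ELSE 'minor' END AS category
    FROM people
""").show()
```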