Documentation Index
Fetch the complete documentation index at: https://rajanand.org/llms.txt
Use this file to discover all available pages before exploring further.
The when command in Spark is used to apply conditional logic to DataFrame columns. It is often used in conjunction with otherwise to handle cases where the condition is not met. This is similar to the IF-ELSE or CASE-WHEN logic in SQL.
1. Syntax
PySpark:
from pyspark.sql.functions import when
df.withColumn("new_column", when(condition, value).otherwise(default_value))
Spark SQL:
SELECT CASE
WHEN condition THEN value
ELSE default_value
END AS new_column
FROM table_name;
2. Parameters
- condition: A boolean expression that determines when the
value should be applied.
- value: The value to assign if the condition is
True.
- otherwise(default_value): The value to assign if the condition is
False.
3. Return Type
- Returns a new column with values based on the conditional logic.
4. Examples
Example 1: Simple Conditional Logic
PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
spark = SparkSession.builder.appName("WhenExample").getOrCreate()
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Add a new column 'Status' based on age
df_with_status = df.withColumn("Status",
when(col("Age") < 30, "Young")
.otherwise("Adult"))
df_with_status.show()
Spark SQL:
SELECT Name, Age,
CASE
WHEN Age < 30 THEN 'Young'
ELSE 'Adult'
END AS Status
FROM people;
Output:
+-------+---+------+
| Name|Age|Status|
+-------+---+------+
| Anand| 25| Young|
| Bala| 30| Adult|
|Kavitha| 28| Young|
| Raj| 35| Adult|
+-------+---+------+
Example 2: Multiple Conditions
PySpark:
# Add a new column 'AgeGroup' with multiple conditions
df_with_age_group = df.withColumn("AgeGroup",
when(col("Age") < 25, "Young")
.when((col("Age") >= 25) & (col("Age") < 35), "Middle-aged")
.otherwise("Senior"))
df_with_age_group.show()
Spark SQL:
SELECT Name, Age,
CASE
WHEN Age < 25 THEN 'Young'
WHEN Age >= 25 AND Age < 35 THEN 'Middle-aged'
ELSE 'Senior'
END AS AgeGroup
FROM people;
Output:
+-------+---+------------+
| Name|Age| AgeGroup|
+-------+---+------------+
| Anand| 25|Middle-aged|
| Bala| 30|Middle-aged|
|Kavitha| 28|Middle-aged|
| Raj| 35| Senior|
+-------+---+------------+
Example 3: Nested Conditions
PySpark:
# Add a new column 'Category' with nested conditions
df_with_category = df.withColumn("Category",
when(col("Age") < 25, "Young")
.when((col("Age") >= 25) & (col("Age") < 35),
when(col("Name").startswith("K"), "Middle-aged-K")
.otherwise("Middle-aged"))
.otherwise("Senior"))
df_with_category.show()
Spark SQL:
SELECT Name, Age,
CASE
WHEN Age < 25 THEN 'Young'
WHEN Age >= 25 AND Age < 35 THEN
CASE
WHEN Name LIKE 'K%' THEN 'Middle-aged-K'
ELSE 'Middle-aged'
END
ELSE 'Senior'
END AS Category
FROM people;
Output:
+-------+---+---------------+
| Name|Age| Category|
+-------+---+---------------+
| Anand| 25| Middle-aged|
| Bala| 30| Middle-aged|
|Kavitha| 28| Middle-aged-K|
| Raj| 35| Senior|
+-------+---+---------------+
Example 4: Using when with Other Functions
PySpark:
from pyspark.sql.functions import concat, lit
# Add a new column 'Description' using `when` and `concat`
df_with_description = df.withColumn("Description",
when(col("Age") < 30, concat(col("Name"), lit(" is young")))
.otherwise(concat(col("Name"), lit(" is an adult"))))
df_with_description.show()
Spark SQL:
SELECT Name, Age,
CASE
WHEN Age < 30 THEN CONCAT(Name, ' is young')
ELSE CONCAT(Name, ' is an adult')
END AS Description
FROM people;
Output:
+-------+---+-------------------+
| Name|Age| Description|
+-------+---+-------------------+
| Anand| 25| Anand is young|
| Bala| 30| Bala is an adult|
|Kavitha| 28|Kavitha is young|
| Raj| 35| Raj is an adult|
+-------+---+-------------------+
Example 5: Handling Null Values
PySpark:
from pyspark.sql.functions import when, col, lit
data = [("Anand", 25), ("Bala", None), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Replace null values in 'Age' with a default value
df_with_default_age = df.withColumn("Age",
when(col("Age").isNull(), 0)
.otherwise(col("Age")))
df_with_default_age.show()
Spark SQL:
SELECT Name,
CASE
WHEN Age IS NULL THEN 0
ELSE Age
END AS Age
FROM people;
Output:
+-------+---+
| Name|Age|
+-------+---+
| Anand| 25|
| Bala| 0|
|Kavitha| 28|
| Raj| 35|
+-------+---+
Example 6: Combining Multiple Conditions
PySpark:
# Add a new column 'Eligibility' based on multiple conditions
df_with_eligibility = df.withColumn("Eligibility",
when((col("Age") >= 18) & (col("Age") <= 60), "Eligible")
.otherwise("Not Eligible"))
df_with_eligibility.show()
Spark SQL:
SELECT Name, Age,
CASE
WHEN Age >= 18 AND Age <= 60 THEN 'Eligible'
ELSE 'Not Eligible'
END AS Eligibility
FROM people;
Output:
+-------+---+-----------+
| Name|Age|Eligibility|
+-------+---+-----------+
| Anand| 25| Eligible|
| Bala| 30| Eligible|
|Kavitha| 28| Eligible|
| Raj| 35| Eligible|
+-------+---+-----------+
5. Common Use Cases
- Creating categorical variables for machine learning models.
- Applying business rules to data (e.g., discounts, statuses).
- Handling missing or invalid data by assigning default values.
- Avoid overly complex nested conditions, as they can impact performance.
- Use
when in combination with other functions (e.g., concat, lit) for advanced transformations.
7. Key Takeaways
- Purpose: The
when command is used to apply conditional logic to DataFrame columns, similar to IF-ELSE or CASE-WHEN in SQL.
- It can handle multiple conditions and nested logic.
- Always use
otherwise to handle cases where none of the conditions are met.
- In Spark SQL, similar logic can be achieved using
CASE-WHEN statements.