The when function in Spark is used to apply conditional logic to DataFrame columns. It is typically chained with otherwise to supply a value when the condition is not met, similar to IF-ELSE or CASE-WHEN logic in SQL.

1. Syntax

PySpark:

from pyspark.sql.functions import when

df.withColumn("new_column", when(condition, value).otherwise(default_value))

Spark SQL:

SELECT CASE 
           WHEN condition THEN value 
           ELSE default_value 
       END AS new_column 
FROM table_name;

2. Parameters

  • condition: A boolean Column expression that determines whether the value is applied.
  • value: The value (a literal or a Column) to assign when the condition is True.
  • otherwise(default_value): The value to assign when no condition matches. If otherwise is omitted, unmatched rows are set to null.

3. Return Type

  • Returns a Column object whose values are derived from the conditional logic; it is typically assigned with withColumn or select, as the sketch below illustrates.
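
A minimal sketch (assuming a DataFrame df with a numeric Age column) showing that when returns a Column and that omitting otherwise leaves unmatched rows as null:

from pyspark.sql.functions import when, col

# when(...) returns a Column object; nothing is evaluated until an action runs
status_col = when(col("Age") < 30, "Young")  # note: no .otherwise(...)

# Rows where Age >= 30 (or Age is null) get null in 'Status'
df.select("Age", status_col.alias("Status")).show()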

4. Examples

Example 1: Simple Conditional Logic

PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.appName("WhenExample").getOrCreate()

data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Add a new column 'Status' based on age
df_with_status = df.withColumn("Status", 
                               when(col("Age") < 30, "Young")
                               .otherwise("Adult"))
df_with_status.show()

Spark SQL:

SELECT Name, Age,
       CASE 
           WHEN Age < 30 THEN 'Young'
           ELSE 'Adult'
       END AS Status
FROM people;

Output:

+-------+---+------+
|   Name|Age|Status|
+-------+---+------+
|  Anand| 25| Young|
|   Bala| 30| Adult|
|Kavitha| 28| Young|
|    Raj| 35| Adult|
+-------+---+------+
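
Note: the Spark SQL versions in these examples assume the DataFrame has been registered as a temporary view named people, for example:

# Register the DataFrame so SQL queries can reference it as 'people'
df.createOrReplaceTempView("people")

spark.sql("""
    SELECT Name, Age,
           CASE WHEN Age < 30 THEN 'Young' ELSE 'Adult' END AS Status
    FROM people
""").show()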

Example 2: Multiple Conditions

PySpark:

# Add a new column 'AgeGroup' with multiple conditions
df_with_age_group = df.withColumn("AgeGroup", 
                                  when(col("Age") < 25, "Young")
                                  .when((col("Age") >= 25) & (col("Age") < 35), "Middle-aged")
                                  .otherwise("Senior"))
df_with_age_group.show()

Spark SQL:

SELECT Name, Age,
       CASE 
           WHEN Age < 25 THEN 'Young'
           WHEN Age >= 25 AND Age < 35 THEN 'Middle-aged'
           ELSE 'Senior'
       END AS AgeGroup
FROM people;

Output:

+-------+---+-----------+
|   Name|Age|   AgeGroup|
+-------+---+-----------+
|  Anand| 25|Middle-aged|
|   Bala| 30|Middle-aged|
|Kavitha| 28|Middle-aged|
|    Raj| 35|     Senior|
+-------+---+-----------+

Example 3: Nested Conditions

PySpark:

# Add a new column 'Category' with nested conditions
df_with_category = df.withColumn("Category", 
                                when(col("Age") < 25, "Young")
                                .when((col("Age") >= 25) & (col("Age") < 35), 
                                    when(col("Name").startswith("K"), "Middle-aged-K")
                                    .otherwise("Middle-aged"))
                                .otherwise("Senior"))
df_with_category.show()

Spark SQL:

SELECT Name, Age,
       CASE 
           WHEN Age < 25 THEN 'Young'
           WHEN Age >= 25 AND Age < 35 THEN 
               CASE 
                   WHEN Name LIKE 'K%' THEN 'Middle-aged-K'
                   ELSE 'Middle-aged'
               END
           ELSE 'Senior'
       END AS Category
FROM people;

Output:

+-------+---+-------------+
|   Name|Age|     Category|
+-------+---+-------------+
|  Anand| 25|  Middle-aged|
|   Bala| 30|  Middle-aged|
|Kavitha| 28|Middle-aged-K|
|    Raj| 35|       Senior|
+-------+---+-------------+

Example 4: Using when with Other Functions

PySpark:

from pyspark.sql.functions import concat, lit

# Add a new column 'Description' using `when` and `concat`
df_with_description = df.withColumn("Description", 
                                    when(col("Age") < 30, concat(col("Name"), lit(" is young")))
                                    .otherwise(concat(col("Name"), lit(" is an adult"))))
df_with_description.show()

Spark SQL:

SELECT Name, Age,
       CASE 
           WHEN Age < 30 THEN CONCAT(Name, ' is young')
           ELSE CONCAT(Name, ' is an adult')
       END AS Description
FROM people;

Output:

+-------+---+----------------+
|   Name|Age|     Description|
+-------+---+----------------+
|  Anand| 25|  Anand is young|
|   Bala| 30|Bala is an adult|
|Kavitha| 28|Kavitha is young|
|    Raj| 35| Raj is an adult|
+-------+---+----------------+

Example 5: Handling Null Values

PySpark:

from pyspark.sql.functions import when, col, lit

data = [("Anand", 25), ("Bala", None), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Replace null values in 'Age' with a default value
df_with_default_age = df.withColumn("Age", 
                                    when(col("Age").isNull(), 0)
                                    .otherwise(col("Age")))
df_with_default_age.show()

Spark SQL:

SELECT Name, 
       CASE 
           WHEN Age IS NULL THEN 0
           ELSE Age
       END AS Age
FROM people;

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 25|
|   Bala|  0|
|Kavitha| 28|
|    Raj| 35|
+-------+---+
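
For plain null replacement like this, the same result is often written more concisely with coalesce or DataFrame.fillna:

from pyspark.sql.functions import coalesce, col, lit

# coalesce returns the first non-null value among its arguments
df_with_default_age = df.withColumn("Age", coalesce(col("Age"), lit(0)))

# or equivalently:
df_with_default_age = df.fillna({"Age": 0})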

Example 6: Combining Multiple Conditions

PySpark:

# Add a new column 'Eligibility' based on multiple conditions
# (uses the original DataFrame from Example 1, where Bala's age is 30)
df_with_eligibility = df.withColumn("Eligibility", 
                                    when((col("Age") >= 18) & (col("Age") <= 60), "Eligible")
                                    .otherwise("Not Eligible"))
df_with_eligibility.show()

Spark SQL:

SELECT Name, Age,
       CASE 
           WHEN Age >= 18 AND Age <= 60 THEN 'Eligible'
           ELSE 'Not Eligible'
       END AS Eligibility
FROM people;

Output:

+-------+---+-----------+
|   Name|Age|Eligibility|
+-------+---+-----------+
|  Anand| 25|   Eligible|
|   Bala| 30|   Eligible|
|Kavitha| 28|   Eligible|
|    Raj| 35|   Eligible|
+-------+---+-----------+

5. Common Use Cases

  • Creating categorical variables for machine learning models.
  • Applying business rules to data (e.g., discounts, statuses; see the sketch after this list).
  • Handling missing or invalid data by assigning default values.
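
As an illustration of the business-rules case, a tiered discount might look like this sketch (the orders data and thresholds are hypothetical):

from pyspark.sql.functions import when, col

orders = spark.createDataFrame(
    [(101, 450.0), (102, 1200.0), (103, 80.0)],
    ["OrderId", "Amount"],
)

# Hypothetical rule: 10% discount above 1000, 5% above 400, none otherwise
orders_with_discount = orders.withColumn(
    "Discount",
    when(col("Amount") > 1000, 0.10)
    .when(col("Amount") > 400, 0.05)
    .otherwise(0.0),
)
orders_with_discount.show()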

6. Performance Considerations

  • Avoid overly complex nested conditions; they are harder to read and produce deeper expression trees for the optimizer. A flat chain of when calls is usually equivalent (see the sketch after this list).
  • Combine when with other functions (e.g., concat, lit) for advanced transformations, as shown in Example 4.
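
For example, the nested logic from Example 3 can be flattened into a single chain; a sketch, equivalent to the earlier output:

# Flattened equivalent of Example 3: each branch is a mutually exclusive condition
df_with_category = df.withColumn(
    "Category",
    when(col("Age") < 25, "Young")
    .when((col("Age") >= 25) & (col("Age") < 35) & col("Name").startswith("K"),
          "Middle-aged-K")
    .when((col("Age") >= 25) & (col("Age") < 35), "Middle-aged")
    .otherwise("Senior"),
)
df_with_category.show()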

7. Key Takeaways

  1. Purpose: The when function applies conditional logic to DataFrame columns, similar to IF-ELSE or CASE-WHEN in SQL.
  2. It can handle multiple conditions and nested logic.
  3. Always chain otherwise to handle rows that match no condition; without it, those rows are set to null.
  4. In Spark SQL, the same logic is expressed with CASE-WHEN statements.