The filter() and where() commands in Spark filter rows from a DataFrame based on a specified condition. where() is an alias of filter(), so the two are fully interchangeable and produce the same result. The primary purpose of these commands is to select the subset of rows that meet a given condition.


1. Syntax

df.filter(condition)
df.where(condition)

2. Parameters

  • condition: A boolean Column expression or a SQL expression string that specifies the filtering condition. Rows that satisfy this condition are included in the output; the equivalent forms are sketched below.
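
A minimal sketch of the equivalent ways to express a condition, assuming a DataFrame df with an Age column as in the examples below:

from pyspark.sql.functions import col

df.filter(df.Age > 30)      # attribute-style column reference
df.filter(df["Age"] > 30)   # bracket-style column reference
df.filter(col("Age") > 30)  # col() column reference
df.filter("Age > 30")       # SQL expression string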

3. Return Type

  • Returns a new DataFrame containing only the rows that satisfy the given condition.
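
Note that filter() does not modify the source DataFrame; like all Spark transformations, it returns a new, lazily evaluated DataFrame. A minimal sketch, assuming the df defined in Example 1 below:

filtered_df = df.filter(df.Age > 30)

df.count()           # 3 -- the original DataFrame is unchanged
filtered_df.count()  # 2 -- only the rows that satisfied the condition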

4. Examples

Example 1: Filtering Rows Based on a Single Condition

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FilterExample").getOrCreate()

data = [("Anand", 34), ("Bala", 45), ("Kavitha", 29)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Using filter()
filtered_df = df.filter(df.Age > 30)
filtered_df.show()

# Using where()
filtered_df = df.where(df.Age > 30)
filtered_df.show()

Spark SQL:

CREATE TABLE people (Name STRING, Age INT);

INSERT INTO people VALUES
("Anand", 34),
("Bala", 45),
("Kavitha", 29);

SELECT * FROM people WHERE Age > 30;

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
| Bala| 45|
+-----+---+

Example 2: Filtering Rows Based on Multiple Conditions

PySpark:

# Using filter() with multiple conditions
filtered_df = df.filter((df.Age > 30) & (df.Name == "Anand"))
filtered_df.show()

# Using where() with multiple conditions
filtered_df = df.where((df.Age > 30) & (df.Name == "Anand"))
filtered_df.show()

Spark SQL:

SELECT * FROM people WHERE Age > 30 AND Name = 'Anand';

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
+-----+---+

Example 3: Filtering Rows Using SQL-like Syntax

PySpark:

# Using filter() with SQL-like syntax
filtered_df = df.filter("Age > 30 AND Name = 'Anand'")
filtered_df.show()

# Using where() with SQL-like syntax
filtered_df = df.where("Age > 30 AND Name = 'Anand'")
filtered_df.show()

Spark SQL:

SELECT * FROM people WHERE Age > 30 AND Name = 'Anand';

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
+-----+---+

Example 4: Filtering Rows Using String Functions

PySpark:

from pyspark.sql.functions import col

# Filter rows where the name starts with 'A'
filtered_df = df.filter(col("Name").startswith("A"))
filtered_df.show()

Spark SQL:

SELECT * FROM people WHERE Name LIKE 'A%';

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
+-----+---+

Example 5: Filtering Rows with Null Values

PySpark:

from pyspark.sql.functions import col

data = [("Anand", 34), ("Bala", None), ("Kavitha", 29)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Filter rows where Age is not null
filtered_df = df.filter(col("Age").isNotNull())
filtered_df.show()

Spark SQL (assuming the people table likewise has a null Age for Bala):

SELECT * FROM people WHERE Age IS NOT NULL;

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 34|
|Kavitha| 29|
+-------+---+

Example 6: Filtering Rows Using Regular Expressions

PySpark:

from pyspark.sql.functions import col

# Filter rows whose name matches the regex '^[AK]' (starts with 'A' or 'K')
filtered_df = df.filter(col("Name").rlike("^[AK]"))
filtered_df.show()

Spark SQL:

SELECT * FROM people WHERE Name RLIKE '^[AK]';

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 34|
|Kavitha| 29|
+-------+---+

5. Common Use Cases

  • Filtering data based on specific criteria (e.g., age, salary, etc.).
  • Removing rows with null or unwanted values.
  • Selecting a subset of data for further analysis or processing (a combined sketch follows this list).
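
A minimal sketch combining these use cases: drop rows with null values, keep one department, then aggregate. The employees data, column names, and department values are illustrative:

from pyspark.sql.functions import avg, col

employees = spark.createDataFrame(
    [("Anand", "Sales", 50000), ("Bala", "Sales", None), ("Kavitha", "HR", 65000)],
    ["Name", "Dept", "Salary"],
)

result = (
    employees
    .filter(col("Salary").isNotNull())      # remove rows with null values
    .filter(col("Dept") == "Sales")         # filter on specific criteria
    .agg(avg("Salary").alias("AvgSalary"))  # further analysis on the subset
)
result.show()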

6. Performance Considerations

  • Filtering early in the data processing pipeline significantly reduces the amount of data carried into subsequent steps (and across shuffles), leading to better performance.
  • Spark has no traditional indexes; instead, partition your data on frequently filtered columns and use file formats that support predicate pushdown (such as Parquet) so Spark can skip irrelevant data entirely, as sketched below.
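
For example, when data is written partitioned on a column that later appears in a filter, Spark can prune whole partitions instead of scanning them. A minimal sketch; the output path is hypothetical:

# Write the data partitioned on a commonly filtered column
df.write.partitionBy("Age").parquet("/tmp/people_by_age")

# Filtering on the partition column lets Spark skip every
# partition directory where Age <= 30 (partition pruning)
pruned_df = spark.read.parquet("/tmp/people_by_age").filter("Age > 30")
pruned_df.explain()  # the plan should list the condition under PartitionFilters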

7. Key Takeaways

  • The filter() and where() commands are essential for data manipulation in Spark; where() is an alias of filter(), and both return a new DataFrame containing only the rows that satisfy the condition.
  • The condition can be a simple comparison, a combination of conditions built with the logical operators & (AND), | (OR), and ~ (NOT), or a SQL-like expression string; a short sketch of | and ~ follows this list.
  • You can specify the condition with column objects, column names, or SQL-like strings, making these commands highly flexible.
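
The examples above only demonstrated & for AND. A minimal sketch of the remaining logical operators, | (OR) and ~ (NOT), using the df from Example 1; note that each comparison must be parenthesized because of Python operator precedence:

# OR: rows where Age > 40 or the name starts with 'K'
df.filter((df.Age > 40) | df.Name.startswith("K")).show()

# NOT: rows where the name does not start with 'A'
df.filter(~df.Name.startswith("A")).show()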