Spark: filter or where function

The filter() or where() command in Spark is used to filter rows from a DataFrame based on a specified condition. Both filter() and where() are interchangeable and can be used to achieve the same result. The primary purpose of these commands is to select a subset of rows that meet a given condition.

1. Syntax

df.filter(condition)
df.where(condition)

2. Parameters

condition: A boolean expression that specifies the filtering condition. Rows that satisfy this condition will be included in the output.

3. Return Type

Returns a new DataFrame containing only the rows that satisfy the given condition.

4. Examples

Different ways to filter rows in PySpark DataFrame:

from pyspark.sql.functions import col, column

df1 = df.filter(df.Age > 30)  #using dot notation
df2 = df.filter(df["Age"] > 30)  #using bracket notation
df3 = df.filter("Age > 30")  #using SQL-like syntax
df4 = df.filter(col("Age") > 30)  #using column object with col function
df5 = df.filter(column("Age") > 30)  #using column object with column function
#you can replace filter function with where function as where is an alias for filter

Example 1: Filtering Rows Based on a Single Condition

PySpark:

PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FilterExample").getOrCreate()

data = [("Anand", 34), ("Bala", 45), ("Kavitha", 29)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Using filter()
filtered_df = df.filter(df.Age > 30)
filtered_df.show()

# Using where()
filtered_df = df.where(df.Age > 30)
filtered_df.show()

Spark SQL:

CREATE TABLE people (Name STRING, Age INT);

INSERT INTO people VALUES
("Anand", 34),
("Bala", 45),
("Kavitha", 29);

SELECT * FROM people WHERE Age > 30;

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
| Bala| 45|
+-----+---+

Example 2: Filtering Rows Based on Multiple Conditions

PySpark:

# Using filter() with multiple conditions
filtered_df = df.filter((df.Age > 30) & (df.Name == "Anand"))
filtered_df.show()

# Using where() with multiple conditions
filtered_df = df.where((df.Age > 30) & (df.Name == "Anand"))
filtered_df.show()

Spark SQL:

SELECT * FROM people WHERE Age > 30 AND Name = 'Anand';

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
+-----+---+

Example 3: Filtering Rows Using SQL-like Syntax

PySpark:

# Using filter() with SQL-like syntax
filtered_df = df.filter("Age > 30 AND Name = 'Anand'")
filtered_df.show()

# Using where() with SQL-like syntax
filtered_df = df.where("Age > 30 AND Name = 'Anand'")
filtered_df.show()

Spark SQL:

SELECT * FROM people WHERE Age > 30 AND Name = 'Anand';

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
+-----+---+

Example 4: Filtering Rows Using String Functions

PySpark:

from pyspark.sql.functions import col

# Filter rows where the name starts with 'A'
filtered_df = df.filter(col("Name").startswith("A"))
filtered_df.show()

Spark SQL:

SELECT * FROM people WHERE Name LIKE 'A%';

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
+-----+---+

Example 5: Filtering Rows with Null Values

PySpark:

from pyspark.sql.functions import col

data = [("Anand", 34), ("Bala", None), ("Kavitha", 29)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Filter rows where Age is not null
filtered_df = df.filter(col("Age").isNotNull())
filtered_df.show()

Spark SQL:

SELECT * FROM people WHERE Age IS NOT NULL;

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
|Kavitha| 29|
+-----+---+

Example 6: Filtering Rows Using Regular Expressions

PySpark:

from pyspark.sql.functions import col

# Filter rows where the name contains 'a'
filtered_df = df.filter(col("Name").rlike("a"))
filtered_df.show()

Spark SQL:

SELECT * FROM people WHERE Name RLIKE 'a';

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
|Kavitha| 29|
+-----+---+

5. Common Use Cases

Filtering data based on specific criteria (e.g., age, salary, etc.).
Removing rows with null or unwanted values.
Selecting a subset of data for further analysis or processing.

6. Performance Considerations

Filtering early in the data processing pipeline can significantly reduce the amount of data that needs to be processed in subsequent steps, leading to better performance.
Use appropriate indexing and partitioning strategies to optimize filter operations on large datasets.

7. Key Takeaways

The filter() and where() commands are essential for data manipulation in Spark, allowing you to select specific rows based on conditions.
These commands are highly flexible and can be used with a variety of conditions, including simple comparisons, logical operations, and SQL-like expressions.
Both filter() and where() are used to filter rows based on a condition.
The condition can be a simple comparison, a combination of conditions using logical operators (&, |, ~), or even SQL-like expressions.
The result is a new DataFrame containing only the rows that satisfy the condition.
You can use column objects, column names, or SQL-like strings to specify the condition.

​1. Syntax

​2. Parameters

​3. Return Type

​4. Examples

​Example 1: Filtering Rows Based on a Single Condition

​Example 2: Filtering Rows Based on Multiple Conditions

​Example 3: Filtering Rows Using SQL-like Syntax

​Example 4: Filtering Rows Using String Functions

​Example 5: Filtering Rows with Null Values

​Example 6: Filtering Rows Using Regular Expressions

​5. Common Use Cases

​6. Performance Considerations

​7. Key Takeaways

1. Syntax

2. Parameters

3. Return Type

4. Examples

Example 1: Filtering Rows Based on a Single Condition

Example 2: Filtering Rows Based on Multiple Conditions

Example 3: Filtering Rows Using SQL-like Syntax

Example 4: Filtering Rows Using String Functions

Example 5: Filtering Rows with Null Values

Example 6: Filtering Rows Using Regular Expressions

5. Common Use Cases

6. Performance Considerations

7. Key Takeaways