Documentation Index
Fetch the complete documentation index at: https://rajanand.org/llms.txt
Use this file to discover all available pages before exploring further.
The filter() or where() command in Spark is used to filter rows from a DataFrame based on a specified condition. Both filter() and where() are interchangeable and can be used to achieve the same result. The primary purpose of these commands is to select a subset of rows that meet a given condition.
1. Syntax
df.filter(condition)
df.where(condition)
2. Parameters
- condition: A boolean expression that specifies the filtering condition. Rows that satisfy this condition will be included in the output.
3. Return Type
- Returns a new DataFrame containing only the rows that satisfy the given condition.
4. Examples
Different ways to filter rows in PySpark DataFrame:
from pyspark.sql.functions import col, column
df1 = df.filter(df.Age > 30) #using dot notation
df2 = df.filter(df["Age"] > 30) #using bracket notation
df3 = df.filter("Age > 30") #using SQL-like syntax
df4 = df.filter(col("Age") > 30) #using column object with col function
df5 = df.filter(column("Age") > 30) #using column object with column function
#you can replace filter function with where function as where is an alias for filter
Example 1: Filtering Rows Based on a Single Condition
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FilterExample").getOrCreate()
data = [("Anand", 34), ("Bala", 45), ("Kavitha", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Using filter()
filtered_df = df.filter(df.Age > 30)
filtered_df.show()
# Using where()
filtered_df = df.where(df.Age > 30)
filtered_df.show()
Spark SQL:
CREATE TABLE people (Name STRING, Age INT);
INSERT INTO people VALUES
("Anand", 34),
("Bala", 45),
("Kavitha", 29);
SELECT * FROM people WHERE Age > 30;
Output:
+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
| Bala| 45|
+-----+---+
Example 2: Filtering Rows Based on Multiple Conditions
PySpark:
# Using filter() with multiple conditions
filtered_df = df.filter((df.Age > 30) & (df.Name == "Anand"))
filtered_df.show()
# Using where() with multiple conditions
filtered_df = df.where((df.Age > 30) & (df.Name == "Anand"))
filtered_df.show()
Spark SQL:
SELECT * FROM people WHERE Age > 30 AND Name = 'Anand';
Output:
+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
+-----+---+
Example 3: Filtering Rows Using SQL-like Syntax
PySpark:
# Using filter() with SQL-like syntax
filtered_df = df.filter("Age > 30 AND Name = 'Anand'")
filtered_df.show()
# Using where() with SQL-like syntax
filtered_df = df.where("Age > 30 AND Name = 'Anand'")
filtered_df.show()
Spark SQL:
SELECT * FROM people WHERE Age > 30 AND Name = 'Anand';
Output:
+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
+-----+---+
Example 4: Filtering Rows Using String Functions
PySpark:
from pyspark.sql.functions import col
# Filter rows where the name starts with 'A'
filtered_df = df.filter(col("Name").startswith("A"))
filtered_df.show()
Spark SQL:
SELECT * FROM people WHERE Name LIKE 'A%';
Output:
+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
+-----+---+
Example 5: Filtering Rows with Null Values
PySpark:
from pyspark.sql.functions import col
data = [("Anand", 34), ("Bala", None), ("Kavitha", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Filter rows where Age is not null
filtered_df = df.filter(col("Age").isNotNull())
filtered_df.show()
Spark SQL:
SELECT * FROM people WHERE Age IS NOT NULL;
Output:
+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
|Kavitha| 29|
+-----+---+
Example 6: Filtering Rows Using Regular Expressions
PySpark:
from pyspark.sql.functions import col
# Filter rows where the name contains 'a'
filtered_df = df.filter(col("Name").rlike("a"))
filtered_df.show()
Spark SQL:
SELECT * FROM people WHERE Name RLIKE 'a';
Output:
+-----+---+
| Name|Age|
+-----+---+
|Anand| 34|
|Kavitha| 29|
+-----+---+
5. Common Use Cases
- Filtering data based on specific criteria (e.g., age, salary, etc.).
- Removing rows with null or unwanted values.
- Selecting a subset of data for further analysis or processing.
- Filtering early in the data processing pipeline can significantly reduce the amount of data that needs to be processed in subsequent steps, leading to better performance.
- Use appropriate indexing and partitioning strategies to optimize filter operations on large datasets.
7. Key Takeaways
- The
filter() and where() commands are essential for data manipulation in Spark, allowing you to select specific rows based on conditions.
- These commands are highly flexible and can be used with a variety of conditions, including simple comparisons, logical operations, and SQL-like expressions.
- Both
filter() and where() are used to filter rows based on a condition.
- The condition can be a simple comparison, a combination of conditions using logical operators (
&, |, ~), or even SQL-like expressions.
- The result is a new DataFrame containing only the rows that satisfy the condition.
- You can use column objects, column names, or SQL-like strings to specify the condition.