Reference
Spark: filter or where function
The filter() or where() command in Spark is used to filter rows from a DataFrame based on a specified condition. The two are interchangeable: where() is an alias for filter(), provided so that code reads more like SQL, and both produce the same result. Their primary purpose is to select the subset of rows that meet a given condition.
1. Syntax
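As a minimal sketch (assuming an existing DataFrame named df with an age column), both commands are called on a DataFrame and take the condition as their only argument:

```python
# DataFrame.filter(condition) / DataFrame.where(condition)
# condition may be a Column boolean expression or a SQL expression string
filtered_df = df.filter(df.age > 30)   # Column expression
filtered_df = df.where("age > 30")     # SQL expression string
```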
2. Parameters
- condition: A boolean expression (a Column expression or a SQL expression string) that specifies the filtering condition. Rows for which the condition evaluates to true are included in the output.
3. Return Type
- Returns a new DataFrame containing only the rows that satisfy the given condition.
4. Examples
Example 1: Filtering Rows Based on a Single Condition
PySpark:
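A minimal PySpark sketch, assuming a SparkSession and a small, hypothetical DataFrame of people (the same df and column names are reused in the sketches for the later examples):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-where-examples").getOrCreate()

# Hypothetical sample data used throughout the examples below
data = [
    ("Alice", 34, "HR"),
    ("Bob", 28, "Engineering"),
    ("Cathy", 45, "Engineering"),
    ("Dan", None, "Sales"),
]
df = spark.createDataFrame(data, ["name", "age", "department"])

# Keep only the rows where age is greater than 30
df.filter(df.age > 30).show()
```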
Spark SQL:
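The equivalent Spark SQL, run through spark.sql() against a temporary view registered from the hypothetical df above (the people view is reused in the later SQL sketches):

```python
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")

spark.sql("SELECT * FROM people WHERE age > 30").show()
```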
Output: a new DataFrame containing only the rows that satisfy the condition (in the sketches above, the people older than 30).
Example 2: Filtering Rows Based on Multiple Conditions
PySpark:
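A sketch combining predicates with the logical operators & (and), | (or), and ~ (not), reusing the hypothetical df from Example 1; note that each predicate must be wrapped in parentheses:

```python
from pyspark.sql.functions import col

# Rows where age is greater than 30 AND the department is Engineering
df.filter((col("age") > 30) & (col("department") == "Engineering")).show()

# Rows where age is less than 30 OR the department is Sales
df.where((col("age") < 30) | (col("department") == "Sales")).show()
```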
Spark SQL:
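The same conditions expressed in Spark SQL against the hypothetical people view:

```python
spark.sql("""
    SELECT * FROM people
    WHERE age > 30 AND department = 'Engineering'
""").show()

spark.sql("""
    SELECT * FROM people
    WHERE age < 30 OR department = 'Sales'
""").show()
```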
Output: only the rows for which the combined condition evaluates to true.
Example 3: Filtering Rows Using SQL-like Syntax
PySpark:
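Instead of Column expressions, the condition can be passed as a single SQL expression string; a sketch using the same hypothetical df:

```python
# The whole condition is a SQL expression string
df.filter("age > 30 AND department = 'Engineering'").show()

# BETWEEN and other SQL predicates work the same way
df.where("age BETWEEN 25 AND 40").show()
```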
Spark SQL:
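The corresponding standalone Spark SQL query on the people view:

```python
spark.sql("SELECT * FROM people WHERE age BETWEEN 25 AND 40").show()
```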
Output: the same filtered result as with Column expressions; the SQL-like string is simply an alternative way to write the condition.
Example 4: Filtering Rows Using String Functions
PySpark:
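A sketch using Column string predicates such as startswith() and contains() on the hypothetical df:

```python
from pyspark.sql.functions import col

# Rows whose name starts with the letter 'A'
df.filter(col("name").startswith("A")).show()

# Rows whose department contains the substring 'Eng'
df.where(col("department").contains("Eng")).show()
```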
Spark SQL:
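Equivalent Spark SQL, using LIKE patterns on the people view:

```python
spark.sql("SELECT * FROM people WHERE name LIKE 'A%'").show()
spark.sql("SELECT * FROM people WHERE department LIKE '%Eng%'").show()
```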
Output: only the rows whose string column matches the predicate (here, names starting with 'A' or departments containing 'Eng').
Example 5: Filtering Rows with Null Values
PySpark:
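A sketch using isNull() and isNotNull() on the hypothetical df, which contains one row with a null age:

```python
from pyspark.sql.functions import col

# Rows where age is null
df.filter(col("age").isNull()).show()

# Rows where age is not null
df.where(col("age").isNotNull()).show()
```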
Spark SQL:
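The Spark SQL equivalents, using IS NULL / IS NOT NULL on the people view:

```python
spark.sql("SELECT * FROM people WHERE age IS NULL").show()
spark.sql("SELECT * FROM people WHERE age IS NOT NULL").show()
```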
Output: the rows with (or without) a null in the chosen column, depending on which predicate is used.
Example 6: Filtering Rows Using Regular Expressions
PySpark:
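A sketch using rlike(), which matches a column against a Java-style regular expression, on the hypothetical df:

```python
from pyspark.sql.functions import col

# Rows whose name starts with 'A' or 'B'
df.filter(col("name").rlike("^[AB]")).show()
```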
Spark SQL:
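The Spark SQL equivalent, using the RLIKE operator on the people view:

```python
spark.sql("SELECT * FROM people WHERE name RLIKE '^[AB]'").show()
```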
Output: only the rows whose column value matches the regular expression.
5. Common Use Cases
- Filtering data based on specific criteria (e.g., age or salary thresholds).
- Removing rows with null or unwanted values.
- Selecting a subset of data for further analysis or processing.
6. Performance Considerations
- Filtering early in the data processing pipeline can significantly reduce the amount of data that needs to be processed in subsequent steps, leading to better performance (see the sketch after this list).
- Use appropriate partitioning and storage strategies (for example, partitioning on frequently filtered columns and using file formats such as Parquet that support predicate pushdown) to optimize filter operations on large datasets.
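A small sketch of the "filter early" idea, reusing the hypothetical df from the examples: applying the filter before a wider operation such as a join or aggregation means the later steps only see the matching rows.

```python
# Filter first, then aggregate: only the matching rows reach the groupBy
(
    df.filter(df.department == "Engineering")
      .groupBy("department")
      .count()
      .show()
)
```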
7. Key Takeaways
- The filter() and where() commands are essential for data manipulation in Spark, allowing you to select specific rows based on conditions.
- These commands are highly flexible and can be used with a variety of conditions, including simple comparisons, logical operations, and SQL-like expressions.
- Both filter() and where() are used to filter rows based on a condition.
- The condition can be a simple comparison, a combination of conditions using the logical operators & (and), | (or), and ~ (not), or even a SQL-like expression.
- The result is a new DataFrame containing only the rows that satisfy the condition.
- You can use column objects, column names, or SQL-like strings to specify the condition.