Spark: distinct() function
The distinct() method in Spark removes duplicate rows from a DataFrame, returning a new DataFrame that contains only the unique rows. It compares rows across all columns; to deduplicate on a subset of columns, use dropDuplicates() instead. This is particularly useful when you need to eliminate redundant records and ensure that every row in your dataset is unique.
1. Syntax
PySpark: df.distinct()
Spark SQL: SELECT DISTINCT * FROM table_name
2. Parameters
- The distinct() method takes no parameters; it always compares rows across all columns of the DataFrame.
3. Return Type
- Returns a new DataFrame with duplicate rows removed.
4. Examples
Example 1: Removing Duplicate Rows from a DataFrame
Example 2: Removing Duplicates Based on a Subset of Columns
Example 3: Counting Distinct Rows
Example 4: Removing Duplicates with Null Values
Example 5: Using dropDuplicates() for a Subset of Columns
Example 6: Counting Distinct Values in a Column
Example 7: Removing Duplicates with Complex Data
5. Common Use Cases
- Cleaning datasets by removing duplicate records.
- Ensuring data integrity by enforcing uniqueness.
- Preparing data for aggregation or analysis.
6. Performance Considerations
- Use distinct() judiciously on large datasets: it requires shuffling data across the cluster to compare rows, which can be expensive.
- Consider using dropDuplicates() if you only need to remove duplicates based on a subset of columns, as comparing fewer columns can be more efficient.
7. Key Takeaways
- The distinct() method removes duplicate rows from a DataFrame, comparing all columns.
- To deduplicate based on a subset of columns, use dropDuplicates() instead.
- In Spark SQL, the same functionality is achieved with SELECT DISTINCT.
- Works efficiently on large datasets when combined with proper partitioning and caching.