Reference
Spark: dropDuplicates function
The `dropDuplicates()` method in Spark removes duplicate rows from a DataFrame. It is similar to `distinct()` but provides more flexibility by allowing you to specify a subset of columns to consider when identifying duplicates. This is particularly useful when you want to remove duplicates based on specific columns rather than the entire row.
1. Syntax
PySpark:
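The original snippet is missing here; a minimal sketch of the PySpark signature (`df` stands for any DataFrame):

```python
# DataFrame.dropDuplicates(subset=None) -> DataFrame
df.dropDuplicates()                    # compare all columns
df.dropDuplicates(["col1", "col2"])    # compare only the listed columns
```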
Spark SQL:
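Spark SQL has no `dropDuplicates` keyword; the closest equivalents, sketched below, are `SELECT DISTINCT` for full-row deduplication and a `ROW_NUMBER()` window for a subset of columns (`table_name` and the column names are placeholders):

```sql
-- Full-row deduplication
SELECT DISTINCT * FROM table_name;

-- Subset-based deduplication (keep one row per key)
SELECT col1, col2, col3
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col3) AS rn
  FROM table_name
)
WHERE rn = 1;
```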
2. Parameters
- `subset`: A list of column names (as strings) to consider when identifying duplicates. If `None`, all columns are considered.
3. Return Type
- Returns a new DataFrame with duplicate rows removed.
4. Examples
Example 1: Removing Duplicate Rows from a DataFrame
PySpark:
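The original code did not survive extraction; a minimal reconstruction with illustrative data (the names, ages, and the `people` view are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropDuplicates-examples").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Alice", 25), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])
df.createOrReplaceTempView("people")

# With no arguments, every column is compared, so only the exact
# duplicate ("Alice", 25) is removed.
df.dropDuplicates().show()
```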
Spark SQL:
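The same deduplication over the `people` view registered above:

```sql
SELECT DISTINCT * FROM people;
```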
Output:
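With the illustrative data above, both versions return (row order may vary):

```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
```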
Example 2: Removing Duplicates Based on a Subset of Columns
PySpark:
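A sketch reusing the `spark` session from Example 1; the three-column data is illustrative:

```python
data = [("Alice", 25, "NY"), ("Alice", 25, "LA"), ("Bob", 30, "NY")]
df = spark.createDataFrame(data, ["name", "age", "city"])
df.createOrReplaceTempView("people")

# Only name and age are compared; which of the two Alice rows
# survives is arbitrary.
df.dropDuplicates(["name", "age"]).show()
```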
Spark SQL:
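`SELECT DISTINCT` cannot restrict the comparison to a subset of columns while keeping the rest, so the usual Spark SQL substitute is a `ROW_NUMBER()` window:

```sql
SELECT name, age, city
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY name, age ORDER BY city) AS rn
  FROM people
)
WHERE rn = 1;
```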
Output:
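One of the two Alice rows survives. For `dropDuplicates()` the choice is arbitrary; the SQL version deterministically keeps LA because of `ORDER BY city`:

```
+-----+---+----+
| name|age|city|
+-----+---+----+
|Alice| 25|  LA|
|  Bob| 30|  NY|
+-----+---+----+
```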
Example 3: Removing Duplicates with Null Values
PySpark:
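A sketch with illustrative data containing nulls. Note that `dropDuplicates()` treats nulls as equal to each other, so identical rows containing nulls still collapse:

```python
data = [("Alice", None), ("Alice", None), ("Bob", 30)]
df = spark.createDataFrame(data, ["name", "age"])
df.createOrReplaceTempView("people")

# The two identical rows with a null age count as duplicates.
df.dropDuplicates().show()
```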
Spark SQL:
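`SELECT DISTINCT` likewise treats nulls as not distinct from each other:

```sql
SELECT DISTINCT * FROM people;
```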
Output:
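With the data above (Spark 3.4+ prints `NULL` instead of `null`):

```
+-----+----+
| name| age|
+-----+----+
|Alice|null|
|  Bob|  30|
+-----+----+
```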
Example 4: Removing Duplicates Based on a Single Column
PySpark:
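A sketch, again with illustrative data:

```python
data = [("Alice", 25), ("Alice", 30), ("Bob", 30)]
df = spark.createDataFrame(data, ["name", "age"])
df.createOrReplaceTempView("people")

# Only name is compared; one arbitrary row per name survives.
df.dropDuplicates(["name"]).show()
```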
Spark SQL:
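The single-column case uses the same `ROW_NUMBER()` pattern:

```sql
SELECT name, age
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY age) AS rn
  FROM people
)
WHERE rn = 1;
```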
Output:
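Which Alice row survives is arbitrary for `dropDuplicates()`; the SQL version keeps the lowest age:

```
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+
```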
Example 5: Removing Duplicates with Complex Data
PySpark:
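The original example is missing; one reading of "complex data" is columns holding nested types such as arrays, which `dropDuplicates()` compares element by element. A sketch (note that map-type columns cannot be deduplicated this way):

```python
data = [
    ("Alice", ["reading", "chess"]),
    ("Alice", ["reading", "chess"]),
    ("Bob", ["golf"]),
]
df = spark.createDataFrame(data, ["name", "hobbies"])
df.createOrReplaceTempView("people")

# Exact duplicate rows collapse even when a column holds an array.
df.dropDuplicates().show(truncate=False)
```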
Spark SQL:
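`SELECT DISTINCT` also compares array columns by value:

```sql
SELECT DISTINCT * FROM people;
```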
Output:
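With the data above (`truncate=False` left-aligns the columns):

```
+-----+----------------+
|name |hobbies         |
+-----+----------------+
|Alice|[reading, chess]|
|Bob  |[golf]          |
+-----+----------------+
```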
Example 6: Removing Duplicates with Null Values in Subset Columns
PySpark:
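A sketch showing that nulls in the subset columns also compare as equal, so both Alice rows below count as duplicates despite their differing cities:

```python
data = [("Alice", None, "NY"), ("Alice", None, "LA"), ("Bob", 30, "NY")]
df = spark.createDataFrame(data, ["name", "age", "city"])
df.createOrReplaceTempView("people")

df.dropDuplicates(["name", "age"]).show()
```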
Spark SQL:
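Window partitioning groups null keys together in the same way:

```sql
SELECT name, age, city
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY name, age ORDER BY city) AS rn
  FROM people
)
WHERE rn = 1;
```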
Output:
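Which Alice row survives is arbitrary for `dropDuplicates()`; the SQL version keeps LA:

```
+-----+----+----+
| name| age|city|
+-----+----+----+
|Alice|null|  LA|
|  Bob|  30|  NY|
+-----+----+----+
```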
Example 7: Removing Duplicates with Custom Logic
PySpark:
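The original example is missing; a common reading of "custom logic" is keeping the most recent record per key, which `dropDuplicates()` alone cannot guarantee. A sketch using a window function (the `updated_at` and `score` columns are assumptions):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

data = [
    ("Alice", "2024-01-01", 100),
    ("Alice", "2024-02-01", 120),
    ("Bob", "2024-01-15", 90),
]
df = spark.createDataFrame(data, ["name", "updated_at", "score"])
df.createOrReplaceTempView("people")

# dropDuplicates(["name"]) would keep an arbitrary row per name.
# Ranking rows per name by updated_at keeps the newest one instead.
w = Window.partitionBy("name").orderBy(F.col("updated_at").desc())
(df.withColumn("rn", F.row_number().over(w))
   .filter(F.col("rn") == 1)
   .drop("rn")
   .show())
```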
Spark SQL:
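The same "keep the newest" rule expressed with a descending window:

```sql
SELECT name, updated_at, score
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY updated_at DESC) AS rn
  FROM people
)
WHERE rn = 1;
```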
Output:
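With the illustrative data above, the newest row per name is kept:

```
+-----+----------+-----+
| name|updated_at|score|
+-----+----------+-----+
|Alice|2024-02-01|  120|
|  Bob|2024-01-15|   90|
+-----+----------+-----+
```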
5. Common Use Cases
- Removing duplicate records from transactional data (e.g., logs, events).
- Ensuring uniqueness in master data (e.g., customer, product data).
- Preparing data for machine learning by removing redundant samples.
6. Performance Considerations
- Use `dropDuplicates()` judiciously on large datasets, as it requires a full shuffle to group candidate duplicates.
- Specify a subset of columns to reduce the number of comparisons and improve performance.
- Partition the data appropriately (for example, by the deduplication keys) to optimize duplicate removal.
7. Key Takeaways
- The `dropDuplicates()` method removes duplicate rows from a DataFrame.
- It allows you to specify a subset of columns to consider when identifying duplicates.
- Removing duplicates can be resource-intensive on large datasets because it triggers a shuffle.
- In Spark SQL, full-row deduplication can be achieved with `SELECT DISTINCT`; subset-based deduplication typically uses a `ROW_NUMBER()` window.
- It works efficiently on large datasets when combined with proper partitioning and caching.