The dropDuplicates() command in Spark removes duplicate rows from a DataFrame. It is similar to the distinct() command but provides more flexibility by allowing you to specify a subset of columns to consider when identifying duplicates. This is particularly useful when you want to remove duplicates based on specific columns rather than the entire row.
1. Syntax
PySpark:
df.dropDuplicates(subset=None)
Spark SQL:
SELECT DISTINCT col1, col2, ..., colN FROM table_name;
2. Parameters
- subset: A list of column names (as strings) to consider when identifying duplicates. If None, all columns are considered.
3. Return Type
- Returns a new DataFrame with duplicate rows removed.
4. Examples
Example 1: Removing Duplicate Rows from a DataFrame
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DropDuplicatesExample").getOrCreate()
# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Anand", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Remove duplicate rows
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT * FROM people;
Output:
+-------+---+
| Name|Age|
+-------+---+
| Anand| 25|
| Bala| 30|
|Kavitha| 28|
+-------+---+
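To run the Spark SQL statements shown alongside these examples, the DataFrame must first be registered as a temporary view. A minimal sketch (the view name people matches the queries in this section; Example 5 assumes an analogous employees view):
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT DISTINCT * FROM people").show()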
Example 2: Removing Duplicates Based on a Subset of Columns
PySpark:
# Add a 'Gender' column
data = [("Anand", 25, "M"), ("Bala", 30, "F"), ("Kavitha", 28, "F"), ("Anand", 25, "M")]
columns = ["Name", "Age", "Gender"]
df = spark.createDataFrame(data, columns)
# Remove duplicates based on 'Name' and 'Age'
df_no_duplicates = df.dropDuplicates(["Name", "Age"])
df_no_duplicates.show()
Spark SQL (approximate equivalent; SELECT DISTINCT returns only the listed columns, while dropDuplicates() keeps all columns):
SELECT DISTINCT Name, Age FROM people;
Output:
+-------+---+------+
| Name|Age|Gender|
+-------+---+------+
| Anand| 25| M|
| Bala| 30| F|
|Kavitha| 28| F|
+-------+---+------+
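Note that when deduplicating on a subset of columns, Spark keeps an arbitrary row from each group of duplicates, so the Gender value retained above is not guaranteed. If you need deterministic control over which row survives, one common alternative (not part of dropDuplicates() itself) is a window function; a sketch, where the ordering column is chosen purely for illustration:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each (Name, Age) group and keep the first one;
# orderBy defines which duplicate survives (here: alphabetical Gender)
w = Window.partitionBy("Name", "Age").orderBy(col("Gender"))
df_deterministic = df.withColumn("rn", row_number().over(w)) \
    .filter(col("rn") == 1) \
    .drop("rn")
df_deterministic.show()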
Example 3: Removing Duplicates with Null Values
PySpark:
# Add a row with a null value
data = [("Anand", 25), ("Bala", None), ("Kavitha", 28), ("Anand", 25), ("Bala", None)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Remove duplicate rows (nulls are treated as equal when comparing)
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT * FROM people;
Output:
+-------+----+
| Name| Age|
+-------+----+
| Anand| 25|
| Bala|null|
|Kavitha| 28|
+-------+----+
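As the output shows, dropDuplicates() treats nulls as equal when comparing rows, so the two ("Bala", None) rows collapse into one. If you instead want to discard rows containing nulls before deduplicating, you can chain dropna(); a minimal sketch:
# Drop rows containing any null, then remove duplicates;
# "Bala" disappears entirely because its Age is null
df_clean = df.dropna().dropDuplicates()
df_clean.show()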
Example 4: Removing Duplicates Based on a Single Column
PySpark:
# Remove duplicates based on the 'Name' column (using the DataFrame from Example 3)
df_no_duplicates = df.dropDuplicates(["Name"])
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT Name FROM people;
Output:
+-------+----+
|   Name| Age|
+-------+----+
|  Anand|  25|
|   Bala|null|
|Kavitha|  28|
+-------+----+
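Note that dropDuplicates(["Name"]) keeps one full row per distinct Name, including the Age column, whereas the SQL query returns only Name. To reproduce the SQL result exactly in PySpark, select the column before deduplicating:
# Equivalent of SELECT DISTINCT Name FROM people
df.select("Name").distinct().show()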
Example 5: Removing Duplicates with Complex Data
PySpark:
# Create DataFrame with complex data
data = [("Anand", 25, "M", "Sales"),
("Bala", 30, "F", "HR"),
("Kavitha", 28, "F", "HR"),
("Anand", 25, "M", "Sales"),
("Bala", 30, "F", "HR")]
columns = ["Name", "Age", "Gender", "Department"]
df = spark.createDataFrame(data, columns)
# Remove duplicates based on all columns
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT * FROM employees;
Output:
+-------+---+------+----------+
| Name|Age|Gender|Department|
+-------+---+------+----------+
| Anand| 25| M| Sales|
| Bala| 30| F| HR|
|Kavitha| 28| F| HR|
+-------+---+------+----------+
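With no subset argument, dropDuplicates() compares entire rows and is equivalent to distinct():
# Produces the same result as df.dropDuplicates()
df.distinct().show()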
Example 6: Removing Duplicates with Null Values in Subset Columns
PySpark:
# Add a row with a null value in the 'Age' column
data = [("Anand", 25, "M"), ("Bala", None, "F"), ("Kavitha", 28, "F"), ("Anand", 25, "M"), ("Bala", None, "F")]
columns = ["Name", "Age", "Gender"]
df = spark.createDataFrame(data, columns)
# Remove duplicates based on 'Name' and 'Gender'
df_no_duplicates = df.dropDuplicates(["Name", "Gender"])
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT Name, Gender FROM people;
Output:
+-------+----+------+
| Name| Age|Gender|
+-------+----+------+
| Anand| 25| M|
| Bala|null| F|
|Kavitha| 28| F|
+-------+----+------+
Example 7: Removing Duplicates with Custom Logic
PySpark:
from pyspark.sql.functions import col
# Using the DataFrame from Example 5: distinct names of people older than 25
df_no_duplicates = df.filter(col("Age") > 25).select("Name").dropDuplicates()
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT Name
FROM people
WHERE Age > 25;
Output:
+-------+
| Name|
+-------+
| Bala|
|Kavitha|
+-------+
5. Common Use Cases
- Removing duplicate records from transactional data (e.g., logs, events); see the streaming sketch after this list.
- Ensuring uniqueness in master data (e.g., customer, product data).
- Preparing data for machine learning by removing redundant samples.
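For the log/event use case in a Structured Streaming job, dropDuplicates() is typically combined with a watermark so Spark can bound the deduplication state it keeps. A sketch, where the Kafka server address and topic are assumptions:
# Read an event stream (the key and timestamp columns come from the Kafka source)
events = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "host:9092") \
    .option("subscribe", "events") \
    .load()
# Keep 10 minutes of state; duplicates seen within that window are dropped
deduped = events.withWatermark("timestamp", "10 minutes") \
    .dropDuplicates(["key", "timestamp"])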
6. Performance Considerations
- Use dropDuplicates() judiciously on large datasets, as it shuffles data across the cluster.
- Specify a subset of columns to reduce the amount of data compared and improve performance.
- Use proper partitioning and caching to optimize duplicate removal, as sketched below.
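A sketch of the caching pattern, assuming the deduplicated result is reused by several downstream actions:
# Cache so the dedup shuffle runs once, not once per downstream action
df_dedup = df.dropDuplicates(["Name"]).cache()
df_dedup.count()  # materializes the cache
df_dedup.show()   # served from the cache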
7. Key Takeaways
- The dropDuplicates() command is used to remove duplicate rows from a DataFrame.
- It allows you to specify a subset of columns to consider when identifying duplicates.
- Removing duplicates can be resource-intensive for large datasets, as it shuffles data across the cluster.
- In Spark SQL, similar functionality can be achieved using SELECT DISTINCT.
- Performance on large datasets improves when combined with proper partitioning and caching.