The dropDuplicates() command in Spark removes duplicate rows from a DataFrame. It is similar to the distinct() command but provides more flexibility by allowing you to specify a subset of columns to consider when identifying duplicates. This is particularly useful when you want to remove duplicates based on specific columns rather than the entire row.
1. Syntax
PySpark:
df.dropDuplicates(subset=None)
Spark SQL:
SELECT DISTINCT col1, col2, ..., colN FROM table_name;
2. Parameters
- subset: A list of column names (as strings) to consider when identifying duplicates. If None, all columns are considered.
3. Return Type
- Returns a new DataFrame with duplicate rows removed.
4. Examples
Example 1: Removing Duplicate Rows from a DataFrame
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DropDuplicatesExample").getOrCreate()
# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Anand", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Remove duplicate rows
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT * FROM people;
Output:
+-------+---+
| Name|Age|
+-------+---+
| Anand| 25|
| Bala| 30|
|Kavitha| 28|
+-------+---+
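To run the Spark SQL statements shown alongside these examples, the DataFrame must first be registered as a temporary view. A minimal sketch (the view name people matches the queries in this section; Example 5 assumes an analogous employees view):
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT DISTINCT * FROM people").show()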
Example 2: Removing Duplicates Based on a Subset of Columns
PySpark:
# Add a 'Gender' column
data = [("Anand", 25, "M"), ("Bala", 30, "F"), ("Kavitha", 28, "F"), ("Anand", 25, "M")]
columns = ["Name", "Age", "Gender"]
df = spark.createDataFrame(data, columns)
# Remove duplicates based on 'Name' and 'Age'
df_no_duplicates = df.dropDuplicates(["Name", "Age"])
df_no_duplicates.show()
Spark SQL (approximate equivalent; SELECT DISTINCT returns only the listed columns, while dropDuplicates() keeps all columns):
SELECT DISTINCT Name, Age FROM people;
Output:
+-------+---+------+
| Name|Age|Gender|
+-------+---+------+
| Anand| 25| M|
| Bala| 30| F|
|Kavitha| 28| F|
+-------+---+------+
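Note that when deduplicating on a subset of columns, Spark keeps an arbitrary row from each group of duplicates, so the Gender value retained above is not guaranteed. If you need deterministic control over which row survives, one common alternative (not part of dropDuplicates() itself) is a window function; a sketch, where the ordering column is chosen purely for illustration:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each (Name, Age) group and keep the first one;
# orderBy defines which duplicate survives (here: alphabetical Gender)
w = Window.partitionBy("Name", "Age").orderBy(col("Gender"))
df_deterministic = df.withColumn("rn", row_number().over(w)) \
    .filter(col("rn") == 1) \
    .drop("rn")
df_deterministic.show()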
Example 3: Removing Duplicates with Null Values
PySpark:
# Add a row with a null value
data = [("Anand", 25), ("Bala", None), ("Kavitha", 28), ("Anand", 25), ("Bala", None)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Remove duplicate rows (nulls are treated as equal when comparing)
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT * FROM people;
Output:
+-------+----+
| Name| Age|
+-------+----+
| Anand| 25|
| Bala|null|
|Kavitha| 28|
+-------+----+
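As the output shows, dropDuplicates() treats nulls as equal when comparing rows, so the two ("Bala", None) rows collapse into one. If you instead want to discard rows containing nulls before deduplicating, you can chain dropna(); a minimal sketch:
# Drop rows containing any null, then remove duplicates;
# "Bala" disappears entirely because its Age is null
df_clean = df.dropna().dropDuplicates()
df_clean.show()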
Example 4: Removing Duplicates Based on a Single Column
PySpark:
# Remove duplicates based on the 'Name' column (using the DataFrame from Example 3)
df_no_duplicates = df.dropDuplicates(["Name"])
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT Name FROM people;
Output:
+-------+----+
|   Name| Age|
+-------+----+
|  Anand|  25|
|   Bala|null|
|Kavitha|  28|
+-------+----+
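Note that dropDuplicates(["Name"]) keeps one full row per distinct Name, including the Age column, whereas the SQL query returns only Name. To reproduce the SQL result exactly in PySpark, select the column before deduplicating:
# Equivalent of SELECT DISTINCT Name FROM people
df.select("Name").distinct().show()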
Example 5: Removing Duplicates with Complex Data
PySpark:
# Create DataFrame with complex data
data = [("Anand", 25, "M", "Sales"),
("Bala", 30, "F", "HR"),
("Kavitha", 28, "F", "HR"),
("Anand", 25, "M", "Sales"),
("Bala", 30, "F", "HR")]
columns = ["Name", "Age", "Gender", "Department"]
df = spark.createDataFrame(data, columns)
# Remove duplicates based on all columns
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT * FROM employees;
Output:
+-------+---+------+----------+
| Name|Age|Gender|Department|
+-------+---+------+----------+
| Anand| 25| M| Sales|
| Bala| 30| F| HR|
|Kavitha| 28| F| HR|
+-------+---+------+----------+
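With no subset argument, dropDuplicates() compares entire rows and is equivalent to distinct():
# Produces the same result as df.dropDuplicates()
df.distinct().show()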
Example 6: Removing Duplicates with Null Values in Subset Columns
PySpark:
# Add a row with a null value in the 'Age' column
data = [("Anand", 25, "M"), ("Bala", None, "F"), ("Kavitha", 28, "F"), ("Anand", 25, "M"), ("Bala", None, "F")]
columns = ["Name", "Age", "Gender"]
df = spark.createDataFrame(data, columns)
# Remove duplicates based on 'Name' and 'Gender'
df_no_duplicates = df.dropDuplicates(["Name", "Gender"])
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT Name, Gender FROM people;
Output:
+-------+----+------+
| Name| Age|Gender|
+-------+----+------+
| Anand| 25| M|
| Bala|null| F|
|Kavitha| 28| F|
+-------+----+------+
Example 7: Removing Duplicates with Custom Logic
PySpark:
from pyspark.sql.functions import col
# Using the DataFrame from Example 5: distinct names of people older than 25
df_no_duplicates = df.filter(col("Age") > 25).select("Name").dropDuplicates()
df_no_duplicates.show()
Spark SQL:
SELECT DISTINCT Name
FROM people
WHERE Age > 25;
Output:
+-------+
| Name|
+-------+
| Bala|
|Kavitha|
+-------+
5. Common Use Cases
- Removing duplicate records from transactional data (e.g., logs, events); see the streaming sketch after this list.
- Ensuring uniqueness in master data (e.g., customer, product data).
- Preparing data for machine learning by removing redundant samples.
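For the log/event use case in a Structured Streaming job, dropDuplicates() is typically combined with a watermark so Spark can bound the deduplication state it keeps. A sketch, where the Kafka server address and topic are assumptions:
# Read an event stream (the key and timestamp columns come from the Kafka source)
events = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "host:9092") \
    .option("subscribe", "events") \
    .load()
# Keep 10 minutes of state; duplicates seen within that window are dropped
deduped = events.withWatermark("timestamp", "10 minutes") \
    .dropDuplicates(["key", "timestamp"])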
6. Performance Considerations
- Use dropDuplicates() judiciously on large datasets, as it shuffles data across the cluster.
- Specify a subset of columns to reduce the amount of data compared and improve performance.
- Use proper partitioning and caching to optimize duplicate removal, as sketched below.
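A sketch of the caching pattern, assuming the deduplicated result is reused by several downstream actions:
# Cache so the dedup shuffle runs once, not once per downstream action
df_dedup = df.dropDuplicates(["Name"]).cache()
df_dedup.count()  # materializes the cache
df_dedup.show()   # served from the cache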
7. Key Takeaways
- The dropDuplicates() command is used to remove duplicate rows from a DataFrame.
- It allows you to specify a subset of columns to consider when identifying duplicates.
- Removing duplicates can be resource-intensive for large datasets, as it shuffles data across the cluster.
- In Spark SQL, similar functionality can be achieved using SELECT DISTINCT.
- Performance on large datasets improves when combined with proper partitioning and caching.