The dropDuplicates() command in Spark removes duplicate rows from a DataFrame. It is similar to the distinct() command but more flexible: you can specify a subset of columns to consider when identifying duplicates, which is useful when duplicates should be detected by specific columns rather than by the entire row.
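
The difference is easiest to see side by side: distinct() always compares entire rows, while dropDuplicates() can compare only the columns you name. A minimal sketch (the data and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DistinctVsDropDuplicates").getOrCreate()

df = spark.createDataFrame([("Anand", 25), ("Anand", 26)], ["Name", "Age"])

print(df.distinct().count())                # 2 -- the rows differ in Age
print(df.dropDuplicates(["Name"]).count())  # 1 -- only Name is compared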


1. Syntax

PySpark:

df.dropDuplicates(subset=None)

Spark SQL:

SELECT DISTINCT col1, col2, ..., colN FROM table_name;

2. Parameters

  • subset: A list of column names (as strings) to consider when identifying duplicates. If None, all columns are considered.
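
Both call forms are shown below; note that drop_duplicates() is an alias of dropDuplicates() in PySpark.

df.dropDuplicates()                 # compare entire rows
df.dropDuplicates(["Name", "Age"])  # compare only Name and Age
df.drop_duplicates(["Name"])        # drop_duplicates is an alias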

3. Return Type

  • Returns a new DataFrame with duplicate rows removed.
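
Like other DataFrame transformations, dropDuplicates() is lazy and leaves the original DataFrame untouched:

df_no_duplicates = df.dropDuplicates()  # returns a new DataFrame; nothing executes yet
df.count()                  # the original still contains the duplicates
df_no_duplicates.count()    # triggers the job and returns the deduplicated count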

4. Examples

Example 1: Removing Duplicate Rows from a DataFrame

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DropDuplicatesExample").getOrCreate()

# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Anand", 25)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Remove duplicate rows
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()

Spark SQL:

SELECT DISTINCT * FROM people;

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 25|
|   Bala| 30|
|Kavitha| 28|
+-------+---+

Example 2: Removing Duplicates Based on a Subset of Columns

PySpark:

# Recreate the DataFrame with an additional 'Gender' column
data = [("Anand", 25, "M"), ("Bala", 30, "F"), ("Kavitha", 28, "F"), ("Anand", 25, "M")]
columns = ["Name", "Age", "Gender"]

df = spark.createDataFrame(data, columns)

# Remove duplicates based on 'Name' and 'Age'
df_no_duplicates = df.dropDuplicates(["Name", "Age"])
df_no_duplicates.show()

Spark SQL:

SELECT DISTINCT Name, Age FROM people;

Output:

+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Anand| 25|     M|
|   Bala| 30|     F|
|Kavitha| 28|     F|
+-------+---+------+
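
One caveat about the SQL version: SELECT DISTINCT Name, Age returns only those two columns, whereas dropDuplicates(["Name", "Age"]) keeps every column, as the output above shows. Also, which of the duplicate rows survives is not guaranteed. To deduplicate on a subset of columns in Spark SQL while keeping all columns (and controlling which row is kept), a window function works; a sketch, assuming the people table has the three columns above:

SELECT Name, Age, Gender
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Name, Age ORDER BY Gender) AS rn
  FROM people
) t
WHERE rn = 1;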

Example 3: Removing Duplicates with Null Values

PySpark:

# Recreate the DataFrame, adding rows with null values
data = [("Anand", 25), ("Bala", None), ("Kavitha", 28), ("Anand", 25), ("Bala", None)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Remove duplicates; nulls compare as equal, so the two ("Bala", null) rows collapse
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()

Spark SQL:

SELECT DISTINCT * FROM people;

Output:

+-------+----+
|   Name| Age|
+-------+----+
|  Anand|  25|
|   Bala|null|
|Kavitha|  28|
+-------+----+
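
Note that dropDuplicates() treats nulls as equal when comparing rows, which is why the two ("Bala", null) rows above collapse into one. It does not, however, remove rows merely because they contain nulls; that is a separate operation:

# Deduplicate, then drop any rows that still contain nulls
df_clean = df.dropDuplicates().na.drop()
df_clean.show()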

Example 4: Removing Duplicates Based on a Single Column

PySpark:

# Remove duplicates based on the 'Name' column (the result keeps all columns)
df_no_duplicates = df.dropDuplicates(["Name"])
df_no_duplicates.show()

Spark SQL:

SELECT DISTINCT Name FROM people;

Output:

+-------+----+
|   Name| Age|
+-------+----+
|  Anand|  25|
|   Bala|null|
|Kavitha|  28|
+-------+----+

Note that, unlike the SQL query, the PySpark result keeps the Age column; only the duplicate check is restricted to Name.

Example 5: Removing Duplicates with Complex Data

PySpark:

# Create DataFrame with complex data
data = [("Anand", 25, "M", "Sales"), 
        ("Bala", 30, "F", "HR"), 
        ("Kavitha", 28, "F", "HR"), 
        ("Anand", 25, "M", "Sales"), 
        ("Bala", 30, "F", "HR")]
columns = ["Name", "Age", "Gender", "Department"]

df = spark.createDataFrame(data, columns)

# Remove duplicates based on all columns
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()

Spark SQL:

SELECT DISTINCT * FROM employees;

Output:

+-------+---+------+----------+
|   Name|Age|Gender|Department|
+-------+---+------+----------+
|  Anand| 25|     M|     Sales|
|   Bala| 30|     F|        HR|
|Kavitha| 28|     F|        HR|
+-------+---+------+----------+
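
A quick way to verify how many duplicates an operation removed is to compare row counts before and after:

print(df.count())                   # 5 rows before deduplication
print(df.dropDuplicates().count())  # 3 rows after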

Example 6: Removing Duplicates with Null Values in Subset Columns

PySpark:

# Recreate the DataFrame with null values in the 'Age' column
data = [("Anand", 25, "M"), ("Bala", None, "F"), ("Kavitha", 28, "F"), ("Anand", 25, "M"), ("Bala", None, "F")]
columns = ["Name", "Age", "Gender"]

df = spark.createDataFrame(data, columns)

# Remove duplicates based on 'Name' and 'Gender'
df_no_duplicates = df.dropDuplicates(["Name", "Gender"])
df_no_duplicates.show()

Spark SQL:

SELECT DISTINCT Name, Gender FROM people;

Output:

+-------+----+------+
|   Name| Age|Gender|
+-------+----+------+
|  Anand|  25|     M|
|   Bala|null|     F|
|Kavitha|  28|     F|
+-------+----+------+

Example 7: Removing Duplicates with Custom Logic

PySpark:

from pyspark.sql.functions import col

# Recreate the DataFrame from Example 2
data = [("Anand", 25, "M"), ("Bala", 30, "F"), ("Kavitha", 28, "F"), ("Anand", 25, "M")]
df = spark.createDataFrame(data, ["Name", "Age", "Gender"])

# Filter on a condition ('Age' > 25), then remove duplicates based on 'Name'
df_no_duplicates = df.filter(col("Age") > 25).dropDuplicates(["Name"])
df_no_duplicates.show()

Spark SQL:

SELECT DISTINCT Name 
FROM people 
WHERE Age > 25;

Output:

+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|   Bala| 30|     F|
|Kavitha| 28|     F|
+-------+---+------+

5. Common Use Cases

  • Removing duplicate records from transactional data (e.g., logs, events); a sketch follows this list.
  • Ensuring uniqueness in master data (e.g., customer, product data).
  • Preparing data for machine learning by removing redundant samples.
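
For instance, deduplicating an event log typically keys on an event identifier and timestamp. A minimal sketch (the events DataFrame and its column names are hypothetical):

# Hypothetical event log where the same event was delivered twice
events = spark.createDataFrame(
    [(101, "2024-01-01 10:00:00", "click"),
     (101, "2024-01-01 10:00:00", "click"),
     (102, "2024-01-01 10:05:00", "view")],
    ["event_id", "event_time", "action"])

# Keep one row per (event_id, event_time) pair
events_deduped = events.dropDuplicates(["event_id", "event_time"])
events_deduped.show()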

6. Performance Considerations

  • Use dropDuplicates() judiciously on large datasets, as it requires shuffling data across the cluster; the sketch after this list shows the shuffle in the physical plan.
  • Specify a subset of columns to reduce the amount of data compared and improve performance.
  • Spark has no table indexes; rely instead on sensible partitioning (for example, on the deduplication keys) to keep the shuffle manageable.
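
To see the shuffle referred to above, explain() prints the physical plan: dropDuplicates() compiles to a hash aggregation with an Exchange (shuffle) on the chosen columns.

# Inspect the physical plan for the deduplication
df.dropDuplicates(["Name", "Age"]).explain()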

7. Key Takeaways

  1. The dropDuplicates() command removes duplicate rows from a DataFrame.
  2. It allows you to specify a subset of columns to consider when identifying duplicates, while keeping all columns in the result.
  3. Removing duplicates can be resource-intensive for large datasets, as it requires a shuffle.
  4. In Spark SQL, similar functionality can be achieved with SELECT DISTINCT, though DISTINCT returns only the selected columns.
  5. On large datasets, performance improves when the data is partitioned sensibly and deduplicated results that are reused are cached.