The distinct() command in Spark is used to remove duplicate rows from a DataFrame. It returns a new DataFrame containing only the unique rows, where uniqueness is determined across all columns; to deduplicate on a subset of columns, use dropDuplicates(), covered below. This is particularly useful when you need to eliminate redundant data and ensure that each row in your dataset is unique.
In PySpark, distinct() is called directly on a DataFrame, as in df.distinct(); in Spark SQL, the equivalent is the SELECT DISTINCT clause. The distinct() method does not take any parameters. It operates on the entire DataFrame by default.
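As a minimal sketch of basic usage, assume a small DataFrame of names and ages that contains duplicate rows; the sample data, column names, and view name below are illustrative rather than taken from any particular dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct_example").getOrCreate()

# Hypothetical sample data containing duplicate rows
data = [("Alice", 30), ("Bob", 25), ("Alice", 30), ("Bob", 25), ("Cathy", 28)]
df = spark.createDataFrame(data, ["name", "age"])

# Remove duplicate rows, comparing every column
df.distinct().show()

# Spark SQL equivalent: register a temp view and use SELECT DISTINCT
df.createOrReplaceTempView("people")
spark.sql("SELECT DISTINCT * FROM people").show()
```

Both calls return the same three unique rows (Alice/30, Bob/25, Cathy/28); the row order of the result is not guaranteed, because distinct() shuffles the data.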
dropDuplicates() for a Subset of Columns

While distinct() always compares every column, dropDuplicates() accepts an optional list of column names and removes duplicates based only on those columns.
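Reusing the hypothetical people DataFrame from the sketch above, and assuming we want just one row per name:

```python
# Keep one row per distinct name; which of the duplicate rows survives is arbitrary
df.dropDuplicates(["name"]).show()

# SELECT DISTINCT on a single column returns only that column,
# whereas dropDuplicates(["name"]) keeps all columns of the surviving row
spark.sql("SELECT DISTINCT name FROM people").show()
```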
Use distinct() judiciously on large datasets, as it requires shuffling data across the cluster, which can be expensive. Prefer dropDuplicates() when you only need to remove duplicates based on a subset of columns, as it can be more efficient.
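One way to see where the cost comes from is to inspect the physical plan; this sketch again assumes the hypothetical people DataFrame from above.

```python
# The physical plan for distinct() includes an Exchange (shuffle) stage
df.distinct().explain()

# Deduplicating on a subset of columns still shuffles,
# but the aggregation key is only the listed columns
df.dropDuplicates(["name"]).explain()
```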
In summary, the distinct() command is used to remove duplicate rows from a DataFrame, and the equivalent in Spark SQL is SELECT DISTINCT.