The drop() command in Spark is used to remove one or more columns from a DataFrame. This is particularly useful when you need to clean up your dataset by removing unnecessary or redundant columns.


1. Syntax

PySpark:

df.drop(*cols)

Spark SQL:

SELECT col1, col2, ..., colN FROM table_name;

(Exclude the columns you want to drop from the SELECT statement.)

2. Parameters

  • cols: One or more column names (as strings), or a single Column object, identifying the columns to drop. A Python list of names can be passed by unpacking it with * (see Example 3). Names that do not exist in the DataFrame are silently ignored.
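
A quick sketch of the accepted argument forms, using a throwaway session and DataFrame (the names here are illustrative, not part of the examples below):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DropArgs").getOrCreate()
df = spark.createDataFrame([("Anand", 25, "M")], ["Name", "Age", "Gender"])

df.drop("Gender")         # a single name as a string
df.drop(col("Gender"))    # a single Column object
df.drop("Age", "Gender")  # several names at once
df.drop("NoSuchColumn")   # no-op: names not in the schema are ignored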

3. Return Type

  • Returns a new DataFrame with the specified columns removed. The original DataFrame is left unchanged, since DataFrames are immutable.
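
Because DataFrames are immutable, drop() never modifies its input; a small check (continuing the sketch above) illustrates this:

df2 = df.drop("Age")
print(df.columns)   # ['Name', 'Age', 'Gender'] - the original is untouched
print(df2.columns)  # ['Name', 'Gender']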

4. Examples

Example 1: Dropping a Single Column

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DropExample").getOrCreate()

data = [("Anand", 25, "M"), ("Bala", 30, "F"), ("Kavitha", 28, "F"), ("Raj", 35, "M")]
columns = ["Name", "Age", "Gender"]

df = spark.createDataFrame(data, columns)

# Dropping the 'Gender' column
df_dropped = df.drop("Gender")
df_dropped.show()

Spark SQL:

SELECT Name, Age FROM people;

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 25|
|   Bala| 30|
|Kavitha| 28|
|    Raj| 35|
+-------+---+

Example 2: Dropping Multiple Columns

PySpark:

# Dropping multiple columns
df_dropped = df.drop("Age", "Gender")
df_dropped.show()

Spark SQL:

SELECT Name FROM people;

Output:

+-------+
|   Name|
+-------+
|  Anand|
|   Bala|
|Kavitha|
|    Raj|
+-------+

Example 3: Dropping Columns Using a List

PySpark:

# Dropping columns using a list
columns_to_drop = ["Age", "Gender"]
df_dropped = df.drop(*columns_to_drop)
df_dropped.show()

Spark SQL:

SELECT Name FROM people;

Output:

+-------+
|   Name|
+-------+
|  Anand|
|   Bala|
|Kavitha|
|    Raj|
+-------+

Example 4: Dropping Columns with Special Characters

PySpark:

# Dropping a column with special characters
df_with_special_chars = df.withColumnRenamed("Gender", "Gender/Sex")
df_dropped = df_with_special_chars.drop("Gender/Sex")
df_dropped.show()

Spark SQL:

SELECT Name, Age FROM people;

-- If the renamed column had to be referenced directly, the special
-- character would require backtick quoting, e.g. SELECT `Gender/Sex` FROM people;

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 25|
|   Bala| 30|
|Kavitha| 28|
|    Raj| 35|
+-------+---+

Example 5: Dropping Columns in a DataFrame with Nested Structures

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Address", StructType([
        StructField("City", StringType(), True),
        StructField("State", StringType(), True)
    ]), True)
])

data = [("Anand", 25, ("Chennai", "Tamil Nadu")), 
        ("Bala", 30, ("Bangalore", "Karnataka")), 
        ("Kavitha", 28, ("Hyderabad", "Telangana"))]

df = spark.createDataFrame(data, schema)

# drop("Address.City") would be a no-op: drop() cannot reach inside a struct.
# Instead, rebuild the struct without the field via dropFields() (Spark 3.1+).
from pyspark.sql.functions import col

df_dropped = df.withColumn("Address", col("Address").dropFields("City"))
df_dropped.show(truncate=False)

Output:

+-------+---+------------+
|Name   |Age|Address     |
+-------+---+------------+
|Anand  |25 |{Tamil Nadu}|
|Bala   |30 |{Karnataka} |
|Kavitha|28 |{Telangana} |
+-------+---+------------+

Example 6: Dropping Columns Dynamically

PySpark:

# Recreate the four-row DataFrame from Example 1 (df was redefined in Example 5)
df = spark.createDataFrame(
    [("Anand", 25, "M"), ("Bala", 30, "F"), ("Kavitha", 28, "F"), ("Raj", 35, "M")],
    ["Name", "Age", "Gender"])

# Dropping columns dynamically; each drop() returns a new DataFrame,
# so the result must be reassigned on every iteration
columns_to_drop = ["Age", "Gender"]

for column in columns_to_drop:
    df = df.drop(column)

df.show()

Output:

+-------+
|   Name|
+-------+
|  Anand|
|   Bala|
|Kavitha|
|    Raj|
+-------+
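
Note that the loop is interchangeable with the single call df.drop(*columns_to_drop) shown in Example 3: each drop() only adds a projection to the lazy query plan, and Catalyst collapses adjacent projections, so both forms should compile to the same physical plan.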

5. Common Use Cases

  • Removing sensitive or irrelevant data before sharing or analysis.
  • Preparing data for machine learning by removing irrelevant features.
  • Cleaning up redundant columns left behind by joins or other transformations (see the sketch below).
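
As a minimal sketch of the post-join cleanup case, consider two hypothetical tables (customers and orders are illustrative names, not from the examples above) that share an id key. After an equi-join, both id columns survive; passing the Column object to drop() disambiguates which one to remove:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DropAfterJoin").getOrCreate()

customers = spark.createDataFrame([(1, "Anand"), (2, "Bala")], ["id", "Name"])
orders = spark.createDataFrame([(1, 450), (2, 900)], ["id", "Amount"])

# The join result carries both customers.id and orders.id; dropping by the
# Column object (rather than the ambiguous string "id") removes just one
joined = customers.join(orders, customers.id == orders.id)
deduped = joined.drop(orders.id)
deduped.show()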

6. Performance Considerations

  • Dropping columns is a lightweight, lazy transformation: it only adjusts the query plan (a projection) and moves no data at the point where it is called.
  • Dropping unused columns early in a pipeline lets Spark prune them at read time (notably from columnar formats such as Parquet) and reduces the data carried through later stages (see the plan inspection below).
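
One way to verify this yourself is to inspect the query plan; the physical plan for a drop() should contain only a column projection over the scan, with no shuffle. A minimal sketch, reusing the Example 1 data:

df = spark.createDataFrame(
    [("Anand", 25, "M"), ("Bala", 30, "F")], ["Name", "Age", "Gender"])

# explain() prints the plan; drop() appears as a projection only
df.drop("Gender").explain()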

7. Key Takeaways

  1. The drop() command removes one or more columns from a DataFrame and returns a new DataFrame.
  2. Columns can be dropped by name, by Column object, or by unpacking a list of names.
  3. Dropping columns is a lazy, plan-only operation; no data is moved or rewritten, which makes it very efficient.
  4. In Spark SQL, the same result is achieved by excluding the unwanted columns from the SELECT list.
  5. It scales to large, wide datasets because dropped columns can be pruned at the source instead of being read and discarded.