The drop() command in Spark is used to remove one or more columns from a DataFrame. This is particularly useful when you need to clean up your dataset by removing unnecessary or redundant columns.
1. Syntax
PySpark:
DataFrame.drop(*cols)
Spark SQL:
SELECT col1, col2, ..., colN FROM table_name;
(Exclude the columns you want to drop from the SELECT statement.)
2. Parameters
- cols: One or more column names (as strings) or Column objects to drop from the DataFrame. Names that do not match any column are silently ignored rather than raising an error.
3. Return Type
- Returns a new DataFrame with the specified columns removed.
4. Examples
Example 1: Dropping a Single Column
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DropExample").getOrCreate()
data = [("Anand", 25, "M"), ("Bala", 30, "F"), ("Kavitha", 28, "F"), ("Raj", 35, "M")]
columns = ["Name", "Age", "Gender"]
df = spark.createDataFrame(data, columns)
# Dropping the 'Gender' column
df_dropped = df.drop("Gender")
df_dropped.show()
Spark SQL:
SELECT Name, Age FROM people;
Output:
+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 25|
|   Bala| 30|
|Kavitha| 28|
|    Raj| 35|
+-------+---+
Example 2: Dropping Multiple Columns
PySpark:
# Dropping multiple columns
df_dropped = df.drop("Age", "Gender")
df_dropped.show()
Spark SQL:
SELECT Name FROM people;
Output:
+-------+
|   Name|
+-------+
|  Anand|
|   Bala|
|Kavitha|
|    Raj|
+-------+
Example 3: Dropping Columns Using a List
PySpark:
# Dropping columns using a list
columns_to_drop = ["Age", "Gender"]
df_dropped = df.drop(*columns_to_drop)
df_dropped.show()
Spark SQL:
SELECT Name FROM people;
Output:
+-------+
|   Name|
+-------+
|  Anand|
|   Bala|
|Kavitha|
|    Raj|
+-------+
Example 4: Dropping Columns with Special Characters
PySpark:
# Dropping a column with special characters
df_with_special_chars = df.withColumnRenamed("Gender", "Gender/Sex")
df_dropped = df_with_special_chars.drop("Gender/Sex")
df_dropped.show()
Spark SQL:
SELECT Name, Age FROM people;
Output:
+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 25|
|   Bala| 30|
|Kavitha| 28|
|    Raj| 35|
+-------+---+
Example 5: Dropping Columns in a DataFrame with Nested Structures
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Address", StructType([
        StructField("City", StringType(), True),
        StructField("State", StringType(), True)
    ]), True)
])
data = [("Anand", 25, ("Chennai", "Tamil Nadu")),
        ("Bala", 30, ("Bangalore", "Karnataka")),
        ("Kavitha", 28, ("Hyderabad", "Telangana"))]
df = spark.createDataFrame(data, schema)
# Dropping a nested field: drop("Address.City") would be a no-op, because
# drop() only matches top-level column names. Use Column.dropFields()
# (Spark 3.1+) to remove a field from a struct.
from pyspark.sql.functions import col
df_dropped = df.withColumn("Address", col("Address").dropFields("City"))
df_dropped.show(truncate=False)
Output:
+-------+---+------------+
|Name   |Age|Address     |
+-------+---+------------+
|Anand  |25 |{Tamil Nadu}|
|Bala   |30 |{Karnataka} |
|Kavitha|28 |{Telangana} |
+-------+---+------------+
Example 6: Dropping Columns Dynamically
PySpark:
# Dropping columns dynamically
columns_to_drop = ["Age", "Gender"]
for column in columns_to_drop:
    df = df.drop(column)
df.show()
Output:
+-------+
|   Name|
+-------+
|  Anand|
|   Bala|
|Kavitha|
|    Raj|
+-------+
5. Common Use Cases
- Removing sensitive or irrelevant data before sharing or analysis.
- Preparing data for machine learning by removing irrelevant features.
- Cleaning up data after joins or transformations to remove redundant columns.
6. Performance Considerations
- Dropping columns is a lightweight operation as it only changes metadata and does not involve data transformation.
- Use this command to avoid data duplication or unnecessary transformations.
7. Key Takeaways
- The drop() command is used to remove one or more columns from a DataFrame.
- It allows you to drop columns by specifying their names or by unpacking a list of column names.
- Dropping columns is a metadata operation and does not involve data movement, making it very efficient.
- In Spark SQL, similar functionality can be achieved by excluding columns from the SELECT statement.
- Works efficiently on large datasets as it does not involve data transformation.