Spark: drop function
The drop() command in Spark is used to remove one or more columns from a DataFrame. This is particularly useful when you need to clean up a dataset by removing unnecessary or redundant columns.
1. Syntax
PySpark: DataFrame.drop(*cols)
Spark SQL: there is no direct equivalent; exclude the columns you want to drop from the SELECT statement.
2. Parameters
- cols: one or more column names (as strings) or a single Column object to be dropped from the DataFrame. Names that do not match an existing column are silently ignored.
3. Return Type
- Returns a new DataFrame with the specified columns removed.
4. Examples
Example 1: Dropping a Single Column
Example 2: Dropping Multiple Columns
Example 3: Dropping Columns Using a List
Example 4: Dropping Columns with Special Characters
Example 5: Dropping Columns in a DataFrame with Nested Structures
Example 6: Dropping Columns Dynamically
5. Common Use Cases
- Removing sensitive or irrelevant data before sharing or analysis.
- Preparing data for machine learning by removing irrelevant features.
- Cleaning up data after joins or transformations to remove redundant columns.
6. Performance Considerations
- Dropping columns is a lightweight, lazy operation: it only rewrites the query plan's projection and does not scan or move data.
- Dropping unused columns early keeps downstream transformations simpler and can reduce the amount of data carried through shuffles and writes.
7. Key Takeaways
- The drop() command removes one or more columns from a DataFrame.
- Columns can be dropped by passing names as strings, a Column object, or an unpacked list of names.
- Dropping columns is a metadata-level operation with no data movement, so it is efficient even on large datasets.
- In Spark SQL, the same result is achieved by excluding columns from the SELECT statement.