The columns attribute in Spark is used to retrieve the list of column names in a DataFrame. It provides a quick and easy way to inspect the structure of the DataFrame and access the names of all columns. This is particularly useful for debugging, data exploration, and dynamic column access.
1. Syntax
PySpark:
df.columns
Spark SQL:
- There is no direct equivalent in Spark SQL, but you can use DESCRIBE table_name to achieve similar results.
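For illustration, a minimal sketch of the Spark SQL route (assuming a DataFrame df registered as a temporary view; the view name "people" is hypothetical):
df.createOrReplaceTempView("people")
# DESCRIBE returns one row per column; the col_name field holds the column name
names = [row.col_name for row in spark.sql("DESCRIBE people").collect()]
print(names)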
2. Return Type
- Returns a list of strings, where each string is the name of a column in the DataFrame.
3. Key Features
- Column Names: Provides a list of all column names in the DataFrame.
- Efficient: It is a metadata operation and does not involve data processing.
- Dynamic Access: Useful for dynamically accessing or manipulating columns.
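As a quick sketch of dynamic access (assuming an existing DataFrame df, such as the one created in Example 1 below), every column can be renamed in one pass:
# Rename all columns to lowercase by rebuilding the name list
df_lower = df.toDF(*[c.lower() for c in df.columns])
print(df_lower.columns)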
4. Examples
Example 1: Retrieving Column Names of a DataFrame
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ColumnsExample").getOrCreate()
# Create DataFrame
data = [("Anand", 25, 3000), ("Bala", 30, 4000), ("Kavitha", 28, 3500), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)
# Retrieve column names
print(df.columns)
Output:
['Name', 'Age', 'Salary']
Example 2: Using Column Names for Dynamic Column Access
PySpark:
# Access columns dynamically
for column in df.columns:
    print(f"Column: {column}, Data Type: {df.schema[column].dataType}")
Output:
Column: Name, Data Type: StringType
Column: Age, Data Type: LongType
Column: Salary, Data Type: LongType
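A related shortcut worth knowing: the dtypes attribute pairs each column name with its simple type string, which is often enough for quick checks (a sketch using the df from Example 1; note that Python ints are inferred as bigint):
# dtypes returns (name, type-string) tuples
print(df.dtypes)
# [('Name', 'string'), ('Age', 'bigint'), ('Salary', 'bigint')]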
Example 3: Filtering Columns Based on Name
PySpark:
# Filter columns that start with 'A'
filtered_columns = [col for col in df.columns if col.startswith("A")]
print(filtered_columns)
Output:
['Age']
Example 4: Retrieving Column Names of a DataFrame with Nested Structures
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
# Define schema with nested structures
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Skills", ArrayType(StringType()), True)
])
# Create DataFrame with nested data
data = [("Anand", 25, ["Java", "Python"]),
("Bala", 30, ["Scala", "Spark"]),
("Kavitha", 28, ["SQL", "Hadoop"])]
df = spark.createDataFrame(data, schema)
# Retrieve column names
print(df.columns)
Output:
['Name', 'Age', 'Skills']
Example 5: Retrieving Column Names of a DataFrame with Structs
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define schema with structs
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StructType([
        StructField("City", StringType(), True),
        StructField("State", StringType(), True)
    ]), True)
])
# Create DataFrame with struct data
data = [("Anand", ("Chennai", "Tamil Nadu")),
("Bala", ("Bangalore", "Karnataka")),
("Kavitha", ("Hyderabad", "Telangana"))]
df = spark.createDataFrame(data, schema)
# Retrieve column names
print(df.columns)
Output:
['Name', 'Address']
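Note that columns returns only the top-level names; the fields inside the Address struct are not flattened into the list. A small sketch for listing the nested field names via the schema (using the df from this example):
# Drill into the Address struct and list its nested fields
for field in df.schema["Address"].dataType.fields:
    print(f"Address.{field.name}")
# Address.City
# Address.State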
Example 6: Retrieving Column Names of a DataFrame with Timestamps
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from datetime import datetime
# Define schema with a timestamp column
schema = StructType([
    StructField("Event", StringType(), True),
    StructField("Timestamp", TimestampType(), True)
])
# Create DataFrame (TimestampType expects datetime objects, not plain strings)
data = [("Login", datetime(2023, 10, 1, 10, 0, 0)),
        ("Logout", datetime(2023, 10, 1, 12, 0, 0))]
df = spark.createDataFrame(data, schema)
# Retrieve column names
print(df.columns)
Output:
['Event', 'Timestamp']
Example 7: Retrieving Column Names of a DataFrame with Maps
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType
# Define schema with a map column
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", MapType(StringType(), IntegerType()), True)
])
# Create DataFrame
data = [("Anand", {"Java": 5, "Python": 3}),
("Bala", {"Scala": 4, "Spark": 2})]
df = spark.createDataFrame(data, schema)
# Retrieve column names
print(df.columns)
Output:
['Name', 'Skills']
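Because columns is a plain Python list, membership checks make a handy guard before touching a column (a sketch using the df from this example):
# Only select Skills if it actually exists in the DataFrame
if "Skills" in df.columns:
    df.select("Name", "Skills").show()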
5. Common Use Cases
- Inspecting the structure of a DataFrame after reading data.
- Dynamically accessing or manipulating columns.
- Filtering or selecting columns based on their names.
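For instance, a brief sketch of the last use case, dropping a column by name (assuming the df from Example 1):
# Keep every column except Salary
subset = [c for c in df.columns if c != "Salary"]
df.select(*subset).show()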
6. Performance Considerations
- Using columns is lightweight and does not involve data movement or processing.
- It is particularly useful for debugging and understanding the structure of complex DataFrames.
7. Key Takeaways
- Purpose: The columns attribute is used to retrieve the list of column names in a DataFrame.
- Column Names: Provides a quick way to inspect the structure of the DataFrame.
- In Spark SQL, similar functionality can be achieved using DESCRIBE table_name.