The columns attribute of a Spark DataFrame returns the list of the DataFrame's column names. Because it only reads the schema, it offers a quick and easy way to inspect the structure of a DataFrame, which is particularly useful for debugging, data exploration, and dynamic column access.

1. Syntax

PySpark:

df.columns

Spark SQL:

  • There is no direct equivalent in Spark SQL, but DESCRIBE table_name returns similar information, one row per column (see the sketch below).
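
A minimal sketch, assuming a SparkSession spark and a DataFrame df are already available; the view name people is illustrative:

# Register the DataFrame as a temporary view, then describe it with SQL
df.createOrReplaceTempView("people")
spark.sql("DESCRIBE people").show()  # one row per column: col_name, data_type, comment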

2. Return Type

  • Returns a Python list of strings, one per column, in schema order.

3. Key Features

  • Column Names: Provides a list of all column names in the DataFrame.
  • Efficient: It is a metadata-only operation; no Spark job is triggered and no data is read.
  • Dynamic Access: Useful for dynamically accessing or manipulating columns (see the sketch after this list).
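
Because columns is a plain Python list, it can drive bulk transformations. A minimal, self-contained sketch (the DataFrame here is illustrative) that lowercases every column name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ColumnsSketch").getOrCreate()
df = spark.createDataFrame([("Anand", 25)], ["Name", "Age"])

# Rebuild the column list in lowercase and apply it with toDF
df_lower = df.toDF(*[c.lower() for c in df.columns])
print(df_lower.columns)  # ['name', 'age']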

4. Examples

Example 1: Retrieving Column Names of a DataFrame

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ColumnsExample").getOrCreate()

# Create DataFrame
data = [("Anand", 25, 3000), ("Bala", 30, 4000), ("Kavitha", 28, 3500), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]

df = spark.createDataFrame(data, columns)

# Retrieve column names
print(df.columns)

Output:

['Name', 'Age', 'Salary']

Example 2: Using Column Names for Dynamic Column Access

PySpark:

# Access columns dynamically
for column in df.columns:
    print(f"Column: {column}, Data Type: {df.schema[column].dataType}")

Output:

Column: Name, Data Type: StringType
Column: Age, Data Type: LongType
Column: Salary, Data Type: LongType

Note: because the DataFrame in Example 1 was created without an explicit schema, PySpark infers Python integers as LongType, not IntegerType. The exact repr of each type (for example, LongType vs LongType()) varies slightly across PySpark versions.

Example 3: Filtering Columns Based on Name

PySpark:

# Filter columns that start with 'A'
filtered_columns = [col for col in df.columns if col.startswith("A")]
print(filtered_columns)

Output:

['Age']
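
The filtered list can be passed straight to select, which accepts a list of column names. A minimal continuation of the example above:

# Keep only the matching columns
df.select(filtered_columns).show()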

Example 4: Retrieving Column Names of a DataFrame with Nested Structures

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define schema with nested structures
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Skills", ArrayType(StringType()), True)
])

# Create DataFrame with nested data
data = [("Anand", 25, ["Java", "Python"]), 
        ("Bala", 30, ["Scala", "Spark"]), 
        ("Kavitha", 28, ["SQL", "Hadoop"])]
df = spark.createDataFrame(data, schema)

# Retrieve column names
print(df.columns)

Output:

['Name', 'Age', 'Skills']

Example 5: Retrieving Column Names of a DataFrame with Structs

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema with structs
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StructType([
        StructField("City", StringType(), True),
        StructField("State", StringType(), True)
    ]), True)
])

# Create DataFrame with struct data
data = [("Anand", ("Chennai", "Tamil Nadu")), 
        ("Bala", ("Bangalore", "Karnataka")), 
        ("Kavitha", ("Hyderabad", "Telangana"))]
df = spark.createDataFrame(data, schema)

# Retrieve column names
print(df.columns)

Output:

['Name', 'Address']
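
Note that columns reports only top-level names; the nested City and State fields inside Address are not listed. As a sketch, one way to see the flattened names is to expand the struct first:

# Expand the struct so its fields become top-level columns
print(df.select("Name", "Address.*").columns)
# ['Name', 'City', 'State']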

Example 6: Retrieving Column Names of a DataFrame with Timestamps

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Define schema with a timestamp column
schema = StructType([
    StructField("Event", StringType(), True),
    StructField("Timestamp", TimestampType(), True)
])

from datetime import datetime

# Create DataFrame (TimestampType expects datetime objects, not strings)
data = [("Login", datetime(2023, 10, 1, 10, 0, 0)), 
        ("Logout", datetime(2023, 10, 1, 12, 0, 0))]
df = spark.createDataFrame(data, schema)

# Retrieve column names
print(df.columns)

Output:

['Event', 'Timestamp']

Example 7: Retrieving Column Names of a DataFrame with Maps

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType

# Define schema with a map column
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", MapType(StringType(), IntegerType()), True)
])

# Create DataFrame
data = [("Anand", {"Java": 5, "Python": 3}), 
        ("Bala", {"Scala": 4, "Spark": 2})]
df = spark.createDataFrame(data, schema)

# Retrieve column names
print(df.columns)

Output:

['Name', 'Skills']

5. Common Use Cases

  • Inspecting the structure of a DataFrame after reading data.
  • Dynamically accessing or manipulating columns.
  • Filtering or selecting columns based on their names (see the sketch below).
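
For example, a minimal self-contained sketch of name-based selection (the DataFrame is illustrative) that drops one column without hard-coding the full select list:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UseCaseSketch").getOrCreate()
df = spark.createDataFrame([("Anand", 25, 3000)], ["Name", "Age", "Salary"])

# Keep every column except Salary
keep = [c for c in df.columns if c != "Salary"]
print(df.select(keep).columns)  # ['Name', 'Age']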

6. Performance Considerations

  • Using columns is lightweight and does not involve data movement or processing.
  • It is particularly useful for debugging and understanding the structure of complex DataFrames.

7. Key Takeaways

  1. Purpose: The columns attribute is used to retrieve the list of column names in a DataFrame.
  2. Column Names: Provides a quick way to inspect the structure of the DataFrame.
  3. Spark SQL: Similar functionality can be achieved using DESCRIBE table_name.