Spark: printSchema function

The printSchema() function in Spark is used to display the schema of a DataFrame or Dataset. It provides a tree-like structure that shows the column names, data types, and whether the columns are nullable. This is particularly useful for understanding the structure of the data and debugging schema-related issues.

1. Syntax

PySpark:

df.printSchema()

Spark SQL:

There is no direct equivalent in Spark SQL, but you can use DESCRIBE table_name to achieve similar results.

2. Key Features

Schema Representation: Displays the schema in a tree-like format.
Column Details: Shows column names, data types, and nullability.
Nested Structures: Handles nested structures (e.g., arrays, structs) by displaying them hierarchically.

3. Examples

Example 1: Displaying the Schema of a Simple DataFrame

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PrintSchemaExample").getOrCreate()

# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Display the schema
df.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)

Example 2: Displaying the Schema of a DataFrame with Nested Structures

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define schema with nested structures
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Skills", ArrayType(StringType()), True)
])

# Create DataFrame with nested data
data = [("Anand", 25, ["Java", "Python"]), 
        ("Bala", 30, ["Scala", "Spark"]), 
        ("Kavitha", 28, ["SQL", "Hadoop"])]
df = spark.createDataFrame(data, schema)

# Display the schema
df.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Skills: array (nullable = true)
 |    |-- element: string (containsNull = true)

Example 3: Displaying the Schema of a DataFrame with Structs

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema with structs
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StructType([
        StructField("City", StringType(), True),
        StructField("State", StringType(), True)
    ]), True)
])

# Create DataFrame with struct data
data = [("Anand", ("Chennai", "Tamil Nadu")), 
        ("Bala", ("Bangalore", "Karnataka")), 
        ("Kavitha", ("Hyderabad", "Telangana"))]
df = spark.createDataFrame(data, schema)

# Display the schema
df.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Address: struct (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- State: string (nullable = true)

Example 4: Displaying the Schema of a DataFrame with Nullable Columns

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema with nullable and non-nullable columns
schema = StructType([
    StructField("Name", StringType(), False),  # Non-nullable
    StructField("Age", IntegerType(), True)    # Nullable
])

# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
df = spark.createDataFrame(data, schema)

# Display the schema
df.printSchema()

Output:

root
 |-- Name: string (nullable = false)
 |-- Age: integer (nullable = true)

Example 5: Displaying the Schema of a DataFrame with Timestamps

PySpark:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Define schema with a timestamp column
schema = StructType([
    StructField("Event", StringType(), True),
    StructField("Timestamp", TimestampType(), True)
])

# Create DataFrame (TimestampType columns require datetime objects, not strings)
data = [("Login", datetime(2023, 10, 1, 10, 0, 0)),
        ("Logout", datetime(2023, 10, 1, 12, 0, 0))]
df = spark.createDataFrame(data, schema)

# Display the schema
df.printSchema()

Output:

root
 |-- Event: string (nullable = true)
 |-- Timestamp: timestamp (nullable = true)

Example 6: Displaying the Schema of a DataFrame with Maps

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType

# Define schema with a map column
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", MapType(StringType(), IntegerType()), True)
])

# Create DataFrame
data = [("Anand", {"Java": 5, "Python": 3}), 
        ("Bala", {"Scala": 4, "Spark": 2})]
df = spark.createDataFrame(data, schema)

# Display the schema
df.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Skills: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

4. Common Use Cases

Inspecting the schema of a DataFrame after reading data.
Debugging schema mismatches or errors.
Verifying the schema after transformations or joins.

5. Performance Considerations

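Using printSchema() is lightweight: it reads the schema from the DataFrame's metadata and does not trigger a job, scan rows, or move data. Any cost of schema inference (for example, reading a CSV file with inferSchema enabled) is paid when the DataFrame is created, not when printSchema() is called.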
It is particularly useful for debugging and understanding the structure of complex DataFrames.

6. Key Takeaways

The printSchema() function is used to display the schema of a DataFrame or Dataset.
It provides a tree-like structure that shows column names, data types, and nullability.
printSchema() is a metadata operation and does not involve data processing, making it very efficient.
In Spark SQL, similar functionality can be achieved using DESCRIBE table_name.
