The dtypes attribute of a PySpark DataFrame returns the DataFrame's schema as a list of tuples, where each tuple contains a column name and the string representation of its data type. This is particularly useful for inspecting the structure of the data and checking the data type of each column.


1. Syntax

PySpark:

df.dtypes

Spark SQL:

  • There is no direct equivalent in Spark SQL, but DESCRIBE table_name produces similar information (see the sketch below).
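
For reference, a minimal sketch of the Spark SQL route, assuming a DataFrame df has been registered as a temporary view (the view name people is illustrative):

# Expose the DataFrame to Spark SQL as a temporary view
df.createOrReplaceTempView("people")

# DESCRIBE lists each column with its name and data type
spark.sql("DESCRIBE people").show()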

2. Return Type

  • Returns a plain Python list of tuples (see the sketch after this list), where each tuple contains:
    • Column Name: The name of the column.
    • Data Type: The data type of the column (e.g., string, int, double).
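
Because the return value is an ordinary Python list, it can be looped over or converted to a dict; a minimal sketch, assuming a DataFrame df already exists:

# Each entry is a (column_name, type_string) tuple
for name, dtype in df.dtypes:
    print(f"{name}: {dtype}")

# A dict makes per-column type lookups convenient
type_lookup = dict(df.dtypes)
print(type_lookup["Age"])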

3. Key Features

  • Schema Inspection: Provides a quick way to inspect the schema of a DataFrame.
  • Data Types: Lists the data types of all columns in the DataFrame.
  • Efficient: It is a metadata operation and does not involve data processing.

4. Examples

Example 1: Retrieving Data Types of All Columns

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DtypesExample").getOrCreate()

# Create DataFrame
data = [("Anand", 25, 3000.50), ("Bala", 30, 4000.75), ("Kavitha", 28, 3500.25)]
columns = ["Name", "Age", "Salary"]

df = spark.createDataFrame(data, columns)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Name', 'string'), ('Age', 'bigint'), ('Salary', 'double')]

Note that because the schema is inferred from Python objects here, the integer column is reported as bigint (LongType); the later examples use explicit schemas, where IntegerType appears as int.

Example 2: Retrieving Data Types of a DataFrame with Nested Structures

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define schema with nested structures
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Skills", ArrayType(StringType()), True)
])

# Create DataFrame with nested data
data = [("Anand", 25, ["Java", "Python"]), 
        ("Bala", 30, ["Scala", "Spark"]), 
        ("Kavitha", 28, ["SQL", "Hadoop"])]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Name', 'string'), ('Age', 'int'), ('Skills', 'array<string>')]

Example 3: Retrieving Data Types of a DataFrame with Structs

PySpark:

from pyspark.sql.types import StructType, StructField, StringType

# Define schema with structs
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StructType([
        StructField("City", StringType(), True),
        StructField("State", StringType(), True)
    ]), True)
])

# Create DataFrame with struct data
data = [("Anand", ("Chennai", "Tamil Nadu")), 
        ("Bala", ("Bangalore", "Karnataka")), 
        ("Kavitha", ("Hyderabad", "Telangana"))]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Name', 'string'), ('Address', 'struct<City:string,State:string>')]

Example 4: Retrieving Data Types of a DataFrame with Nullable Columns

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema with nullable and non-nullable columns
schema = StructType([
    StructField("Name", StringType(), False),  # Non-nullable
    StructField("Age", IntegerType(), True)    # Nullable
])

# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Name', 'string'), ('Age', 'int')]

Note that dtypes reports only column names and types; the nullability flags defined in the schema are not reflected in the output (use df.schema or df.printSchema() to see them).

Example 5: Retrieving Data Types of a DataFrame with Timestamps

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from datetime import datetime

# Define schema with a timestamp column
schema = StructType([
    StructField("Event", StringType(), True),
    StructField("Timestamp", TimestampType(), True)
])

# Create DataFrame (TimestampType expects datetime objects, not strings)
data = [("Login", datetime(2023, 10, 1, 10, 0, 0)), 
        ("Logout", datetime(2023, 10, 1, 12, 0, 0))]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Event', 'string'), ('Timestamp', 'timestamp')]

Example 6: Retrieving Data Types of a DataFrame with Maps

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType

# Define schema with a map column
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", MapType(StringType(), IntegerType()), True)
])

# Create DataFrame
data = [("Anand", {"Java": 5, "Python": 3}), 
        ("Bala", {"Scala": 4, "Spark": 2})]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Name', 'string'), ('Skills', 'map<string,int>')]

5. Common Use Cases

  • Inspecting the schema of a DataFrame after reading data.
  • Debugging schema mismatches or errors (see the sketch after this list).
  • Verifying the schema after transformations or joins.
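
As an illustration of the debugging use case, the sketch below compares the actual dtypes of a DataFrame df against a hypothetical expected_types mapping (both names are illustrative):

# Hypothetical expected schema, defined for illustration only
expected_types = {"Name": "string", "Age": "int", "Salary": "double"}

# Compare actual column types against the expectation and report mismatches
actual_types = dict(df.dtypes)
for column, expected in expected_types.items():
    actual = actual_types.get(column)
    if actual != expected:
        print(f"Schema mismatch for {column}: expected {expected}, got {actual}")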

6. Performance Considerations

  • Using dtypes is lightweight: it is a metadata-only operation and does not trigger any data movement or processing (see the sketch below).
  • It is particularly useful for debugging and understanding the structure of complex DataFrames.
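
One way to see the difference in practice, assuming an active SparkSession and a DataFrame df, is to contrast dtypes with an action that actually scans the data:

# Reads only schema metadata; no Spark job is launched
print(df.dtypes)

# An action such as count() does trigger data processing
print(df.count())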

7. Key Takeaways

  1. Purpose: The dtypes attribute retrieves the schema of a DataFrame as a list of (column name, data type) tuples.
  2. Schema Inspection: It provides a quick way to inspect the schema and data types of all columns.
  3. Efficiency: dtypes is a metadata operation and does not involve data processing, making it very cheap to call.
  4. Spark SQL: Similar information can be obtained with DESCRIBE table_name.