Spark: dtypes attribute

The dtypes attribute of a Spark DataFrame returns the column names and data types as a list of tuples, one (column name, data type) tuple per column. It is a quick way to inspect the structure of the data and check the type of each column.

1. Syntax

PySpark:

df.dtypes

Spark SQL:

There is no direct equivalent in Spark SQL, but you can use DESCRIBE table_name to achieve similar results.

2. Return Type

Returns a list of tuples, one per column in column order, where each tuple contains two strings:
- Column Name: The name of the column.
- Data Type: The data type of the column as a short type string (e.g., 'string', 'int', 'double').
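
Because the return value is an ordinary Python list of (name, type) string pairs, it can be handled with standard list and dict operations. The sketch below uses a literal list in the shape dtypes returns, rather than a live DataFrame, so it runs without a Spark session; the column names and the numeric_types set are illustrative assumptions.

```python
# Stand-in for df.dtypes: a list of (column name, type string) tuples.
dtypes = [("Name", "string"), ("Age", "int"), ("Salary", "double")]

# Convert to a dict for lookup by column name.
type_of = dict(dtypes)
print(type_of["Age"])  # int

# Select the names of columns whose type string is in a chosen set.
numeric_types = {"int", "bigint", "float", "double"}
numeric_cols = [name for name, dtype in dtypes if dtype in numeric_types]
print(numeric_cols)  # ['Age', 'Salary']
```

With a real DataFrame, the first assignment would be replaced by dtypes = df.dtypes.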

3. Key Features

Schema Inspection: Provides a quick way to inspect the schema of a DataFrame.
Data Types: Lists the data types of all columns in the DataFrame.
Efficient: It is a metadata operation and does not involve data processing.

4. Examples

Example 1: Retrieving Data Types of All Columns

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DtypesExample").getOrCreate()

# Create DataFrame
data = [("Anand", 25, 3000.50), ("Bala", 30, 4000.75), ("Kavitha", 28, 3500.25)]
columns = ["Name", "Age", "Salary"]

df = spark.createDataFrame(data, columns)

# Retrieve data types of all columns
print(df.dtypes)

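Output:

[('Name', 'string'), ('Age', 'bigint'), ('Salary', 'double')]

Because no explicit schema is given, Spark infers the types from the Python values. Python integers are inferred as LongType, which dtypes reports as 'bigint'.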

Example 2: Retrieving Data Types of a DataFrame with Nested Structures

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define schema with nested structures
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Skills", ArrayType(StringType()), True)
])

# Create DataFrame with nested data
data = [("Anand", 25, ["Java", "Python"]), 
        ("Bala", 30, ["Scala", "Spark"]), 
        ("Kavitha", 28, ["SQL", "Hadoop"])]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Name', 'string'), ('Age', 'int'), ('Skills', 'array<string>')]

Example 3: Retrieving Data Types of a DataFrame with Structs

PySpark:

from pyspark.sql.types import StructType, StructField, StringType

# Define schema with structs
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StructType([
        StructField("City", StringType(), True),
        StructField("State", StringType(), True)
    ]), True)
])

# Create DataFrame with struct data
data = [("Anand", ("Chennai", "Tamil Nadu")), 
        ("Bala", ("Bangalore", "Karnataka")), 
        ("Kavitha", ("Hyderabad", "Telangana"))]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Name', 'string'), ('Address', 'struct<City:string,State:string>')]

Example 4: Retrieving Data Types of a DataFrame with Nullable Columns

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema with nullable and non-nullable columns
schema = StructType([
    StructField("Name", StringType(), False),  # Non-nullable
    StructField("Age", IntegerType(), True)    # Nullable
])

# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

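[('Name', 'string'), ('Age', 'int')]

Note that dtypes reports only column names and types. Nullability does not appear in the output; use df.schema or df.printSchema() to inspect it.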

Example 5: Retrieving Data Types of a DataFrame with Timestamps

PySpark:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Define schema with a timestamp column
schema = StructType([
    StructField("Event", StringType(), True),
    StructField("Timestamp", TimestampType(), True)
])

# Create DataFrame (TimestampType expects datetime objects, not strings)
data = [("Login", datetime(2023, 10, 1, 10, 0, 0)),
        ("Logout", datetime(2023, 10, 1, 12, 0, 0))]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Event', 'string'), ('Timestamp', 'timestamp')]

Example 6: Retrieving Data Types of a DataFrame with Maps

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType

# Define schema with a map column
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", MapType(StringType(), IntegerType()), True)
])

# Create DataFrame
data = [("Anand", {"Java": 5, "Python": 3}), 
        ("Bala", {"Scala": 4, "Spark": 2})]
df = spark.createDataFrame(data, schema)

# Retrieve data types of all columns
print(df.dtypes)

Output:

[('Name', 'string'), ('Skills', 'map<string,int>')]
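
In Examples 2, 3, and 6, complex types appear in dtypes as strings that begin with array<, struct<, or map<. A minimal sketch of using that prefix to separate complex columns from simple ones follows. The helper name is_complex and the Ratings column are hypothetical, and the input list is a literal stand-in for df.dtypes so the snippet runs without a Spark session.

```python
# Prefixes that the type strings in the examples above use for complex types.
COMPLEX_PREFIXES = ("array<", "struct<", "map<")

def is_complex(dtype: str) -> bool:
    """Return True when a dtypes type string describes an array, struct, or map."""
    return dtype.startswith(COMPLEX_PREFIXES)

# Stand-in for df.dtypes on a DataFrame mixing simple and complex columns.
dtypes = [
    ("Name", "string"),
    ("Skills", "array<string>"),
    ("Address", "struct<City:string,State:string>"),
    ("Ratings", "map<string,int>"),
]

complex_cols = [name for name, dtype in dtypes if is_complex(dtype)]
print(complex_cols)  # ['Skills', 'Address', 'Ratings']
```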

5. Common Use Cases

Inspecting the schema of a DataFrame after reading data.
Debugging schema mismatches or errors.
Verifying the schema after transformations or joins.
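
For the debugging and verification cases above, one possible approach is to compare the dtypes list you expect against the one you actually get. The sketch below is an illustration only: diff_dtypes is a hypothetical helper, not part of PySpark, and both lists are literal stand-ins for df.dtypes so the snippet runs without a Spark session.

```python
def diff_dtypes(expected, actual):
    """Compare two dtypes-style lists; return missing, extra, and changed columns."""
    exp, act = dict(expected), dict(actual)
    missing = sorted(exp.keys() - act.keys())   # expected but absent
    extra = sorted(act.keys() - exp.keys())     # present but not expected
    changed = {
        name: (exp[name], act[name])            # (expected type, actual type)
        for name in sorted(exp.keys() & act.keys())
        if exp[name] != act[name]
    }
    return missing, extra, changed

expected = [("Name", "string"), ("Age", "int"), ("Salary", "double")]
actual = [("Name", "string"), ("Age", "bigint"), ("Bonus", "double")]

missing, extra, changed = diff_dtypes(expected, actual)
print(missing)  # ['Salary']
print(extra)    # ['Bonus']
print(changed)  # {'Age': ('int', 'bigint')}
```

Since the comparison works on plain strings, it does not touch the DataFrame's rows.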

6. Performance Considerations

Using dtypes is lightweight and does not involve data movement or processing.
It is particularly useful for debugging and understanding the structure of complex DataFrames.

7. Key Takeaways

Purpose: The dtypes attribute is used to retrieve the schema of a DataFrame in the form of a list of tuples.
Schema Inspection: It provides a quick way to inspect the schema and data types of all columns.
Efficiency: dtypes is a metadata operation and does not involve data processing, making it very efficient.
Spark SQL: Similar functionality can be achieved using DESCRIBE table_name.
