The dtypes attribute of a Spark DataFrame returns the schema as a list of tuples, where each tuple holds a column name and its data type as a string. It is useful for inspecting the structure of the data and checking the type of each column.
1. Syntax
PySpark:
df.dtypes
Spark SQL:
- There is no direct equivalent in Spark SQL, but you can use DESCRIBE table_name to achieve similar results.
2. Return Type
- Returns a list of tuples, where each tuple contains:
- Column Name: The name of the column.
- Data Type: The data type of the column as a string (e.g., string, int, double).
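Because the return value is an ordinary Python list of (name, type) tuples, it can be handled with plain Python. The following is a minimal sketch, assuming a literal list copied from the Example 2 output in place of a live df.dtypes call; the variable name type_by_column is illustrative.

```python
# Stand-in for df.dtypes; the literal mirrors the output of Example 2.
dtypes = [("Name", "string"), ("Age", "int"), ("Skills", "array<string>")]

# A list of 2-tuples converts directly into a name -> type mapping.
type_by_column = dict(dtypes)

print(type_by_column["Age"])   # int
print(list(type_by_column))    # ['Name', 'Age', 'Skills']
```

With a real DataFrame, replacing the literal with df.dtypes should give the same kind of mapping.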
3. Key Features
- Schema Inspection: Provides a quick way to inspect the schema of a DataFrame.
- Data Types: Lists the data types of all columns in the DataFrame.
- Efficient: It is a metadata operation and does not involve data processing.
4. Examples
Example 1: Retrieving Data Types of All Columns
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DtypesExample").getOrCreate()
# Create DataFrame
data = [("Anand", 25, 3000.50), ("Bala", 30, 4000.75), ("Kavitha", 28, 3500.25)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)
# Retrieve data types of all columns
# (without an explicit schema, Python ints are inferred as bigint)
print(df.dtypes)
Output:
[('Name', 'string'), ('Age', 'bigint'), ('Salary', 'double')]
Example 2: Retrieving Data Types of a DataFrame with Nested Structures
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
# Define schema with nested structures
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Skills", ArrayType(StringType()), True)
])
# Create DataFrame with nested data
data = [("Anand", 25, ["Java", "Python"]),
        ("Bala", 30, ["Scala", "Spark"]),
        ("Kavitha", 28, ["SQL", "Hadoop"])]
df = spark.createDataFrame(data, schema)
# Retrieve data types of all columns
print(df.dtypes)
Output:
[('Name', 'string'), ('Age', 'int'), ('Skills', 'array<string>')]
Example 3: Retrieving Data Types of a DataFrame with Structs
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define schema with structs
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StructType([
        StructField("City", StringType(), True),
        StructField("State", StringType(), True)
    ]), True)
])
# Create DataFrame with struct data
data = [("Anand", ("Chennai", "Tamil Nadu")),
        ("Bala", ("Bangalore", "Karnataka")),
        ("Kavitha", ("Hyderabad", "Telangana"))]
df = spark.createDataFrame(data, schema)
# Retrieve data types of all columns
print(df.dtypes)
Output:
[('Name', 'string'), ('Address', 'struct<City:string,State:string>')]
Example 4: Retrieving Data Types of a DataFrame with Nullable Columns
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define schema with nullable and non-nullable columns
schema = StructType([
    StructField("Name", StringType(), False),  # Non-nullable
    StructField("Age", IntegerType(), True)    # Nullable
])
# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
df = spark.createDataFrame(data, schema)
# Retrieve data types of all columns (dtypes does not report nullability)
print(df.dtypes)
Output:
[('Name', 'string'), ('Age', 'int')]
Example 5: Retrieving Data Types of a DataFrame with Timestamps
PySpark:
from datetime import datetime
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
# Define schema with a timestamp column
schema = StructType([
    StructField("Event", StringType(), True),
    StructField("Timestamp", TimestampType(), True)
])
# Create DataFrame (TimestampType expects datetime objects, not strings)
data = [("Login", datetime(2023, 10, 1, 10, 0, 0)),
        ("Logout", datetime(2023, 10, 1, 12, 0, 0))]
df = spark.createDataFrame(data, schema)
# Retrieve data types of all columns
print(df.dtypes)
Output:
[('Event', 'string'), ('Timestamp', 'timestamp')]
Example 6: Retrieving Data Types of a DataFrame with Maps
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType
# Define schema with a map column
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", MapType(StringType(), IntegerType()), True)
])
# Create DataFrame
data = [("Anand", {"Java": 5, "Python": 3}),
        ("Bala", {"Scala": 4, "Spark": 2})]
df = spark.createDataFrame(data, schema)
# Retrieve data types of all columns
print(df.dtypes)
Output:
[('Name', 'string'), ('Skills', 'map<string,int>')]
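Arrays, structs, and maps all appear in dtypes as strings with a recognizable prefix, so complex columns can be picked out by string matching. This is a rough sketch under stated assumptions: the literal list stands in for df.dtypes, the Ratings column and the helper complex_columns are hypothetical names introduced for illustration, and the type strings follow the shapes shown in Examples 2, 3, and 6.

```python
# Stand-in for df.dtypes; column names are illustrative, and the type
# strings follow the shapes printed in Examples 2, 3, and 6.
dtypes = [
    ("Name", "string"),
    ("Skills", "array<string>"),
    ("Address", "struct<City:string,State:string>"),
    ("Ratings", "map<string,int>"),
]

COMPLEX_PREFIXES = ("array<", "struct<", "map<")

def complex_columns(dtypes):
    """Return the names of columns whose type string is an array, struct, or map."""
    return [name for name, dtype in dtypes if dtype.startswith(COMPLEX_PREFIXES)]

print(complex_columns(dtypes))  # ['Skills', 'Address', 'Ratings']
```

Prefix matching only looks at the outermost type, which is usually what is needed when deciding which columns to flatten or explode.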
5. Common Use Cases
- Inspecting the schema of a DataFrame after reading data.
- Debugging schema mismatches or errors.
- Verifying the schema after transformations or joins.
6. Performance Considerations
- Using dtypes is lightweight and does not involve data movement or processing.
- It is particularly useful for debugging and understanding the structure of complex DataFrames.
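For the schema-mismatch debugging and post-transformation verification uses listed earlier, two dtypes lists can be compared directly. The sketch below is one possible approach, not a built-in Spark feature: diff_dtypes is a hypothetical helper, and both lists are hand-written stand-ins for expected_df.dtypes and actual_df.dtypes.

```python
def diff_dtypes(expected, actual):
    """Compare two dtypes-style lists and describe every difference found."""
    expected_map, actual_map = dict(expected), dict(actual)
    problems = []
    # Columns that are missing or have a different type than expected.
    for name, dtype in expected_map.items():
        if name not in actual_map:
            problems.append(f"missing column: {name}")
        elif actual_map[name] != dtype:
            problems.append(f"{name}: expected {dtype}, got {actual_map[name]}")
    # Columns present in the actual schema but not in the expected one.
    for name in actual_map:
        if name not in expected_map:
            problems.append(f"unexpected column: {name}")
    return problems

# Hand-written stand-ins for two DataFrames' dtypes.
expected = [("Name", "string"), ("Age", "int"), ("Salary", "double")]
actual = [("Name", "string"), ("Age", "bigint"), ("Bonus", "double")]

for line in diff_dtypes(expected, actual):
    print(line)
```

Note that dict() keeps only one entry per column name, so this sketch assumes unique names; a DataFrame produced by a join can contain duplicates and would need a different comparison.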
7. Key Takeaways
- Purpose: The dtypes attribute is used to retrieve the schema of a DataFrame in the form of a list of tuples.
- Schema Inspection: It provides a quick way to inspect the schema and data types of all columns.
- Efficient: dtypes is a metadata operation and does not involve data processing, making it very efficient.
- In Spark SQL, similar functionality can be achieved using DESCRIBE table_name.