Spark: dtypes attribute
The `dtypes` attribute in Spark is used to retrieve the schema of a DataFrame in the form of a list of tuples, where each tuple contains a column name and its corresponding data type. This is particularly useful for inspecting the structure of the data and understanding the data type of each column.
1. Syntax
PySpark:
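Accessed as an attribute on an existing DataFrame (here named `df`):

```python
df.dtypes   # returns a list of (column_name, data_type) tuples
```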
Spark SQL:
- There is no direct equivalent in Spark SQL, but you can use `DESCRIBE table_name` to achieve similar results.
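A quick sketch of that SQL route, run from PySpark against a hypothetical temporary view named `people`:

```python
# Expose a DataFrame to Spark SQL as a temporary view, then describe it
df.createOrReplaceTempView("people")
spark.sql("DESCRIBE people").show()
```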
2. Return Type
- Returns a list of tuples, where each tuple contains:
  - Column Name: The name of the column.
  - Data Type: The data type of the column (e.g., `string`, `int`, `double`).
3. Key Features
- Schema Inspection: Provides a quick way to inspect the schema of a DataFrame.
- Data Types: Lists the data types of all columns in the DataFrame.
- Efficient: It is a metadata operation and does not involve data processing.
4. Examples
Example 1: Retrieving Data Types of All Columns
PySpark:
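A minimal sketch; the column names and sample rows are illustrative assumptions, and the output below is what `dtypes` returns for this data. Later examples reuse the `spark` session created here.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dtypes_examples").getOrCreate()

# Illustrative sample data: name, age, salary
data = [("Alice", 30, 5000.0), ("Bob", 25, 4500.0)]
df = spark.createDataFrame(data, ["name", "age", "salary"])

print(df.dtypes)
```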
Output:
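```
[('name', 'string'), ('age', 'bigint'), ('salary', 'double')]
```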
Example 2: Retrieving Data Types of a DataFrame with Nested Structures
PySpark:
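A sketch that uses an array column as the nested type; the data is illustrative and the `spark` session from Example 1 is assumed to be available:

```python
# Illustrative sample data with an array (nested) column
data = [("Alice", [85, 90, 95]), ("Bob", [70, 80])]
df = spark.createDataFrame(data, ["name", "scores"])

print(df.dtypes)
```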
Output:
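```
[('name', 'string'), ('scores', 'array<bigint>')]
```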
Example 3: Retrieving Data Types of a DataFrame with Structs
PySpark:
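A sketch with an explicit struct schema; field names and values are illustrative, and the `spark` session from Example 1 is assumed:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("zip", IntegerType(), True),
    ]), True),
])

data = [("Alice", ("Amsterdam", 1012)), ("Bob", ("Berlin", 10115))]
df = spark.createDataFrame(data, schema)

print(df.dtypes)
```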
Output:
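```
[('name', 'string'), ('address', 'struct<city:string,zip:int>')]
```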
Example 4: Retrieving Data Types of a DataFrame with Nullable Columns
PySpark:
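Note that `dtypes` itself does not report nullability; `printSchema()` (or `df.schema`) is needed for that. A sketch with explicitly nullable columns, illustrative data, and the `spark` session from Example 1:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), nullable=True),
    StructField("age", IntegerType(), nullable=True),
])

data = [("Alice", 30), ("Bob", None)]
df = spark.createDataFrame(data, schema)

print(df.dtypes)   # data types only; nullability is not included
df.printSchema()   # shows nullability
```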
Output:
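```
[('name', 'string'), ('age', 'int')]
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
```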
Example 5: Retrieving Data Types of a DataFrame with Timestamps
PySpark:
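A sketch with a timestamp column built from Python `datetime` values; the data is illustrative and the `spark` session from Example 1 is assumed:

```python
from datetime import datetime

# Illustrative sample data with a timestamp column
data = [("login", datetime(2024, 1, 1, 12, 0, 0)),
        ("logout", datetime(2024, 1, 1, 13, 30, 0))]
df = spark.createDataFrame(data, ["event", "event_time"])

print(df.dtypes)
```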
Output:
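```
[('event', 'string'), ('event_time', 'timestamp')]
```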
Example 6: Retrieving Data Types of a DataFrame with Maps
PySpark:
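A sketch with a map column inferred from Python dictionaries; the data is illustrative and the `spark` session from Example 1 is assumed:

```python
# Illustrative sample data with a map column
data = [("Alice", {"math": 90, "physics": 85}), ("Bob", {"math": 75})]
df = spark.createDataFrame(data, ["name", "grades"])

print(df.dtypes)
```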
Output:
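```
[('name', 'string'), ('grades', 'map<string,bigint>')]
```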
5. Common Use Cases
- Inspecting the schema of a DataFrame after reading data.
- Debugging schema mismatches or errors.
- Verifying the schema after transformations or joins.
6. Performance Considerations
- Using `dtypes` is lightweight and does not involve data movement or processing.
- It is particularly useful for debugging and understanding the structure of complex DataFrames.
7. Key Takeaways
- Purpose: The `dtypes` attribute is used to retrieve the schema of a DataFrame in the form of a list of tuples.
- Schema Inspection: It provides a quick way to inspect the schema and data types of all columns.
- Efficiency: `dtypes` is a metadata operation and does not involve data processing, making it very efficient.
- In Spark SQL, similar functionality can be achieved using `DESCRIBE table_name`.