Spark: describe function
The describe() function in Spark is used to compute summary statistics for numerical and string columns in a DataFrame. It provides a quick way to understand the distribution of data, including count, mean, standard deviation, minimum, and maximum values. This is particularly useful for exploratory data analysis (EDA) and data profiling.
1. Syntax
PySpark: df.describe(*cols)
Spark SQL:
- There is no direct equivalent in Spark SQL, but you can compute the same statistics with the aggregate functions COUNT, AVG, STDDEV, MIN, and MAX.
2. Parameters
- cols: Column names (as strings) to describe, passed either as separate arguments or as a single list. If no columns are specified, it computes statistics for all numerical and string columns.
3. Key Features
- Summary Statistics: Computes count, mean, standard deviation, minimum, and maximum values for numerical columns.
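- String Columns: For string columns, it computes count, minimum, and maximum (compared lexicographically). Mean and standard deviation are returned as null for non-numeric strings. It does not report unique values or the most frequent value.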
- Distributed: The statistics are computed as Spark aggregations across partitions, so it scales to large datasets.
4. Examples
Example 1: Describing All Columns
PySpark:
Output:
Example 2: Describing Specific Columns
PySpark:
Output:
Example 3: Describing String Columns
PySpark:
Output:
Example 4: Describing a DataFrame with Null Values
PySpark:
Output:
Example 5: Describing a DataFrame with Nested Structures
PySpark:
Output:
Example 6: Describing a DataFrame with Timestamps
PySpark:
Output:
Example 7: Describing a DataFrame with Maps
PySpark:
Output:
5. Common Use Cases
- Understanding the distribution of numerical data (e.g., mean, standard deviation).
- Profiling string data (e.g., non-null count, lexicographic minimum and maximum).
- Identifying missing values or outliers.
6. Performance Considerations
- describe() is efficient for large datasets as it computes statistics in a distributed manner.
- Use it judiciously for very wide DataFrames (many columns), as it processes all specified columns.
7. Key Takeaways
- Purpose: The describe() function is used to compute summary statistics for numerical and string columns in a DataFrame.
- Summary Statistics: It provides count, mean, standard deviation, minimum, and maximum values for numerical columns.
- String Columns: For string columns, it computes count, minimum, and maximum (compared lexicographically). Mean and standard deviation are null for non-numeric strings.