The describe() function in Spark is used to compute summary statistics for numerical and string columns in a DataFrame. It provides a quick way to understand the distribution of data, including count, mean, standard deviation, minimum, and maximum values. This is particularly useful for exploratory data analysis (EDA) and data profiling.
1. Syntax
PySpark:
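# General form: compute summary statistics for the given columns
# (or for all numerical and string columns if none are given)
df.describe(*cols)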
Spark SQL:
- There is no direct equivalent in Spark SQL, but you can compute similar statistics using aggregate functions.
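For example, similar statistics for a single numeric column can be assembled from aggregate functions. The sketch below assumes a DataFrame df like the one built in the examples further down, registered under the hypothetical view name "people":
# Expose the DataFrame to Spark SQL under a temporary view name
df.createOrReplaceTempView("people")
# Rough equivalent of df.describe("Age") using SQL aggregate functions
spark.sql("""
    SELECT COUNT(Age)  AS count,
           AVG(Age)    AS mean,
           STDDEV(Age) AS stddev,
           MIN(Age)    AS min,
           MAX(Age)    AS max
    FROM people
""").show()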
2. Parameters
- cols: Optional column names (as strings) to describe, passed either as separate arguments or as a single list. If no columns are specified, statistics are computed for all numerical and string columns.
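As a quick sketch of the two call styles (column names taken from the examples further down), both of the following are intended to be equivalent:
df.describe("Age", "Salary").show()    # columns as separate arguments
df.describe(["Age", "Salary"]).show()  # columns as a single list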
3. Key Features
- Summary Statistics: Computes count, mean, standard deviation, minimum, and maximum values for numerical columns.
- String Columns: For string columns, it reports the count and the lexicographic minimum and maximum values; mean and standard deviation are returned as null.
- Efficient: It is optimized for large datasets and works in a distributed manner.
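Because describe() returns an ordinary DataFrame, its result can be processed further, for example to pull out a single statistic. A minimal sketch, using the df created in the examples below:
stats = df.describe("Age", "Salary")     # 'stats' is itself a DataFrame with a 'summary' column
stats.filter("summary = 'mean'").show()  # keep only the mean row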
4. Examples
Example 1: Describing All Columns
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DescribeExample").getOrCreate()
# Create DataFrame
data = [("Anand", 25, 3000), ("Bala", 30, 4000), ("Kavitha", 28, 3500), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)
# Describe all columns
df.describe().show()
Output:
+-------+------+------------------+------------------+
|summary| Name| Age| Salary|
+-------+------+------------------+------------------+
| count| 4| 4| 4|
| mean| null| 29.5| 3750.0|
| stddev|  null|       4.203173404| 645.4972243679028|
| min| Anand| 25| 3000|
| max| Raj| 35| 4500|
+-------+------+------------------+------------------+
Example 2: Describing Specific Columns
PySpark:
# Describe only the 'Age' and 'Salary' columns
df.describe("Age", "Salary").show()
Output:
+-------+------------------+------------------+
|summary| Age| Salary|
+-------+------------------+------------------+
| count| 4| 4|
| mean| 29.5| 3750.0|
| stddev|       4.203173404| 645.4972243679028|
| min| 25| 3000|
| max| 35| 4500|
+-------+------------------+------------------+
Example 3: Describing String Columns
PySpark:
# Describe only the 'Name' column (string column)
df.describe("Name").show()
Output:
+-------+-----+
|summary| Name|
+-------+-----+
| count| 4|
| mean| null|
| stddev| null|
| min|Anand|
| max| Raj|
+-------+-----+
Example 4: Describing a DataFrame with Null Values
PySpark:
# Create DataFrame with null values
data = [("Anand", 25, 3000), ("Bala", None, 4000), ("Kavitha", 28, None), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)
# Describe all columns
df.describe().show()
Output:
+-------+------+------------------+------------------+
|summary| Name| Age| Salary|
+-------+------+------------------+------------------+
| count| 4| 3| 3|
| mean| null|29.333333333333332|3833.3333333333335|
| stddev|  null|       5.131601439| 763.7626158259734|
| min| Anand| 25| 3000|
| max| Raj| 35| 4500|
+-------+------+------------------+------------------+
Example 5: Describing a DataFrame with Nested Structures
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
# Define schema with nested structures
schema = StructType([
StructField("Name", StringType(), True),
StructField("Age", IntegerType(), True),
StructField("Skills", ArrayType(StringType()), True)
])
# Create DataFrame with nested data
data = [("Anand", 25, ["Java", "Python"]),
("Bala", 30, ["Scala", "Spark"]),
("Kavitha", 28, ["SQL", "Hadoop"])]
df = spark.createDataFrame(data, schema)
# Describe all columns
df.describe().show()
Output:
+-------+-------+------------------+
|summary|   Name|               Age|
+-------+-------+------------------+
|  count|      3|                 3|
|   mean|   null|27.666666666666668|
| stddev|   null|         2.5166115|
|    min|  Anand|                25|
|    max|Kavitha|                30|
+-------+-------+------------------+
Note: the array-typed Skills column does not appear in the output, because describe() only covers numeric and string columns.
Example 6: Describing a DataFrame with Timestamps
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
# Define schema with a timestamp column
schema = StructType([
StructField("Event", StringType(), True),
StructField("Timestamp", TimestampType(), True)
])
from datetime import datetime
# Create DataFrame (TimestampType columns need Python datetime objects, not strings)
data = [("Login", datetime(2023, 10, 1, 10, 0, 0)),
        ("Logout", datetime(2023, 10, 1, 12, 0, 0))]
df = spark.createDataFrame(data, schema)
# Describe all columns
df.describe().show()
Output:
+-------+------+
|summary| Event|
+-------+------+
|  count|     2|
|   mean|  null|
| stddev|  null|
|    min| Login|
|    max|Logout|
+-------+------+
Note: the Timestamp column is skipped, because describe() only covers numeric and string columns.
Example 7: Describing a DataFrame with Maps
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType
# Define schema with a map column
schema = StructType([
StructField("Name", StringType(), True),
StructField("Skills", MapType(StringType(), IntegerType()), True)
])
# Create DataFrame
data = [("Anand", {"Java": 5, "Python": 3}),
("Bala", {"Scala": 4, "Spark": 2})]
df = spark.createDataFrame(data, schema)
# Describe all columns
df.describe().show()
Output:
+-------+-----+
|summary| Name|
+-------+-----+
|  count|    2|
|   mean| null|
| stddev| null|
|    min|Anand|
|    max| Bala|
+-------+-----+
Note: as with arrays and timestamps, the map-typed Skills column is skipped because describe() only covers numeric and string columns.
5. Common Use Cases
- Understanding the distribution of numerical data (e.g., mean, standard deviation).
- Profiling string data (e.g., unique values, top frequent values).
- Identifying missing values or outliers.
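For instance, since the count row of describe() excludes nulls, comparing it with the total row count gives a quick per-column missing-value check. A minimal sketch, assuming the count row comes first as in the outputs above:
total = df.count()
counts = df.describe().collect()[0].asDict()   # the first row holds the non-null counts
missing = {c: total - int(n) for c, n in counts.items() if c != "summary"}
print(missing)   # with the Example 4 data: {'Name': 0, 'Age': 1, 'Salary': 1}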
6. Performance Considerations
- describe() is efficient for large datasets because it computes statistics in a distributed manner.
- Use it judiciously for very wide DataFrames (many columns), as it processes all specified columns.
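One way to limit the work on a very wide DataFrame is to describe only the columns you care about, for example just the numeric ones picked out of the schema. A small sketch:
from pyspark.sql.types import NumericType
# Collect the names of numeric columns only
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
# Describe just those columns instead of the whole DataFrame
df.describe(*numeric_cols).show()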
7. Key Takeaways
- Purpose: The describe() function is used to compute summary statistics for numerical and string columns in a DataFrame.
- Summary Statistics: It provides count, mean, standard deviation, minimum, and maximum values for numerical columns.
- String Columns: For string columns, it reports count plus lexicographic minimum and maximum values; mean and standard deviation are null (see Example 3).