The describe() function in Spark is used to compute summary statistics for numerical and string columns in a DataFrame. It provides a quick way to understand the distribution of data, including count, mean, standard deviation, minimum, and maximum values. This is particularly useful for exploratory data analysis (EDA) and data profiling.


1. Syntax

PySpark:

df.describe(*cols)

Spark SQL:

  • There is no direct equivalent in Spark SQL, but similar statistics can be computed with aggregate functions such as COUNT, AVG, STDDEV, MIN, and MAX (see the sketch below).
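
For example, a rough Spark SQL counterpart of df.describe("Age") might look like the following sketch. This is only an illustration: the temporary view name people is assumed here, and df is the DataFrame with an Age column used in the examples later in this section.

df.createOrReplaceTempView("people")

spark.sql("""
    SELECT COUNT(Age)  AS count_age,
           AVG(Age)    AS mean_age,
           STDDEV(Age) AS stddev_age,
           MIN(Age)    AS min_age,
           MAX(Age)    AS max_age
    FROM people
""").show()

STDDEV in Spark SQL is the sample standard deviation, which matches what describe() reports.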

2. Parameters

  • cols: One or more column names (as strings), passed as separate arguments. If no columns are specified, statistics are computed for all numerical and string columns.

3. Key Features

  • Summary Statistics: Computes count, mean, standard deviation, minimum, and maximum values for numerical columns.
  • String Columns: For string columns, it reports the non-null count and the lexicographically smallest (min) and largest (max) values; mean and standard deviation come back as null.
  • Other Column Types: Columns that are neither numeric nor string (for example arrays, maps, and timestamps) are skipped.
  • Distributed: The statistics are computed with Spark's distributed aggregation engine, so describe() scales to large datasets.

4. Examples

Example 1: Describing All Columns

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DescribeExample").getOrCreate()

# Create DataFrame
data = [("Anand", 25, 3000), ("Bala", 30, 4000), ("Kavitha", 28, 3500), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]

df = spark.createDataFrame(data, columns)

# Describe all columns
df.describe().show()

Output:

+-------+------+------------------+------------------+
|summary|  Name|               Age|            Salary|
+-------+------+------------------+------------------+
|  count|     4|                 4|                 4|
|   mean|  null|              29.5|            3750.0|
| stddev|  null| 4.203173404306164| 645.4972243679028|
|    min| Anand|                25|              3000|
|    max|   Raj|                35|              4500|
+-------+------+------------------+------------------+

Example 2: Describing Specific Columns

PySpark:

# Describe only the 'Age' and 'Salary' columns
df.describe("Age", "Salary").show()

Output:

+-------+------------------+------------------+
|summary|               Age|            Salary|
+-------+------------------+------------------+
|  count|                 4|                 4|
|   mean|              29.5|            3750.0|
| stddev| 4.203173404306164| 645.4972243679028|
|    min|                25|              3000|
|    max|                35|              4500|
+-------+------------------+------------------+

Example 3: Describing String Columns

PySpark:

# Describe only the 'Name' column (string column)
df.describe("Name").show()

Output:

+-------+-----+
|summary| Name|
+-------+-----+
|  count|    4|
|   mean| null|
| stddev| null|
|    min|Anand|
|    max|  Raj|
+-------+-----+

Example 4: Describing a DataFrame with Null Values

PySpark:

# Create DataFrame with null values
data = [("Anand", 25, 3000), ("Bala", None, 4000), ("Kavitha", 28, None), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]

df = spark.createDataFrame(data, columns)

# Describe all columns
df.describe().show()

Output:

+-------+------+------------------+------------------+
|summary|  Name|               Age|            Salary|
+-------+------+------------------+------------------+
|  count|     4|                 3|                 3|
|   mean|  null|29.333333333333332|3833.3333333333335|
| stddev|  null|    5.131601439447| 763.7626158259734|
|    min| Anand|                25|              3000|
|    max|   Raj|                35|              4500|
+-------+------+------------------+------------------+

Example 5: Describing a DataFrame with Nested Structures

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define schema with nested structures
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Skills", ArrayType(StringType()), True)
])

# Create DataFrame with nested data
data = [("Anand", 25, ["Java", "Python"]), 
        ("Bala", 30, ["Scala", "Spark"]), 
        ("Kavitha", 28, ["SQL", "Hadoop"])]
df = spark.createDataFrame(data, schema)

# Describe all columns
df.describe().show()

Output:

+-------+-------+------------------+
|summary|   Name|               Age|
+-------+-------+------------------+
|  count|      3|                 3|
|   mean|   null|27.666666666666668|
| stddev|   null|    2.516611478423|
|    min|  Anand|                25|
|    max|Kavitha|                30|
+-------+-------+------------------+

Note: The Skills column does not appear in the result. describe() only computes statistics for numeric and string columns, so complex types such as arrays are silently skipped.

Example 6: Describing a DataFrame with Timestamps

PySpark:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Define schema with a timestamp column
schema = StructType([
    StructField("Event", StringType(), True),
    StructField("Timestamp", TimestampType(), True)
])

# Create DataFrame (TimestampType expects Python datetime objects, not strings)
data = [("Login", datetime(2023, 10, 1, 10, 0, 0)),
        ("Logout", datetime(2023, 10, 1, 12, 0, 0))]
df = spark.createDataFrame(data, schema)

# Describe all columns
df.describe().show()

Output:

+-------+------+
|summary| Event|
+-------+------+
|  count|     2|
|   mean|  null|
| stddev|  null|
|    min| Login|
|    max|Logout|
+-------+------+

Note: The Timestamp column is skipped because it is neither numeric nor string; only the Event column is described.
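
If you do need the earliest and latest timestamps, they can be computed directly with aggregate functions instead; a minimal sketch using the df defined above:

from pyspark.sql import functions as F

# describe() skips the timestamp column, but min/max can still be
# computed with an explicit aggregation.
df.agg(
    F.min("Timestamp").alias("min_ts"),
    F.max("Timestamp").alias("max_ts")
).show(truncate=False)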

Example 7: Describing a DataFrame with Maps

PySpark:

from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType

# Define schema with a map column
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Skills", MapType(StringType(), IntegerType()), True)
])

# Create DataFrame
data = [("Anand", {"Java": 5, "Python": 3}), 
        ("Bala", {"Scala": 4, "Spark": 2})]
df = spark.createDataFrame(data, schema)

# Describe all columns
df.describe().show()

Output:

+-------+-----+
|summary| Name|
+-------+-----+
|  count|    2|
|   mean| null|
| stddev| null|
|    min|Anand|
|    max| Bala|
+-------+-----+

Note: The Skills map column is skipped for the same reason as the array and timestamp columns above; only the string column Name is described.

5. Common Use Cases

  • Understanding the distribution of numerical data (e.g., mean, standard deviation).
  • Profiling string data (e.g., non-null counts and lexicographic min/max values).
  • Identifying missing values or outliers (see the sketch after this list).
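
Because describe() returns an ordinary DataFrame, its count row can be compared with the total row count to find columns that contain nulls. A minimal sketch, with df taken from the examples above:

# Collect the 'count' row from describe() and compare it with the
# total number of rows to spot columns containing nulls.
total_rows = df.count()
count_row = df.describe().filter("summary = 'count'").collect()[0].asDict()

for col_name, value in count_row.items():
    if col_name != "summary" and int(value) < total_rows:
        print(f"{col_name}: {total_rows - int(value)} missing value(s)")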

6. Performance Considerations

  • describe() computes its statistics with a distributed aggregation, so it scales to large datasets.
  • Use it judiciously on very wide DataFrames (many columns): every selected column adds aggregation work, so it often pays to describe only the columns you actually need (see the sketch after this list).
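
For example, you might restrict describe() to a handful of numeric columns rather than profiling an entire wide DataFrame at once. A minimal sketch; the set of type names checked here is an assumption covering only the common numeric types:

# Pick out numeric columns from the schema and describe only the
# first few, instead of profiling every column of a wide DataFrame.
numeric_types = {"integer", "long", "float", "double", "decimal"}
numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in numeric_types]

df.describe(*numeric_cols[:5]).show()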

7. Key Takeaways

  1. Purpose: The describe() function is used to compute summary statistics for numerical and string columns in a DataFrame.
  2. Summary Statistics: It provides count, mean, standard deviation, minimum, and maximum values for numerical columns.
  3. String Columns: For string columns, it reports the non-null count and the lexicographic minimum and maximum values, while mean and standard deviation are null.