The describe() function in Spark is used to compute summary statistics for numerical and string columns in a DataFrame. It provides a quick way to understand the distribution of data, including count, mean, standard deviation, minimum, and maximum values. This is particularly useful for exploratory data analysis (EDA) and data profiling.
1. Syntax
PySpark:
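# General form: compute summary statistics for the given columns
# (or for all numerical and string columns if none are given)
df.describe(*cols)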
Spark SQL:
- There is no direct equivalent in Spark SQL, but you can compute similar statistics using aggregate functions.
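For example, similar statistics for a single numeric column can be assembled from aggregate functions. The sketch below assumes a DataFrame df like the one built in the examples further down, registered under the hypothetical view name "people":
# Expose the DataFrame to Spark SQL under a temporary view name
df.createOrReplaceTempView("people")
# Rough equivalent of df.describe("Age") using SQL aggregate functions
spark.sql("""
    SELECT COUNT(Age)  AS count,
           AVG(Age)    AS mean,
           STDDEV(Age) AS stddev,
           MIN(Age)    AS min,
           MAX(Age)    AS max
    FROM people
""").show()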
2. Parameters
- cols: Optional column names (as strings) to describe, passed either as separate arguments or as a single list. If no columns are specified, statistics are computed for all numerical and string columns.
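As a quick sketch of the two call styles (column names taken from the examples further down), both of the following are intended to be equivalent:
df.describe("Age", "Salary").show()    # columns as separate arguments
df.describe(["Age", "Salary"]).show()  # columns as a single list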
3. Key Features
- Summary Statistics: Computes count, mean, standard deviation, minimum, and maximum values for numerical columns.
- String Columns: For string columns, it reports the count and the lexicographic minimum and maximum values; mean and standard deviation are returned as null.
- Efficient: It is optimized for large datasets and works in a distributed manner.
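Because describe() returns an ordinary DataFrame, its result can be processed further, for example to pull out a single statistic. A minimal sketch, using the df created in the examples below:
stats = df.describe("Age", "Salary")     # 'stats' is itself a DataFrame with a 'summary' column
stats.filter("summary = 'mean'").show()  # keep only the mean row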
4. Examples
Example 1: Describing All Columns
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DescribeExample").getOrCreate()
# Create DataFrame
data = [("Anand", 25, 3000), ("Bala", 30, 4000), ("Kavitha", 28, 3500), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)
# Describe all columns
df.describe().show()
Output:
+-------+------+------------------+------------------+
|summary| Name| Age| Salary|
+-------+------+------------------+------------------+
| count| 4| 4| 4|
| mean| null| 29.5| 3750.0|
| stddev|  null|       4.203173404| 645.4972243679028|
| min| Anand| 25| 3000|
| max| Raj| 35| 4500|
+-------+------+------------------+------------------+
Example 2: Describing Specific Columns
PySpark:
# Describe only the 'Age' and 'Salary' columns
df.describe("Age", "Salary").show()
Output:
+-------+------------------+------------------+
|summary| Age| Salary|
+-------+------------------+------------------+
| count| 4| 4|
| mean| 29.5| 3750.0|
| stddev|       4.203173404| 645.4972243679028|
| min| 25| 3000|
| max| 35| 4500|
+-------+------------------+------------------+
Example 3: Describing String Columns
PySpark:
# Describe only the 'Name' column (string column)
df.describe("Name").show()
Output:
+-------+-----+
|summary| Name|
+-------+-----+
| count| 4|
| mean| null|
| stddev| null|
| min|Anand|
| max| Raj|
+-------+-----+
Example 4: Describing a DataFrame with Null Values
PySpark:
# Create DataFrame with null values
data = [("Anand", 25, 3000), ("Bala", None, 4000), ("Kavitha", 28, None), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)
# Describe all columns
df.describe().show()
Output:
+-------+------+------------------+------------------+
|summary| Name| Age| Salary|
+-------+------+------------------+------------------+
| count| 4| 3| 3|
| mean| null|29.333333333333332|3833.3333333333335|
| stddev|  null|       5.131601439| 763.7626158259734|
| min| Anand| 25| 3000|
| max| Raj| 35| 4500|
+-------+------+------------------+------------------+
Example 5: Describing a DataFrame with Nested Structures
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
# Define schema with nested structures
schema = StructType([
StructField("Name", StringType(), True),
StructField("Age", IntegerType(), True),
StructField("Skills", ArrayType(StringType()), True)
])
# Create DataFrame with nested data
data = [("Anand", 25, ["Java", "Python"]),
("Bala", 30, ["Scala", "Spark"]),
("Kavitha", 28, ["SQL", "Hadoop"])]
df = spark.createDataFrame(data, schema)
# Describe all columns
df.describe().show()
Output:
+-------+-------+------------------+
|summary|   Name|               Age|
+-------+-------+------------------+
|  count|      3|                 3|
|   mean|   null|27.666666666666668|
| stddev|   null|         2.5166115|
|    min|  Anand|                25|
|    max|Kavitha|                30|
+-------+-------+------------------+
Note: the array-typed Skills column does not appear in the output, because describe() only covers numeric and string columns.
Example 6: Describing a DataFrame with Timestamps
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
# Define schema with a timestamp column
schema = StructType([
StructField("Event", StringType(), True),
StructField("Timestamp", TimestampType(), True)
])
from datetime import datetime
# Create DataFrame (TimestampType columns need Python datetime objects, not strings)
data = [("Login", datetime(2023, 10, 1, 10, 0, 0)),
        ("Logout", datetime(2023, 10, 1, 12, 0, 0))]
df = spark.createDataFrame(data, schema)
# Describe all columns
df.describe().show()
Output:
+-------+------+
|summary| Event|
+-------+------+
|  count|     2|
|   mean|  null|
| stddev|  null|
|    min| Login|
|    max|Logout|
+-------+------+
Note: the Timestamp column is skipped, because describe() only covers numeric and string columns.
Example 7: Describing a DataFrame with Maps
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType
# Define schema with a map column
schema = StructType([
StructField("Name", StringType(), True),
StructField("Skills", MapType(StringType(), IntegerType()), True)
])
# Create DataFrame
data = [("Anand", {"Java": 5, "Python": 3}),
("Bala", {"Scala": 4, "Spark": 2})]
df = spark.createDataFrame(data, schema)
# Describe all columns
df.describe().show()
Output:
+-------+-----+
|summary| Name|
+-------+-----+
|  count|    2|
|   mean| null|
| stddev| null|
|    min|Anand|
|    max| Bala|
+-------+-----+
Note: as with arrays and timestamps, the map-typed Skills column is skipped because describe() only covers numeric and string columns.
5. Common Use Cases
- Understanding the distribution of numerical data (e.g., mean, standard deviation).
- Profiling string data (e.g., unique values, top frequent values).
- Identifying missing values or outliers.
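For instance, since the count row of describe() excludes nulls, comparing it with the total row count gives a quick per-column missing-value check. A minimal sketch, assuming the count row comes first as in the outputs above:
total = df.count()
counts = df.describe().collect()[0].asDict()   # the first row holds the non-null counts
missing = {c: total - int(n) for c, n in counts.items() if c != "summary"}
print(missing)   # with the Example 4 data: {'Name': 0, 'Age': 1, 'Salary': 1}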
6. Performance Considerations
- describe() is efficient for large datasets because it computes statistics in a distributed manner.
- Use it judiciously for very wide DataFrames (many columns), as it processes all specified columns.
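One way to limit the work on a very wide DataFrame is to describe only the columns you care about, for example just the numeric ones picked out of the schema. A small sketch:
from pyspark.sql.types import NumericType
# Collect the names of numeric columns only
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
# Describe just those columns instead of the whole DataFrame
df.describe(*numeric_cols).show()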
7. Key Takeaways
- Purpose: The describe() function is used to compute summary statistics for numerical and string columns in a DataFrame.
- Summary Statistics: It provides count, mean, standard deviation, minimum, and maximum values for numerical columns.
- String Columns: For string columns, it reports count plus lexicographic minimum and maximum values; mean and standard deviation are null (see Example 3).