The explode() function in Spark is used to transform an array or map column into multiple rows. Each element in the array or map becomes a separate row in the resulting DataFrame. This is particularly useful when you have nested data structures (e.g., arrays or maps) and want to flatten them for analysis or processing.


1. Syntax

PySpark:

from pyspark.sql.functions import explode

explode(column)

2. Parameters

  • column: The array or map column to explode.

3. Return Type

  • Returns a Column expression that generates one row per array element (or one row per map entry, as separate key and value columns). Selecting it produces a DataFrame with the exploded rows.
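Conceptually, the row expansion works like a flat-map over the array column. A minimal pure-Python sketch (no Spark required) of what explode() does to an array column:

```python
# Pure-Python sketch of explode() semantics on an array column:
# each (Name, Skills) input row yields one (Name, Skill) output row
# per element of Skills.
rows = [("Anand", ["Java", "Python"]), ("Bala", ["Scala", "Spark"])]
exploded = [(name, skill) for name, skills in rows for skill in skills]
print(exploded)
# [('Anand', 'Java'), ('Anand', 'Python'), ('Bala', 'Scala'), ('Bala', 'Spark')]
```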

4. Examples

Example 1: Exploding an Array Column

PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("ExplodeExample").getOrCreate()

# Create DataFrame with an array column
data = [("Anand", ["Java", "Python"]), 
        ("Bala", ["Scala", "Spark"]), 
        ("Kavitha", ["SQL", "Hadoop"])]
columns = ["Name", "Skills"]

df = spark.createDataFrame(data, columns)

# Explode the 'Skills' array column
exploded_df = df.select("Name", explode("Skills").alias("Skill"))
exploded_df.show()

Spark SQL:

SELECT Name, explode(Skills) AS Skill 
FROM people;

Output:

+-------+------+
|   Name| Skill|
+-------+------+
|  Anand|  Java|
|  Anand|Python|
|   Bala| Scala|
|   Bala| Spark|
|Kavitha|   SQL|
|Kavitha|Hadoop|
+-------+------+

Example 2: Exploding a Map Column

PySpark:

# Create DataFrame with a map column
data = [("Anand", {"Java": 5, "Python": 3}), 
        ("Bala", {"Scala": 4, "Spark": 2}), 
        ("Kavitha", {"SQL": 5, "Hadoop": 1})]
columns = ["Name", "SkillLevel"]

df = spark.createDataFrame(data, columns)

# Explode the 'SkillLevel' map column
exploded_df = df.select("Name", explode("SkillLevel").alias("Skill", "Level"))
exploded_df.show()

Spark SQL:

SELECT Name, explode(SkillLevel) AS (Skill, Level) 
FROM people;

Output:

+-------+------+-----+
|   Name| Skill|Level|
+-------+------+-----+
|  Anand|  Java|    5|
|  Anand|Python|    3|
|   Bala| Scala|    4|
|   Bala| Spark|    2|
|Kavitha|   SQL|    5|
|Kavitha|Hadoop|    1|
+-------+------+-----+

Example 3: Exploding Multiple Array Columns

PySpark:

# Create DataFrame with multiple array columns
data = [("Anand", ["Java", "Python"], ["Data", "Science"]), 
        ("Bala", ["Scala", "Spark"], ["Big", "Data"]), 
        ("Kavitha", ["SQL", "Hadoop"], ["Analytics"])]
columns = ["Name", "Skills", "Domains"]

df = spark.createDataFrame(data, columns)

# Spark allows only one generator such as explode() per select clause,
# so chain two selects to explode both 'Skills' and 'Domains'
exploded_df = df.select("Name", explode("Skills").alias("Skill"), "Domains") \
                .select("Name", "Skill", explode("Domains").alias("Domain"))
exploded_df.show()

Spark SQL (use LATERAL VIEW, since only one explode() is allowed per SELECT list):

SELECT Name, Skill, Domain 
FROM people 
LATERAL VIEW explode(Skills) s AS Skill 
LATERAL VIEW explode(Domains) d AS Domain;

Output:

+-------+------+---------+
|   Name| Skill|   Domain|
+-------+------+---------+
|  Anand|  Java|     Data|
|  Anand|  Java|  Science|
|  Anand|Python|     Data|
|  Anand|Python|  Science|
|   Bala| Scala|      Big|
|   Bala| Scala|     Data|
|   Bala| Spark|      Big|
|   Bala| Spark|     Data|
|Kavitha|   SQL|Analytics|
|Kavitha|Hadoop|Analytics|
+-------+------+---------+

Example 4: Exploding with Other Columns

PySpark:

# Explode 'Skills' while keeping other columns intact
exploded_df = df.select("Name", "Domains", explode("Skills").alias("Skill"))
exploded_df.show()

Spark SQL:

SELECT Name, Domains, explode(Skills) AS Skill 
FROM people;

Output:

+-------+---------------+------+
|   Name|        Domains| Skill|
+-------+---------------+------+
|  Anand|[Data, Science]|  Java|
|  Anand|[Data, Science]|Python|
|   Bala|    [Big, Data]| Scala|
|   Bala|    [Big, Data]| Spark|
|Kavitha|    [Analytics]|   SQL|
|Kavitha|    [Analytics]|Hadoop|
+-------+---------------+------+

Example 5: Exploding with Position Using posexplode()

PySpark:

from pyspark.sql.functions import posexplode

# Explode 'Skills' with position
exploded_df = df.select("Name", posexplode("Skills").alias("Position", "Skill"))
exploded_df.show()

Spark SQL:

SELECT Name, posexplode(Skills) AS (Position, Skill) 
FROM people;

Output:

+-------+--------+------+
|   Name|Position| Skill|
+-------+--------+------+
|  Anand|       0|  Java|
|  Anand|       1|Python|
|   Bala|       0| Scala|
|   Bala|       1| Spark|
|Kavitha|       0|   SQL|
|Kavitha|       1|Hadoop|
+-------+--------+------+

Example 6: Exploding Nested Arrays

PySpark:

# Create DataFrame with nested arrays
data = [("Anand", [["Java", "Python"], ["Data", "Science"]]), 
        ("Bala", [["Scala", "Spark"], ["Big", "Data"]]), 
        ("Kavitha", [["SQL", "Hadoop"], ["Analytics"]])]
columns = ["Name", "NestedSkills"]

df = spark.createDataFrame(data, columns)

# Explode nested arrays
exploded_df = df.select("Name", explode("NestedSkills").alias("Skills"))
exploded_df.show()

Spark SQL:

SELECT Name, explode(NestedSkills) AS Skills 
FROM people;

Output:

+-------+---------------+
|   Name|         Skills|
+-------+---------------+
|  Anand| [Java, Python]|
|  Anand|[Data, Science]|
|   Bala| [Scala, Spark]|
|   Bala|    [Big, Data]|
|Kavitha|  [SQL, Hadoop]|
|Kavitha|    [Analytics]|
+-------+---------------+

Example 7: Exploding with Filtering

PySpark:

from pyspark.sql.functions import col

# Using the DataFrame with the 'Skills' column from Example 3
# Explode 'Skills' and filter rows where 'Skill' is 'Python'
exploded_df = df.select("Name", explode("Skills").alias("Skill")) \
                .filter(col("Skill") == "Python")
exploded_df.show()

Spark SQL (a SELECT alias cannot be referenced in WHERE, so use LATERAL VIEW):

SELECT Name, Skill 
FROM people 
LATERAL VIEW explode(Skills) s AS Skill 
WHERE Skill = 'Python';

Output:

+-------+------+
|   Name| Skill|
+-------+------+
|  Anand|Python|
+-------+------+

5. Common Use Cases

  • Flattening JSON or nested data structures.
  • Transforming arrays or maps into a tabular format for analysis.
  • Preparing data for machine learning by creating feature vectors.
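As an illustration of the first use case, here is a small pure-Python sketch (no Spark required) of how a record with a nested array — such as one produced by spark.read.json() — gets flattened the way explode() would flatten it. The record contents are assumed for illustration:

```python
import json

# Hypothetical JSON record with a nested array; the list comprehension
# mirrors what explode("Skills") would produce for this row.
record = json.loads('{"Name": "Anand", "Skills": ["Java", "Python"]}')
flat = [{"Name": record["Name"], "Skill": s} for s in record["Skills"]]
print(flat)
# [{'Name': 'Anand', 'Skill': 'Java'}, {'Name': 'Anand', 'Skill': 'Python'}]
```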

6. Performance Considerations

  • Use explode() judiciously on large datasets: the output has one row per array element, which can be many times the input row count.
  • explode() drops rows whose array or map is null or empty; use explode_outer() if those rows must be kept (with a NULL in the exploded column).
  • Consider using posexplode() if you need to retain the original position of elements in the array.
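A quick back-of-the-envelope for the row-growth point: the exploded row count equals the total number of array elements, roughly the input row count times the average array length. The figures below are assumptions for illustration only:

```python
# Assumed figures for illustration only
n_rows = 10_000_000        # input rows
avg_array_len = 20         # average elements per array column
exploded_rows = n_rows * avg_array_len
print(exploded_rows)  # 200000000 -- a 20x blow-up in row count
```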

7. Key Takeaways

  1. The explode() function is used to flatten array or map columns into multiple rows.
  2. It can be used with both array and map columns.
  3. Exploding large arrays or maps can increase the size of the DataFrame, so use it judiciously.
  4. In Spark SQL, similar functionality can be achieved using explode() or LATERAL VIEW.
  5. Exploding can skew partition sizes when array lengths vary widely; repartitioning after the explode helps keep downstream stages balanced on large datasets.
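For takeaway 4, a minimal sketch of the LATERAL VIEW form, assuming the same people table used in the examples above:

```sql
SELECT Name, Skill
FROM people
LATERAL VIEW explode(Skills) skills_view AS Skill;
```

LATERAL VIEW is also the way to apply more than one generator in a single query, since only one explode() is allowed directly in a SELECT list.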