The explode() function in Spark transforms an array or map column into multiple rows: each array element, or each key-value pair of a map, becomes a separate row in the resulting DataFrame. This is particularly useful when you have nested data structures (e.g., arrays or maps) and want to flatten them for analysis or processing.
1. Syntax
PySpark:
from pyspark.sql.functions import explode
explode(column)
2. Parameters
- column: The array or map column to explode.
3. Return Type
- Returns a Column representing a generator expression rather than a DataFrame; when used inside select() or withColumn(), it yields one output row per array element (or per key-value pair of a map). Rows whose array or map is null or empty are dropped.
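Because explode() returns a Column, it also works with withColumn(). A minimal sketch, assuming a DataFrame df with an array column Skills as in the examples below:
from pyspark.sql.functions import explode
# Replace the array column with one row per element; equivalent to the
# select() form used throughout the examples
flat_df = df.withColumn("Skill", explode("Skills")).drop("Skills")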
4. Examples
Example 1: Exploding an Array Column
PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.appName("ExplodeExample").getOrCreate()
# Create DataFrame with an array column
data = [("Anand", ["Java", "Python"]),
("Bala", ["Scala", "Spark"]),
("Kavitha", ["SQL", "Hadoop"])]
columns = ["Name", "Skills"]
df = spark.createDataFrame(data, columns)
# Explode the 'Skills' array column
exploded_df = df.select("Name", explode("Skills").alias("Skill"))
exploded_df.show()
Spark SQL:
SELECT Name, explode(Skills) AS Skill
FROM people;
Output:
+-------+------+
|   Name| Skill|
+-------+------+
|  Anand|  Java|
|  Anand|Python|
|   Bala| Scala|
|   Bala| Spark|
|Kavitha|   SQL|
|Kavitha|Hadoop|
+-------+------+
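A caveat worth knowing before moving on: explode() silently drops rows whose array is null or empty. If those rows must survive, explode_outer() emits them with a null in the generated column. A small sketch with illustrative data:
from pyspark.sql.functions import explode_outer
data = [("Anand", ["Java"]), ("Bala", None)]
df_outer = spark.createDataFrame(data, ["Name", "Skills"])
# Bala is kept with Skill = null; plain explode() would drop the row
df_outer.select("Name", explode_outer("Skills").alias("Skill")).show()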
Example 2: Exploding a Map Column
PySpark:
# Create DataFrame with a map column
data = [("Anand", {"Java": 5, "Python": 3}),
("Bala", {"Scala": 4, "Spark": 2}),
("Kavitha", {"SQL": 5, "Hadoop": 1})]
columns = ["Name", "SkillLevel"]
df = spark.createDataFrame(data, columns)
# Explode the 'SkillLevel' map column
exploded_df = df.select("Name", explode("SkillLevel").alias("Skill", "Level"))
exploded_df.show()
Spark SQL:
SELECT Name, explode(SkillLevel) AS (Skill, Level)
FROM people;
Output:
+-------+------+-----+
|   Name| Skill|Level|
+-------+------+-----+
|  Anand|  Java|    5|
|  Anand|Python|    3|
|   Bala| Scala|    4|
|   Bala| Spark|    2|
|Kavitha|   SQL|    5|
|Kavitha|Hadoop|    1|
+-------+------+-----+
Example 3: Exploding Multiple Array Columns
PySpark:
# Create DataFrame with multiple array columns
data = [("Anand", ["Java", "Python"], ["Data", "Science"]),
("Bala", ["Scala", "Spark"], ["Big", "Data"]),
("Kavitha", ["SQL", "Hadoop"], ["Analytics"])]
columns = ["Name", "Skills", "Domains"]
df = spark.createDataFrame(data, columns)
# Spark allows only one generator such as explode() per select clause,
# so the two arrays are exploded in consecutive selects; the result is
# the cross-product of Skills and Domains for each row
exploded_df = df.select("Name", explode("Skills").alias("Skill"), "Domains") \
    .select("Name", "Skill", explode("Domains").alias("Domain"))
exploded_df.show()
Spark SQL:
SELECT Name, Skill, Domain
FROM people
LATERAL VIEW explode(Skills) s AS Skill
LATERAL VIEW explode(Domains) d AS Domain;
Output:
+-------+------+---------+
|   Name| Skill|   Domain|
+-------+------+---------+
|  Anand|  Java|     Data|
|  Anand|  Java|  Science|
|  Anand|Python|     Data|
|  Anand|Python|  Science|
|   Bala| Scala|      Big|
|   Bala| Scala|     Data|
|   Bala| Spark|      Big|
|   Bala| Spark|     Data|
|Kavitha|   SQL|Analytics|
|Kavitha|Hadoop|Analytics|
+-------+------+---------+
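Chaining two explodes produces the cross-product of the arrays. If you instead want to pair elements position by position, arrays_zip() (available from Spark 2.4) is one option; a sketch under that assumption:
from pyspark.sql.functions import arrays_zip, col, explode
# Zip the arrays into one array of structs, then explode once; the
# shorter array is padded with null (e.g., Kavitha's single domain)
paired_df = df.select("Name", explode(arrays_zip("Skills", "Domains")).alias("Pair")) \
    .select("Name", col("Pair.Skills").alias("Skill"), col("Pair.Domains").alias("Domain"))
paired_df.show()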
Example 4: Exploding with Other Columns
PySpark:
# Explode 'Skills' while keeping other columns intact
exploded_df = df.select("Name", "Domains", explode("Skills").alias("Skill"))
exploded_df.show()
Spark SQL:
SELECT Name, Domains, explode(Skills) AS Skill
FROM people;
Output:
+-------+---------------+------+
|   Name|        Domains| Skill|
+-------+---------------+------+
|  Anand|[Data, Science]|  Java|
|  Anand|[Data, Science]|Python|
|   Bala|    [Big, Data]| Scala|
|   Bala|    [Big, Data]| Spark|
|Kavitha|    [Analytics]|   SQL|
|Kavitha|    [Analytics]|Hadoop|
+-------+---------------+------+
Example 5: Exploding with Position Using posexplode()
PySpark:
from pyspark.sql.functions import posexplode
# Explode 'Skills' with position
exploded_df = df.select("Name", posexplode("Skills").alias("Position", "Skill"))
exploded_df.show()
Spark SQL:
SELECT Name, posexplode(Skills) AS (Position, Skill)
FROM people;
Output:
+-------+--------+------+
|   Name|Position| Skill|
+-------+--------+------+
|  Anand|       0|  Java|
|  Anand|       1|Python|
|   Bala|       0| Scala|
|   Bala|       1| Spark|
|Kavitha|       0|   SQL|
|Kavitha|       1|Hadoop|
+-------+--------+------+
Example 6: Exploding Nested Arrays
PySpark:
# Create DataFrame with nested arrays
data = [("Anand", [["Java", "Python"], ["Data", "Science"]]),
("Bala", [["Scala", "Spark"], ["Big", "Data"]]),
("Kavitha", [["SQL", "Hadoop"], ["Analytics"]])]
columns = ["Name", "NestedSkills"]
df = spark.createDataFrame(data, columns)
# Explode nested arrays
exploded_df = df.select("Name", explode("NestedSkills").alias("Skills"))
exploded_df.show()
Spark SQL:
SELECT Name, explode(NestedSkills) AS Skills
FROM people;
Output:
+-------+---------------+
|   Name|         Skills|
+-------+---------------+
|  Anand| [Java, Python]|
|  Anand|[Data, Science]|
|   Bala| [Scala, Spark]|
|   Bala|    [Big, Data]|
|Kavitha|  [SQL, Hadoop]|
|Kavitha|    [Analytics]|
+-------+---------------+
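A single explode() unwraps only the outer level, leaving the inner arrays intact as shown above. To flatten completely, explode again in a second select; a minimal sketch:
from pyspark.sql.functions import explode
# The first explode() unwraps the outer array, the second the inner ones
fully_flat_df = df.select("Name", explode("NestedSkills").alias("Skills")) \
    .select("Name", explode("Skills").alias("Skill"))
fully_flat_df.show()
On Spark 2.4+, flatten() offers a one-step alternative: explode(flatten("NestedSkills")) concatenates the inner arrays before exploding.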
Example 7: Exploding with Filtering
PySpark:
from pyspark.sql.functions import col, explode
# Using the DataFrame with the 'Skills' array column from Example 1,
# explode 'Skills' and filter rows where 'Skill' is 'Python'
exploded_df = df.select("Name", explode("Skills").alias("Skill")) \
    .filter(col("Skill") == "Python")
exploded_df.show()
Spark SQL:
-- WHERE cannot reference the SELECT-level explode() alias,
-- so generate the Skill column with LATERAL VIEW first
SELECT Name, Skill
FROM people
LATERAL VIEW explode(Skills) s AS Skill
WHERE Skill = 'Python';
Output:
+-------+------+
|   Name| Skill|
+-------+------+
|  Anand|Python|
+-------+------+
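When only a handful of elements survive the filter, it can be cheaper to discard non-matching rows before exploding. A sketch using array_contains(); the pre-filter is an optimization assumption, not part of the original example:
from pyspark.sql.functions import array_contains, col, explode
# Drop rows whose array cannot match before paying for the explode
filtered_df = df.filter(array_contains("Skills", "Python")) \
    .select("Name", explode("Skills").alias("Skill")) \
    .filter(col("Skill") == "Python")
filtered_df.show()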
5. Common Use Cases
- Flattening JSON or nested data structures (see the sketch after this list).
- Transforming arrays or maps into a tabular format for analysis.
- Preparing data for machine learning by creating feature vectors.
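For the first use case, a common pattern is to parse a JSON string column with from_json() and then explode the resulting array. A minimal sketch; the raw data, column names, and schema are illustrative assumptions:
from pyspark.sql.functions import explode, from_json
from pyspark.sql.types import ArrayType, StringType
# Hypothetical raw column holding JSON arrays such as '["Java", "Python"]'
raw = spark.createDataFrame([("Anand", '["Java", "Python"]')], ["Name", "SkillsJson"])
parsed = raw.withColumn("Skills", from_json("SkillsJson", ArrayType(StringType())))
parsed.select("Name", explode("Skills").alias("Skill")).show()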
6. Performance Considerations
- Use explode() judiciously on large datasets, as it can significantly increase the number of rows (see the sketch after this list for a rough row-count check).
- Consider using posexplode() if you need to retain the original position of elements in the array.
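As a rough guard against the row-count blow-up mentioned in the first bullet, the exploded size can be estimated up front from the array lengths; a sketch with an illustrative threshold and partition count:
from pyspark.sql.functions import explode, size
from pyspark.sql.functions import sum as sum_
# Estimate the exploded row count before materializing it
est_rows = df.select(sum_(size("Skills"))).first()[0]
exploded_df = df.select("Name", explode("Skills").alias("Skill"))
if est_rows is not None and est_rows > 10_000_000:  # illustrative threshold
    exploded_df = exploded_df.repartition(200)      # illustrative partition count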
7. Key Takeaways
- The explode() function is used to flatten array or map columns into multiple rows.
- It can be used with both array and map columns.
- Exploding large arrays or maps can increase the size of the DataFrame, so use it judiciously.
- In Spark SQL, similar functionality can be achieved using explode() or LATERAL VIEW (see the sketch after this list).
- On large datasets, combine it with sensible partitioning; exploding wide arrays multiplies row counts and can leave partitions skewed.
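For reference, the LATERAL VIEW form of Example 1 run through spark.sql(); a minimal sketch assuming df is the Skills DataFrame from Example 1:
# Register the DataFrame under the 'people' name the SQL snippets assume
df.createOrReplaceTempView("people")
spark.sql("""
    SELECT Name, Skill
    FROM people
    LATERAL VIEW explode(Skills) s AS Skill
""").show()
LATERAL VIEW OUTER behaves like explode_outer(), keeping rows whose array is null or empty.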