Spark: explode function
The explode() function in Spark transforms an array or map column into multiple rows. Each element of an array (or each key-value pair of a map) becomes a separate row in the resulting DataFrame. This is particularly useful when you have nested data structures (e.g., arrays or maps) and want to flatten them for analysis or processing.
1. Syntax
PySpark:
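A minimal signature sketch; `df` and the column name are placeholders for your own DataFrame and array/map column:

```python
from pyspark.sql.functions import explode

# Basic pattern: explode(col), where col is an array or map column
# given as a Column object or a column name string
exploded_df = df.select(explode("array_or_map_column").alias("value"))
```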
2. Parameters
- column: The array or map column to explode.
3. Return Type
- Returns a Column expression; when used in a select() or withColumn(), each array element (or map key-value pair) becomes a separate row in the resulting DataFrame.
4. Examples
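The examples below use small, hand-built DataFrames; the sample data, temporary view names, and outputs shown are illustrative assumptions rather than results from any specific dataset.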
Example 1: Exploding an Array Column
PySpark:
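A sketch with assumed sample data (two people, each with a list of fruits):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-array").getOrCreate()

df = spark.createDataFrame(
    [("Alice", ["apple", "banana"]), ("Bob", ["orange"])],
    ["name", "fruits"],
)
df.createOrReplaceTempView("people")  # used by the SQL version below

# Each element of the 'fruits' array becomes its own row
exploded_df = df.select(explode("fruits").alias("fruit"))
exploded_df.show()
```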
Spark SQL:
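An equivalent query against the assumed people view:

```sql
SELECT explode(fruits) AS fruit
FROM people;
```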
Output:
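For the sample data above, both versions print:

```
+------+
| fruit|
+------+
| apple|
|banana|
|orange|
+------+
```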
Example 2: Exploding a Map Column
PySpark:
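A sketch with an assumed map column of user properties:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-map").getOrCreate()

df = spark.createDataFrame(
    [("Alice", {"age": "30", "city": "Paris"})],
    ["name", "properties"],
)
df.createOrReplaceTempView("users")  # used by the SQL version below

# For a map column, explode() produces two columns: key and value
exploded_df = df.select("name", explode("properties"))
exploded_df.show()
```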
Spark SQL:
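An equivalent query against the assumed users view; on a map column, explode() yields key and value columns:

```sql
SELECT name, explode(properties)
FROM users;
```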
Output:
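For the sample data above (key order may vary):

```
+-----+----+-----+
| name| key|value|
+-----+----+-----+
|Alice| age|   30|
|Alice|city|Paris|
+-----+----+-----+
```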
Example 3: Exploding Multiple Array Columns
PySpark:
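Spark allows only one generator (such as explode) per select clause, so this sketch chains two selects; note that it produces the cross product of the two arrays for each row. The sample data is assumed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-multi").getOrCreate()

df = spark.createDataFrame(
    [("Alice", ["apple", "banana"], ["red", "yellow"])],
    ["name", "fruits", "colors"],
)
df.createOrReplaceTempView("baskets")  # used by the SQL version below

exploded_df = (
    df.select("name", explode("fruits").alias("fruit"), "colors")
      .select("name", "fruit", explode("colors").alias("color"))
)
exploded_df.show()
```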
Spark SQL:
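An equivalent query uses one LATERAL VIEW per array:

```sql
SELECT name, fruit, color
FROM baskets
LATERAL VIEW explode(fruits) f AS fruit
LATERAL VIEW explode(colors) c AS color;
```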
Output:
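For the sample data above:

```
+-----+------+------+
| name| fruit| color|
+-----+------+------+
|Alice| apple|   red|
|Alice| apple|yellow|
|Alice|banana|   red|
|Alice|banana|yellow|
+-----+------+------+
```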
Example 4: Exploding with Other Columns
PySpark:
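A sketch (with assumed data) showing that non-exploded columns are simply repeated on every generated row:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-other-cols").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", ["apple", "banana"]), (2, "Bob", ["orange"])],
    ["id", "name", "fruits"],
)
df.createOrReplaceTempView("customers")  # used by the SQL version below

exploded_df = df.select("id", "name", explode("fruits").alias("fruit"))
exploded_df.show()
```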
Spark SQL:
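An equivalent query against the assumed customers view:

```sql
SELECT id, name, explode(fruits) AS fruit
FROM customers;
```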
Output:
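For the sample data above:

```
+---+-----+------+
| id| name| fruit|
+---+-----+------+
|  1|Alice| apple|
|  1|Alice|banana|
|  2|  Bob|orange|
+---+-----+------+
```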
Example 5: Exploding with Position Using posexplode()
PySpark:
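A sketch using posexplode(), which also returns each element's zero-based position; the data is assumed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import posexplode

spark = SparkSession.builder.appName("posexplode").getOrCreate()

df = spark.createDataFrame(
    [("Alice", ["apple", "banana", "cherry"])],
    ["name", "fruits"],
)
df.createOrReplaceTempView("shoppers")  # used by the SQL version below

# posexplode() yields two columns: the element's position and the element itself
exploded_df = df.select("name", posexplode("fruits").alias("pos", "fruit"))
exploded_df.show()
```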
Spark SQL:
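An equivalent query against the assumed shoppers view; the multi-column alias names the position and value columns:

```sql
SELECT name, posexplode(fruits) AS (pos, fruit)
FROM shoppers;
```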
Output:
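For the sample data above:

```
+-----+---+------+
| name|pos| fruit|
+-----+---+------+
|Alice|  0| apple|
|Alice|  1|banana|
|Alice|  2|cherry|
+-----+---+------+
```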
Example 6: Exploding Nested Arrays
PySpark:
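A sketch with an assumed array-of-arrays column; one explode() per nesting level, applied in separate selects:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-nested").getOrCreate()

# 'groups' is an array of arrays
df = spark.createDataFrame(
    [("Alice", [["apple", "banana"], ["cherry"]])],
    ["name", "groups"],
)
df.createOrReplaceTempView("nested_baskets")  # used by the SQL version below

exploded_df = (
    df.select("name", explode("groups").alias("inner_fruits"))  # flatten outer array
      .select("name", explode("inner_fruits").alias("fruit"))   # flatten inner arrays
)
exploded_df.show()
```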
Spark SQL:
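An equivalent query chains two LATERAL VIEWs against the assumed nested_baskets view:

```sql
SELECT name, fruit
FROM nested_baskets
LATERAL VIEW explode(groups) g AS inner_fruits
LATERAL VIEW explode(inner_fruits) f AS fruit;
```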
Output:
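For the sample data above:

```
+-----+------+
| name| fruit|
+-----+------+
|Alice| apple|
|Alice|banana|
|Alice|cherry|
+-----+------+
```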
Example 7: Exploding with Filtering
PySpark:
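A sketch that explodes and then filters the generated rows; the sample data and the filter condition (fruits starting with "a") are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("explode-filter").getOrCreate()

df = spark.createDataFrame(
    [("Alice", ["apple", "banana"]), ("Bob", ["orange", "avocado"])],
    ["name", "fruits"],
)
df.createOrReplaceTempView("purchases")  # used by the SQL version below

# Keep only exploded rows whose fruit starts with 'a'
exploded_df = (
    df.select("name", explode("fruits").alias("fruit"))
      .filter(col("fruit").startswith("a"))
)
exploded_df.show()
```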
Spark SQL:
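An equivalent query filters the rows produced by the LATERAL VIEW:

```sql
SELECT name, fruit
FROM purchases
LATERAL VIEW explode(fruits) f AS fruit
WHERE fruit LIKE 'a%';
```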
Output:
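For the sample data above:

```
+-----+-------+
| name|  fruit|
+-----+-------+
|Alice|  apple|
|  Bob|avocado|
+-----+-------+
```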
5. Common Use Cases
- Flattening JSON or nested data structures.
- Transforming arrays or maps into a tabular format for analysis.
- Preparing data for machine learning by creating feature vectors.
6. Performance Considerations
- Use explode() judiciously on large datasets, as it can significantly increase the number of rows.
- Consider using posexplode() if you need to retain the original position of elements in the array.
7. Key Takeaways
- The explode() function is used to flatten array or map columns into multiple rows.
- It can be used with both array and map columns.
- Exploding large arrays or maps can increase the size of the DataFrame, so use it judiciously.
- In Spark SQL, similar functionality can be achieved using explode() or LATERAL VIEW.
- Works efficiently on large datasets when combined with proper partitioning and caching.