Reference
Spark: pivot function
The pivot() function in Spark transforms rows into columns, rotating data from a long format to a wide format. This is particularly useful for creating summary or pivot tables, where you aggregate data and present it in a more readable layout.
1. Syntax
PySpark:
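The original PySpark snippet is not preserved on the page; the general shape of the call (placeholder names in angle brackets) is:

```
df.groupBy(<grouping_cols>).pivot(<pivot_col>[, <pivot_values>]).agg(<agg_func>)
```

pivot() is a method on the GroupedData object returned by groupBy(), and agg() completes the operation; the optional <pivot_values> list restricts which values become columns.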
Spark SQL:
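The original Spark SQL snippet is likewise missing; since Spark 2.4 the PIVOT clause has this general form (placeholders in angle brackets):

```
SELECT *
FROM <table_or_subquery>
PIVOT (
    <agg_func>(<value_col>) FOR <pivot_col> IN (<value1>, <value2>, ...)
)
```

Any remaining columns of the FROM relation act as the implicit grouping columns.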
2. Parameters
- grouping_cols: Columns to group by (rows in the resulting pivot table).
- pivot_col: The column whose unique values will become new columns in the pivot table.
- agg_func: The aggregation function to apply to the values (e.g., sum(), count(), avg()).
3. Return Type
- Returns a new DataFrame with the pivoted data.
4. Examples
Example 1: Basic Pivot with Sum Aggregation
PySpark:
Spark SQL:
Output: one row per grouping value, with one column of summed values per distinct pivot value.
Example 2: Pivot with Multiple Grouping Columns
PySpark:
Spark SQL:
Output: one row per combination of the grouping columns, with one pivoted column per distinct pivot value.
Example 3: Pivot with Multiple Aggregation Functions
PySpark:
Spark SQL:
Output: one column per (pivot value, aggregation) pair; Spark names them pivotValue_aggregationAlias.
Example 4: Pivot with Specified Pivot Column Values
PySpark:
Spark SQL:
Output: only the explicitly listed pivot values appear as columns; rows with other pivot values are dropped.
Example 5: Pivot with Null Handling
PySpark:
Spark SQL:
Output: combinations with no matching rows yield null cells unless they are filled (e.g., with na.fill() or COALESCE).
Example 6: Pivot with Multiple Aggregations and Null Handling
PySpark:
Spark SQL:
Output: one column per (pivot value, aggregation) pair, with nulls replaced wherever a fill is applied.
5. Common Use Cases
- Summarizing data for reporting (e.g., sales by region, expenses by category).
- Preparing data for visualization (e.g., pivot tables in dashboards).
- Transforming data for machine learning (e.g., creating feature matrices).
6. Performance Considerations
- Use pivot() judiciously on large datasets, as it involves shuffling and sorting.
- Specify pivot column values explicitly: this skips the extra job Spark runs to collect the distinct values and bounds the number of output columns.
- Use proper partitioning and caching to optimize pivot operations.
7. Key Takeaways
- The pivot() function transforms rows into columns, creating a pivot table.
- It supports grouping by multiple columns and applying various aggregation functions.
- Pivoting can be resource-intensive for large datasets, as it involves shuffling and sorting.
- In Spark SQL, the same result can be achieved with the PIVOT clause (Spark 2.4+) or with CASE statements combined with aggregation functions.
- Works efficiently on large datasets when combined with proper partitioning and caching.