Window Functions in Spark
Window functions in Spark are used to perform calculations across a set of rows that are related to the current row. Unlike aggregate functions, which return a single value for a group of rows, window functions return a value for each row in the DataFrame. They are particularly useful for tasks like ranking, cumulative sums, moving averages, and more.
1. Syntax
2. Key Components
- Window Specification: Defines the partitioning, ordering, and frame for the window function.
  - partitionBy(): Groups rows into partitions (similar to GROUP BY).
  - orderBy(): Orders rows within each partition.
  - rowsBetween() or rangeBetween(): Defines the window frame (e.g., current row, preceding rows, following rows).
- Window Functions:
  - Ranking functions: row_number(), rank(), dense_rank().
  - Analytic functions: lead(), lag().
  - Aggregate functions: sum(), avg(), min(), max().
3. Common Window Functions
- Ranking Functions:
  - row_number(): Assigns a unique sequential number to each row within a partition.
  - rank(): Assigns a rank to each row, with gaps for ties.
  - dense_rank(): Assigns a rank to each row, without gaps for ties.
- Analytic Functions:
  - lead(): Accesses the value of a column in a following row.
  - lag(): Accesses the value of a column in a preceding row.
- Aggregate Functions:
  - sum(), avg(), min(), max(): Perform aggregations over a window of rows.
4. Examples
Example 1: Using row_number() for Ranking
Example 2: Using rank() and dense_rank()
Example 3: Using lead() and lag()
Example 4: Using Aggregate Functions with Window
Example 5: Using Window Frames
Example 6: Calculating a Moving Average
Example 7: Calculating Percentiles
5. Common Use Cases
- Ranking and row numbering (e.g., top N records).
- Calculating cumulative sums or moving averages.
- Accessing values from adjacent rows (e.g., comparing current and previous values).
6. Performance Considerations
- Use window functions judiciously on large datasets, as they involve shuffling and sorting.
- Optimize partitioning and ordering to reduce the size of the data processed in each window.
7. Key Takeaways
- Window functions perform calculations across a set of rows related to the current row.
- They support partitioning, ordering, and window frames for advanced calculations.
- Window functions can be resource-intensive for large datasets, as they involve shuffling and sorting.
- In Spark SQL, similar functionality can be achieved using OVER clauses.
- Window functions work efficiently on large datasets when combined with proper partitioning and caching.