The orderBy() and sort() commands in Spark sort the rows of a DataFrame by one or more columns. The two are interchangeable: in PySpark, orderBy() is an alias for sort(), and both are methods of the DataFrame class. Sorting is essential for organizing data in a specific order, such as ascending or descending, for analysis or reporting.
1. Syntax
PySpark:
df.orderBy(*cols, ascending=True)
df.sort(*cols, ascending=True)
Spark SQL:
SELECT * FROM table_name
ORDER BY col1 [ASC|DESC], col2 [ASC|DESC], ...;
2. Parameters
- cols: A list of column names (as strings) or Column objects to sort by.
- ascending: A boolean, or a list of booleans (one per column), specifying the sort order. Default is True (ascending).
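The per-column booleans can be pictured with a plain-Python analogy (ordinary sorted(), not Spark). The ages here are tweaked from the document's sample data to create a tie, so the second key matters:

```python
# Plain-Python analogy of orderBy(["Age", "Salary"], ascending=[True, False]):
# sort by Age ascending, break ties by Salary descending (negate the numeric key).
rows = [("Anand", 25, 3000), ("Bala", 28, 4000),
        ("Kavitha", 28, 3500), ("Raj", 35, 4500)]
ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
print(ordered)
# [('Anand', 25, 3000), ('Bala', 28, 4000), ('Kavitha', 28, 3500), ('Raj', 35, 4500)]
```

Bala and Kavitha share Age 28, so the higher Salary (4000) comes first, mirroring ascending=[True, False].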
3. Return Type
- Returns a new DataFrame with rows sorted based on the specified columns.
4. Examples
Example 1: Sorting by a Single Column in Ascending Order
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SortExample").getOrCreate()
# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Sort by 'Age' in ascending order
sorted_df = df.orderBy("Age")
sorted_df.show()
Spark SQL:
SELECT * FROM people
ORDER BY Age ASC;
Output:
+-------+---+
| Name|Age|
+-------+---+
| Anand| 25|
|Kavitha| 28|
| Bala| 30|
| Raj| 35|
+-------+---+
Example 2: Sorting by a Single Column in Descending Order
PySpark:
# Sort by 'Age' in descending order
sorted_df = df.orderBy("Age", ascending=False)
sorted_df.show()
Spark SQL:
SELECT * FROM people
ORDER BY Age DESC;
Output:
+-------+---+
| Name|Age|
+-------+---+
| Raj| 35|
| Bala| 30|
|Kavitha| 28|
| Anand| 25|
+-------+---+
Example 3: Sorting by Multiple Columns
PySpark:
# Add a 'Salary' column
data = [("Anand", 25, 3000), ("Bala", 30, 4000), ("Kavitha", 28, 3500), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)
# Sort by 'Age' in ascending order and 'Salary' in descending order
sorted_df = df.orderBy(["Age", "Salary"], ascending=[True, False])
sorted_df.show()
Spark SQL:
SELECT * FROM people
ORDER BY Age ASC, Salary DESC;
Output:
+-------+---+------+
| Name|Age|Salary|
+-------+---+------+
| Anand| 25| 3000|
|Kavitha| 28| 3500|
| Bala| 30| 4000|
| Raj| 35| 4500|
+-------+---+------+
Example 4: Using sort() Instead of orderBy()
PySpark:
# Sort by 'Age' in ascending order using `sort()`
sorted_df = df.sort("Age")
sorted_df.show()
Spark SQL:
SELECT * FROM people
ORDER BY Age ASC;
Output:
+-------+---+
| Name|Age|
+-------+---+
| Anand| 25|
|Kavitha| 28|
| Bala| 30|
| Raj| 35|
+-------+---+
Example 5: Sorting with Null Values
PySpark:
# Add a row with a null value
data = [("Anand", 25), ("Bala", None), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Sort by 'Age' with nulls first. orderBy() has no nullsFirst parameter;
# use the Column method asc_nulls_first() instead.
from pyspark.sql.functions import col
sorted_df = df.orderBy(col("Age").asc_nulls_first())
sorted_df.show()
Spark SQL:
SELECT * FROM people
ORDER BY Age ASC NULLS FIRST;
Output:
+-------+----+
| Name| Age|
+-------+----+
| Bala|null|
| Anand| 25|
|Kavitha| 28|
| Raj| 35|
+-------+----+
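The NULLS FIRST behavior can be mimicked in plain Python (ordinary sorted(), not Spark) with a tuple key whose first element pushes None ahead of every real value:

```python
# Plain-Python analogy of ORDER BY Age ASC NULLS FIRST:
# the key (v is not None, ...) sorts None first, since False orders before True.
ages = [25, None, 28, 35]
nulls_first = sorted(ages, key=lambda v: (v is not None, v if v is not None else 0))
print(nulls_first)  # [None, 25, 28, 35]
```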
Example 6: Sorting by Multiple Columns with Mixed Order
PySpark:
# Mixed order via Column objects: 'Age' ascending, 'Salary' descending
from pyspark.sql.functions import col
sorted_df = df.orderBy(col("Age").asc(), col("Salary").desc())
sorted_df.show()
Spark SQL:
SELECT * FROM people
ORDER BY Age ASC, Salary DESC;
Output:
+-------+---+------+
| Name|Age|Salary|
+-------+---+------+
| Anand| 25| 3000|
|Kavitha| 28| 3500|
| Bala| 30| 4000|
| Raj| 35| 4500|
+-------+---+------+
Example 7: Sorting with Nulls Last
PySpark:
# Sort by 'Age' with nulls last. orderBy() has no nullsLast parameter;
# use the Column method asc_nulls_last() instead.
from pyspark.sql.functions import col
sorted_df = df.orderBy(col("Age").asc_nulls_last())
sorted_df.show()
Spark SQL:
SELECT * FROM people
ORDER BY Age ASC NULLS LAST;
Output:
+-------+----+
| Name| Age|
+-------+----+
| Anand| 25|
|Kavitha| 28|
| Raj| 35|
| Bala|null|
+-------+----+
Example 8: Sorting by Expression
PySpark:
from pyspark.sql.functions import length
# Sort by the length of the 'Name' column
sorted_df = df.orderBy(length("Name").asc())
sorted_df.show()
Spark SQL:
SELECT * FROM people
ORDER BY LENGTH(Name) ASC;
Output:
+-------+---+
| Name|Age|
+-------+---+
|    Raj| 35|
|   Bala| 30|
|  Anand| 25|
|Kavitha| 28|
+-------+---+
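The LENGTH(Name) ordering is easy to sanity-check in plain Python (ordinary sorted(), not Spark):

```python
# Plain-Python analogy of ORDER BY LENGTH(Name) ASC:
# len() plays the role of the LENGTH() SQL function as the sort key.
names = ["Anand", "Bala", "Kavitha", "Raj"]
by_length = sorted(names, key=len)
print(by_length)  # ['Raj', 'Bala', 'Anand', 'Kavitha']
```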
5. Common Use Cases
- Sorting data for display in reports or dashboards.
- Preparing data for machine learning by ordering features or labels.
- Sorting data before performing window operations (e.g., ranking, cumulative sums).
6. Performance Considerations
- Use orderBy() or sort() judiciously on large datasets: a global sort requires shuffling data across the cluster before sorting.
- Consider using repartition() or coalesce() to optimize performance when working with large datasets.
7. Key Takeaways
- The orderBy() and sort() commands sort the rows of a DataFrame by one or more columns.
- Both commands are interchangeable and support sorting in ascending or descending order, per column.
- In Spark SQL, the same functionality is achieved with the ORDER BY clause.
- Sorting works efficiently on large datasets when combined with proper partitioning and caching.