The orderBy() and sort() commands in Spark sort the rows of a DataFrame by one or more columns. Both are methods of the DataFrame class, and in PySpark they are aliases of the same operation, so the two are interchangeable. Sorting is essential for organizing data in a specific order, such as ascending or descending, for analysis or reporting.

1. Syntax

PySpark:

df.orderBy(*cols, ascending=True)
df.sort(*cols, ascending=True)

Spark SQL:

SELECT * FROM table_name 
ORDER BY col1 [ASC|DESC], col2 [ASC|DESC], ...;

2. Parameters

  • cols: One or more column names (as strings) or Column objects to sort by.
  • ascending: A boolean, or a list of booleans (one per column), specifying the sort order. Default is True (ascending).
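
The ascending flag and Column sort expressions are interchangeable ways to express order. A minimal sketch (assuming the Name/Age DataFrame df built in Example 1 below):

from pyspark.sql.functions import col, desc

# These three calls produce the same result: rows sorted by Age, descending
df.orderBy("Age", ascending=False)
df.orderBy(col("Age").desc())
df.orderBy(desc("Age"))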

3. Return Type

  • Returns a new DataFrame with rows sorted based on the specified columns.
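
Like other transformations, the sort is lazy: the original DataFrame is left untouched and nothing executes until an action runs. A minimal sketch, assuming a DataFrame df with an Age column:

sorted_df = df.orderBy("Age")   # no job runs yet (lazy transformation)
sorted_df.show()                # the shuffle and sort execute here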

4. Examples

Example 1: Sorting by a Single Column in Ascending Order

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SortExample").getOrCreate()

# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Sort by 'Age' in ascending order
sorted_df = df.orderBy("Age")
sorted_df.show()

Spark SQL:

SELECT * FROM people 
ORDER BY Age ASC;

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 25|
|Kavitha| 28|
|   Bala| 30|
|    Raj| 35|
+-------+---+

Example 2: Sorting by a Single Column in Descending Order

PySpark:

# Sort by 'Age' in descending order
sorted_df = df.orderBy("Age", ascending=False)
sorted_df.show()

Spark SQL:

SELECT * FROM people 
ORDER BY Age DESC;

Output:

+-------+---+
|   Name|Age|
+-------+---+
|    Raj| 35|
|   Bala| 30|
|Kavitha| 28|
|  Anand| 25|
+-------+---+

Example 3: Sorting by Multiple Columns

PySpark:

# Add a 'Salary' column
data = [("Anand", 25, 3000), ("Bala", 30, 4000), ("Kavitha", 28, 3500), ("Raj", 35, 4500)]
columns = ["Name", "Age", "Salary"]

df = spark.createDataFrame(data, columns)

# Sort by 'Age' in ascending order and 'Salary' in descending order
sorted_df = df.orderBy(["Age", "Salary"], ascending=[True, False])
sorted_df.show()

Spark SQL:

SELECT * FROM people 
ORDER BY Age ASC, Salary DESC;

Output:

+-------+---+------+
|   Name|Age|Salary|
+-------+---+------+
|  Anand| 25|  3000|
|Kavitha| 28|  3500|
|   Bala| 30|  4000|
|    Raj| 35|  4500|
+-------+---+------+

Example 4: Using sort() Instead of orderBy()

PySpark:

# Sort by 'Age' in ascending order using `sort()`
sorted_df = df.sort("Age")
sorted_df.show()

Spark SQL:

SELECT * FROM people 
ORDER BY Age ASC;

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 25|
|Kavitha| 28|
|   Bala| 30|
|    Raj| 35|
+-------+---+

Example 5: Sorting with Null Values

PySpark:

# Add a row with a null value
data = [("Anand", 25), ("Bala", None), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

from pyspark.sql.functions import col

# Sort by 'Age' with nulls first. orderBy() has no nullsFirst argument;
# null placement is controlled with the Column method asc_nulls_first()
sorted_df = df.orderBy(col("Age").asc_nulls_first())
sorted_df.show()

Spark SQL:

SELECT * FROM people 
ORDER BY Age ASC NULLS FIRST;

Output:

+-------+----+
|   Name| Age|
+-------+----+
|   Bala|null|
|  Anand|  25|
|Kavitha|  28|
|    Raj|  35|
+-------+----+

Example 6: Sorting with Nulls Last

PySpark:

# Sort by 'Age' with nulls last, again via a Column method: asc_nulls_last()
sorted_df = df.orderBy(col("Age").asc_nulls_last())
sorted_df.show()

Spark SQL:

SELECT * FROM people 
ORDER BY Age ASC NULLS LAST;

Output:

+-------+----+
|   Name| Age|
+-------+----+
|  Anand|  25|
|Kavitha|  28|
|    Raj|  35|
|   Bala|null|
+-------+----+

Example 7: Sorting by Expression

PySpark:

from pyspark.sql.functions import length

# Sort by the length of the 'Name' column (an expression, not the column
# itself), using the Name/Age DataFrame from Example 1
sorted_df = df.orderBy(length("Name").asc())
sorted_df.show()

Spark SQL:

SELECT * FROM people 
ORDER BY LENGTH(Name) ASC;

Output:

+-------+---+
|   Name|Age|
+-------+---+
|    Raj| 35|
|   Bala| 30|
|  Anand| 25|
|Kavitha| 28|
+-------+---+

5. Common Use Cases

  • Sorting data for display in reports or dashboards.
  • Preparing data for machine learning by ordering features or labels.
  • Sorting data before performing window operations (e.g., ranking, cumulative sums); see the sketch below.
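
For the window-operation use case, a minimal sketch; the Dept column and the salary figures here are illustrative and not taken from the examples above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rank
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("WindowSortExample").getOrCreate()

data = [("Anand", "HR", 3000), ("Bala", "HR", 4000),
        ("Kavitha", "IT", 3500), ("Raj", "IT", 4500)]
df = spark.createDataFrame(data, ["Name", "Dept", "Salary"])

# The sort order lives inside the window spec: rank employees
# within each department by descending salary
w = Window.partitionBy("Dept").orderBy(col("Salary").desc())
df.withColumn("rank", rank().over(w)).show()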

6. Performance Considerations

  • Use orderBy() or sort() judiciously on large datasets: producing a total order requires a full shuffle of the data across the cluster.
  • When a global order is not required, sortWithinPartitions() sorts rows inside each partition without shuffling; repartition() or coalesce() can also help tune the partition layout around a sort.
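
A short sketch of the trade-off, assuming a DataFrame df with an Age column:

# Total order across the whole DataFrame: triggers a full shuffle
globally_sorted = df.orderBy("Age")

# Per-partition order only: rows are sorted inside each existing
# partition, so no shuffle is required
locally_sorted = df.sortWithinPartitions("Age")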

7. Key Takeaways

  1. The orderBy() and sort() commands sort the rows of a DataFrame based on one or more columns.
  2. In PySpark the two are aliases, and both support sorting in ascending or descending order.
  3. In Spark SQL, the same functionality is provided by ORDER BY.
  4. Sorting a whole DataFrame involves a shuffle, so combine it with sensible partitioning and caching to keep it efficient on large datasets.