The broadcast(df) function in Spark explicitly broadcasts a DataFrame or Dataset to all nodes in the cluster. Broadcasting optimizes join operations by sending a small DataFrame to every worker node, reducing the amount of data shuffled across the network. This is particularly useful when joining a large DataFrame with a small one.
1. Syntax
PySpark:
from pyspark.sql.functions import broadcast
broadcast_df = broadcast(df)
Spark SQL:
- There is no direct equivalent function in Spark SQL, but you can use the BROADCAST hint in SQL queries (placed immediately after the SELECT keyword).
2. Key Features
- Optimization: Reduces data shuffling by sending a small DataFrame to all worker nodes.
- Efficiency: Improves the performance of join operations when one DataFrame is small.
- Automatic Broadcasting: Spark automatically broadcasts small DataFrames (based on spark.sql.autoBroadcastJoinThreshold), but you can use broadcast() to explicitly control broadcasting.
3. Parameters
- df: The DataFrame or Dataset to broadcast.
4. Examples
Example 1: Broadcasting a Small DataFrame for a Join
PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
spark = SparkSession.builder.appName("BroadcastExample").getOrCreate()
# Create large DataFrame
large_data = [("Anand", 101), ("Bala", 102), ("Kavitha", 103), ("Raj", 104)]
large_columns = ["Name", "DeptID"]
large_df = spark.createDataFrame(large_data, large_columns)
# Create small DataFrame
small_data = [(101, "Sales"), (102, "HR"), (103, "Finance")]
small_columns = ["DeptID", "DeptName"]
small_df = spark.createDataFrame(small_data, small_columns)
# Broadcast the small DataFrame
broadcast_df = broadcast(small_df)
# Perform a join using the broadcast DataFrame
joined_df = large_df.join(broadcast_df, "DeptID")
joined_df.show()
Spark SQL:
SELECT /*+ BROADCAST(small_table) */ *
FROM large_table
JOIN small_table
ON large_table.DeptID = small_table.DeptID;
Output:
+------+-------+--------+
|DeptID| Name|DeptName|
+------+-------+--------+
| 101| Anand| Sales|
| 102| Bala| HR|
| 103|Kavitha| Finance|
+------+-------+--------+
Example 2: Broadcasting a DataFrame with Aggregations
PySpark:
from pyspark.sql.functions import sum
# Aggregate the small DataFrame (illustrative only: summing the group key,
# so TotalDeptID simply equals DeptID here)
aggregated_df = small_df.groupBy("DeptID").agg(sum("DeptID").alias("TotalDeptID"))
# Broadcast the aggregated DataFrame
broadcast_df = broadcast(aggregated_df)
# Perform a join using the broadcast DataFrame
joined_df = large_df.join(broadcast_df, "DeptID")
joined_df.show()
Output:
+------+-------+-----------+
|DeptID| Name|TotalDeptID|
+------+-------+-----------+
| 101| Anand| 101|
| 102| Bala| 102|
| 103|Kavitha| 103|
+------+-------+-----------+
Example 3: Broadcasting a DataFrame with Filters
PySpark:
from pyspark.sql.functions import col
# Filter the small DataFrame
filtered_df = small_df.filter(col("DeptName") == "Sales")
# Broadcast the filtered DataFrame
broadcast_df = broadcast(filtered_df)
# Perform a join using the broadcast DataFrame
joined_df = large_df.join(broadcast_df, "DeptID")
joined_df.show()
Output:
+------+-----+--------+
|DeptID| Name|DeptName|
+------+-----+--------+
| 101|Anand| Sales|
+------+-----+--------+
Example 4: Broadcasting a DataFrame with Complex Data
PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define schema with complex data
schema = StructType([
StructField("DeptID", IntegerType(), True),
StructField("DeptName", StringType(), True),
StructField("Location", StringType(), True)
])
# Create small DataFrame with complex data
small_data = [(101, "Sales", "Chennai"), (102, "HR", "Bangalore"), (103, "Finance", "Hyderabad")]
small_df = spark.createDataFrame(small_data, schema)
# Broadcast the small DataFrame
broadcast_df = broadcast(small_df)
# Perform a join using the broadcast DataFrame
joined_df = large_df.join(broadcast_df, "DeptID")
joined_df.show()
Output:
+------+-------+--------+---------+
|DeptID| Name|DeptName| Location|
+------+-------+--------+---------+
| 101| Anand| Sales| Chennai|
| 102| Bala| HR|Bangalore|
| 103|Kavitha| Finance|Hyderabad|
+------+-------+--------+---------+
Example 5: Broadcasting a DataFrame with Multiple Joins
PySpark:
# Create another small DataFrame
location_data = [(101, "India"), (102, "India"), (103, "India")]
location_columns = ["DeptID", "Country"]
location_df = spark.createDataFrame(location_data, location_columns)
# Broadcast both small DataFrames
broadcast_small_df = broadcast(small_df)
broadcast_location_df = broadcast(location_df)
# Perform multiple joins using the broadcast DataFrames
joined_df = large_df.join(broadcast_small_df, "DeptID").join(broadcast_location_df, "DeptID")
joined_df.show()
Output:
+------+-------+--------+---------+-------+
|DeptID| Name|DeptName| Location|Country|
+------+-------+--------+---------+-------+
| 101| Anand| Sales| Chennai| India|
| 102| Bala| HR|Bangalore| India|
| 103|Kavitha| Finance|Hyderabad| India|
+------+-------+--------+---------+-------+
Example 6: Broadcasting a DataFrame with Custom Configuration
PySpark:
# Set a custom threshold for automatic broadcasting
# (an explicit broadcast() call is honored regardless of this setting)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
# Broadcast the small DataFrame
broadcast_df = broadcast(small_df)
# Perform a join using the broadcast DataFrame
joined_df = large_df.join(broadcast_df, "DeptID")
joined_df.show()
Output:
+------+-------+--------+
|DeptID| Name|DeptName|
+------+-------+--------+
| 101| Anand| Sales|
| 102| Bala| HR|
| 103|Kavitha| Finance|
+------+-------+--------+
5. Common Use Cases
- Joining a large DataFrame with a small DataFrame.
- Optimizing performance by reducing network overhead.
- Explicitly controlling broadcasting for better performance.
6. Performance Considerations
- Memory Usage: Broadcasting a large DataFrame can lead to out-of-memory errors. Use it only for small DataFrames.
- Automatic Broadcasting: Spark automatically broadcasts small DataFrames (based on spark.sql.autoBroadcastJoinThreshold), but you can use broadcast() to explicitly control broadcasting.
7. Key Takeaways
- The broadcast(df) function explicitly broadcasts a DataFrame or Dataset to all nodes in the cluster.
- Reduces data shuffling and improves the performance of join operations.
- Broadcasting is efficient for small DataFrames but should be avoided for large DataFrames due to memory constraints.