Dataset A and Dataset B) are partitioned based on the join key. This means that rows with the same join key end up in the same partition, whichever dataset they come from. The number of partitions is determined by the Spark configuration (e.g., spark.sql.shuffle.partitions).
Dataset A: (CustomerID, OrderID, Amount)
Dataset B: (CustomerID, CustomerName, City)
Join key: CustomerID.
Both datasets are partitioned by CustomerID.
Rows with the same CustomerID are shuffled to the same executor.
Rows from Dataset A and Dataset B with matching CustomerID are joined.
Result: (CustomerID, OrderID, Amount, CustomerName, City).
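
The partition, shuffle, and join steps can be illustrated with a small plain-Python simulation. This is only a sketch and does not use Spark itself: the sample rows, the NUM_PARTITIONS value, and the helper names partition_id, shuffle, and shuffle_join are all hypothetical. A simple modulo stands in for Spark's hash partitioner, and an inner join is assumed.

```python
from collections import defaultdict

# Stand-in for spark.sql.shuffle.partitions (hypothetical small value).
NUM_PARTITIONS = 4

# Hypothetical sample rows, made up for illustration.
dataset_a = [  # (CustomerID, OrderID, Amount)
    (1, 101, 25.0),
    (2, 102, 40.0),
    (1, 103, 15.5),
    (3, 104, 60.0),
]
dataset_b = [  # (CustomerID, CustomerName, City)
    (1, "Alice", "Paris"),
    (2, "Bob", "Berlin"),
    (4, "Dana", "Madrid"),
]


def partition_id(key, num_partitions=NUM_PARTITIONS):
    """Map a join key to a partition (modulo is an assumed stand-in for hashing)."""
    return key % num_partitions


def shuffle(rows, num_partitions=NUM_PARTITIONS):
    """Group rows by the partition that their join key (first column) maps to."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_id(row[0], num_partitions)].append(row)
    return partitions


def shuffle_join(left, right, num_partitions=NUM_PARTITIONS):
    """Inner join: shuffle both sides, then join each partition locally."""
    left_parts = shuffle(left, num_partitions)
    right_parts = shuffle(right, num_partitions)
    joined = []
    for pid in range(num_partitions):
        # Build a lookup from the right-hand rows of this partition only.
        lookup = defaultdict(list)
        for cust_id, name, city in right_parts.get(pid, []):
            lookup[cust_id].append((name, city))
        # Matching rows are guaranteed to share a partition, so no
        # cross-partition lookups are needed.
        for cust_id, order_id, amount in left_parts.get(pid, []):
            for name, city in lookup.get(cust_id, []):
                joined.append((cust_id, order_id, amount, name, city))
    return joined


result = shuffle_join(dataset_a, dataset_b)
for row in result:
    print(row)
```

Because both datasets are partitioned with the same function, each partition can be joined on its own. Customers 3 and 4 appear on only one side, so they drop out under the assumed inner join.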