Spark Join Strategies
A join is one of the most common data processing operations. As a unified big data processing engine, Spark supports a rich variety of join scenarios. This article covers the different join strategies Spark employs to perform a join.
1. Broadcast hash join
When joining a fact table with a dimension table, we can avoid a shuffle by distributing all the data of the (size-limited) dimension table to every node, so that the executors processing the fact table hold the entire dimension table in memory. To some extent this sacrifices space in exchange for avoiding the large time cost of a shuffle operation. In Spark SQL this is called a Broadcast Join, as shown below:

Broadcast hash join mainly includes two phases:
- Broadcast phase: the small table is broadcast to and cached on each executor
- Hash join phase: the hash join is executed within each executor
Things to Note:
- The broadcasted relation should fit completely into the memory of each executor as well as the driver, because the driver initiates the data transfer.
- Only supported for ‘=’ joins.
- Supported for all join types (inner, left, right) except full outer join.
- When the broadcast size is small, it is usually faster than other join strategies.
- Spark deploys this join strategy when the size of one of the join relations is less than a threshold (10 MB by default). The Spark property that defines this threshold is spark.sql.autoBroadcastJoinThreshold (configurable).
Drawbacks: this algorithm can only broadcast smaller tables; otherwise the cost of redundantly transmitting the data far exceeds that of a shuffle. In addition, the table must first be collected to the driver before it is broadcast, so frequent broadcasts also put the driver’s memory to the test.
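The two phases described above can be sketched in plain Python (a single-process illustration of the algorithm, not Spark's actual implementation; all table and column names here are made up):

```python
def broadcast_hash_join(fact_partitions, dim_rows, fact_key, dim_key):
    """Join partitioned fact rows against a small broadcast dimension table."""
    # Broadcast phase: build a hash table from the small (dimension) table;
    # in Spark this table is shipped to and cached on every executor.
    hash_table = {}
    for row in dim_rows:
        hash_table.setdefault(row[dim_key], []).append(row)

    # Hash join phase: each fact partition probes the broadcast hash table
    # locally, so no shuffle of the fact table is needed.
    joined = []
    for partition in fact_partitions:
        for row in partition:
            for match in hash_table.get(row[fact_key], []):
                joined.append({**row, **match})
    return joined


# Two fact partitions and a tiny dimension table for illustration:
fact = [[{"id": 1, "amount": 10}], [{"id": 2, "amount": 20}]]
dim = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(broadcast_hash_join(fact, dim, "id", "id"))
# → [{'id': 1, 'amount': 10, 'name': 'a'}, {'id': 2, 'amount': 20, 'name': 'b'}]
```

Note that the hash table is built once and shared read-only across partitions, which is exactly why the broadcast side must fit in memory.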
2. Shuffle Hash Join
This strategy is used when sort merge join is disabled or when the join columns don’t satisfy certain conditions. Since it builds a hash table of the smaller dataset, that dataset should not be too big. In a few scenarios shuffle hash join is faster than sort merge join, but most of the time the latter is the better choice due to its performance and scalability.
When the tables to be joined are both large, you can select shuffle hash join. The tables are repartitioned according to the join key, which guarantees that rows with the same join key are sent to the same partition.

As shown in the figure above, the basic steps of shuffle hash join are as follows:
- First, the two tables participating in the join are repartitioned according to the join key. This step involves a shuffle; its purpose is to send rows with the same join key to the same partition, so the join can be performed within each partition.
- Second, for each partition after the shuffle, the small table’s partition data is built into a hash table, which is then probed with the large table’s partition records by join key.
Things to Note:
- Only supported for ‘=’ joins.
- The join keys don’t need to be sortable.
- Supported for all join types except full outer joins.
- It’s an expensive join in that it involves both shuffling and hashing (the hash join explained above). Maintaining a hash table requires memory and computation.
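The two steps above can be sketched in plain Python (an illustration of the algorithm, not Spark internals; the partitioning function and column names are made up):

```python
def repartition(rows, key, num_partitions):
    """Shuffle step: rows with the same join key land in the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def shuffle_hash_join(big, small, key, num_partitions=4):
    # Step 1: repartition both sides by the join key (the shuffle).
    big_parts = repartition(big, key, num_partitions)
    small_parts = repartition(small, key, num_partitions)

    joined = []
    # Step 2: per partition, hash the small side and probe with the big side.
    for big_part, small_part in zip(big_parts, small_parts):
        table = {}
        for row in small_part:
            table.setdefault(row[key], []).append(row)
        for row in big_part:
            for match in table.get(row[key], []):
                joined.append({**row, **match})
    return joined
```

Because matching keys are guaranteed to share a partition after the shuffle, each partition can be joined independently, and only the small side's partition (not the whole table) needs to fit in memory at once.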
3. Sort Merge Join
Sort merge join is Spark’s default join mechanism. It can be configured via the parameter spark.sql.join.preferSortMergeJoin, whose default value is true, meaning sort merge join is preferred. This method is generally used when two large tables are joined. Sort merge join can reduce data transmission in the cluster. Rather than first loading all the data into memory and then performing a hash join, it requires the join keys to be sorted before joining. In detail:

Sort merge join mainly includes three phases:
- Shuffle Phase: shuffle the two large tables according to the join key
- Sort Phase: sort the data in each partition
- Merge Phase: join the sorted partition data from different tables, and merge the data sets by traversing the elements and connecting the rows with the same join key value
Conditions and characteristics:
- Only equi-joins (‘=’) are supported
- All join types are supported
- Join keys must be sortable
- The parameter spark.sql.join.preferSortMergeJoin (true by default) must be set to true
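The merge phase for a single sorted partition can be sketched in plain Python with a classic two-pointer merge (an illustration of the algorithm, not Spark internals):

```python
def sort_merge_join(left, right, key):
    """Sort both sides by the join key, then merge with two pointers."""
    left = sorted(left, key=lambda r: r[key])    # Sort phase
    right = sorted(right, key=lambda r: r[key])

    joined, i, j = [], 0, 0
    # Merge phase: advance whichever side has the smaller key;
    # on a match, emit a row for every right-side row sharing that key.
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            k = j
            while k < len(right) and right[k][key] == left[i][key]:
                joined.append({**left[i], **right[k]})
                k += 1
            i += 1
    return joined
```

Since each side is consumed in a single forward pass after sorting, neither table has to be held in memory as a hash table, which is what makes this strategy scale to two large tables.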

Notes :
- If the table size is < 10 MB → broadcast join is used
- If the table size is > 10 MB → sort merge join is used, which has been the default join since Spark 2.3
- Default threshold value: 10485760 (10 MB)
- If you set spark.sql.autoBroadcastJoinThreshold to -1, broadcast join is disabled
- Spark broadcast joins are perfect for joining a large DataFrame with a small DataFrame. Broadcast joins cannot be used when joining two large DataFrames.
- If both tables are huge, broadcast hash join is not a good strategy, because a huge table cannot fit on a single machine and must be distributed across partitions on multiple machines.
- In broadcast hash join, a full copy of the small DataFrame is sent to each executor. If that small DataFrame does not fit into an executor’s RAM, it will lead to OOM errors.
- If broadcast hash join throws OOM errors, we can increase the storage memory of the executors. You cannot increase it for specific executors; the setting increases the memory of all executors.
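In PySpark, the threshold above can be inspected and tuned on a running SparkSession. This is a config sketch only (it assumes a session named `spark` and will not run standalone):

```python
# Config sketch — assumes an existing SparkSession bound to `spark`.

# Inspect the current broadcast threshold (default 10485760 bytes = 10 MB):
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

# Raise the threshold to 50 MB so slightly larger tables are broadcast:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Disable broadcast joins entirely:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```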
How does Spark Select Join Strategy?
Spark selects Join strategy by considering the below:
- Type of Join
- Hint in Join
As shown in the flowchart above, Spark selects the join strategy based on the join type and the hints supplied in the join. Spark 2.x supports only the Broadcast hint, whereas Spark 3.x supports all the join hints mentioned in the flowchart.

Taken directly from the Spark code, let’s see how Spark decides on a join strategy.
If it is an ‘=’ join:
Look at the join hints, in the following order:
1. Broadcast hint: pick broadcast hash join if the join type is supported.
2. Sort merge hint: pick sort merge join if the join keys are sortable.
3. Shuffle hash hint: pick shuffle hash join if the join type is supported.
4. Shuffle replicate NL hint: pick cartesian product if the join type is inner-like.
If there is no hint or the hints are not applicable:
1. Pick broadcast hash join if one side is small enough to broadcast and the join type is supported.
2. Pick shuffle hash join if one side is small enough to build the local hash map, is much smaller than the other side, and spark.sql.join.preferSortMergeJoin is false.
3. Pick sort merge join if the join keys are sortable.
4. Pick cartesian product if the join type is inner.
5. Pick broadcast nested loop join as the final solution. It may OOM, but there is no other choice.
If it’s not an ‘=’ join:
Look at the join hints, in the following order:
1. Broadcast hint: pick broadcast nested loop join.
2. Shuffle replicate NL hint: pick cartesian product if the join type is inner-like.
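The whole selection order can be summarized as a small Python function (a simplified sketch that mirrors, rather than copies, Spark's JoinSelection logic; the function name, parameter names, and the final non-equi fallback to broadcast nested loop join are our own illustration):

```python
def choose_strategy(equi_join, hint, can_broadcast, can_build_local_hash_map,
                    prefer_sort_merge_join, keys_sortable, inner_like):
    """Return the join strategy Spark would pick, per the order above."""
    if equi_join:
        # Hints are consulted first, in a fixed order.
        if hint == "broadcast":
            return "broadcast hash join"
        if hint == "merge" and keys_sortable:
            return "sort merge join"
        if hint == "shuffle_hash":
            return "shuffle hash join"
        if hint == "shuffle_replicate_nl" and inner_like:
            return "cartesian product"
        # No applicable hint: fall through in the documented order.
        if can_broadcast:
            return "broadcast hash join"
        if can_build_local_hash_map and not prefer_sort_merge_join:
            return "shuffle hash join"
        if keys_sortable:
            return "sort merge join"
        if inner_like:
            return "cartesian product"
        return "broadcast nested loop join"
    # Non-equi join: only nested-loop-style strategies are possible.
    if hint == "broadcast":
        return "broadcast nested loop join"
    if hint == "shuffle_replicate_nl" and inner_like:
        return "cartesian product"
    return "broadcast nested loop join"
```

For example, an un-hinted equi-join with one small side yields "broadcast hash join", while the same join with preferSortMergeJoin disabled and a hashable small side yields "shuffle hash join".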
