Spark: union function
The union() method in Spark combines two DataFrames that have the same number of columns, with compatible data types in each position, into a single DataFrame. It appends the rows of one DataFrame to the other, like the SQL UNION ALL operation, so duplicate rows are kept. Columns are matched by position rather than by name, and the result takes its column names from the first DataFrame. To remove duplicates, call distinct() after the union.
1. Syntax
PySpark:
Spark SQL:
2. Parameters
- df2: The DataFrame to union with. It must have the same number of columns as df1, with compatible data types in each position.
3. Return Type
- Returns a new DataFrame containing all rows from both DataFrames.
4. Examples
Example 1: Basic Union of Two DataFrames
PySpark:
Spark SQL:
Output:
Example 2: Union with Duplicates
PySpark:
Spark SQL:
Output:
Example 3: Union with Removal of Duplicates
PySpark:
Spark SQL:
Output:
Example 4: Union of DataFrames with Different Column Orders
PySpark:
Spark SQL:
Output:
Example 5: Union of DataFrames with Null Values
PySpark:
Spark SQL:
Output:
Example 6: Union of DataFrames with Different Schemas (Error Case)
PySpark:
Output:
Example 7: Union of DataFrames with Different Column Names (Error Case)
PySpark:
Output:
5. Common Use Cases
- Combining datasets from different time periods (e.g., daily logs, monthly reports).
- Appending new records to an existing dataset.
- Merging datasets from multiple sources with the same schema.
6. Performance Considerations
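- Use union() judiciously on large datasets. It does not shuffle data, but the result keeps the partitions of both inputs, so chaining many unions can leave a large number of small partitions.
- Use distinct() after union() if you need to remove duplicates, but be aware that it triggers a shuffle, which can be expensive on large datasets.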
7. Key Takeaways
- The union() method combines two DataFrames with matching schemas into a single DataFrame. Columns are matched by position.
- It appends rows from one DataFrame to another, similar to SQL UNION ALL.
- In Spark SQL, similar functionality can be achieved using UNION ALL or UNION (to remove duplicates).
- It works efficiently on large datasets because it does not shuffle data.