Spark Session Builder
The SparkSession.builder.getOrCreate() method is used to create a SparkSession object, which is the entry point to programming with Spark. A SparkSession provides a unified interface to work with structured data (DataFrames and Datasets) and allows you to configure various Spark properties. If a SparkSession already exists, getOrCreate() returns the existing one; otherwise, it creates a new one.
1. Syntax
PySpark:
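A minimal sketch of the builder chain, assuming PySpark is installed. The function name create_session, the app name "ExampleApp", and the master "local[*]" are placeholders chosen for illustration, not values from this reference.

```python
def create_session(app_name="ExampleApp", master="local[*]"):
    """Return the active SparkSession, creating one if none exists."""
    # Imported lazily so this module can be loaded without starting Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)  # application name (placeholder default)
        .master(master)     # master URL; local[*] runs on all local cores
        .getOrCreate()      # reuse an existing session or create a new one
    )
```

Calling create_session() should return a SparkSession; calling it a second time in the same process is expected to hand back the session that already exists.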
2. Key Features
- Entry Point: The SparkSession is the entry point to Spark functionality.
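- Singleton: getOrCreate() reuses the SparkSession that already exists in the JVM (or Python process) instead of creating a second one. Additional sessions that share the same SparkContext can still be created explicitly with newSession().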
- Configuration: Allows you to configure Spark properties (e.g., app name, master URL, memory settings).
- Unified API: Provides access to DataFrames, Datasets, SQL, and streaming.
3. Parameters
- appName: Sets the name of the Spark application.
- config: Allows you to set Spark configuration properties (e.g., spark.executor.memory, spark.sql.shuffle.partitions).
- master: Specifies the master URL (e.g., local, yarn, k8s).
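The three builder settings above can be combined as in the following sketch. It assumes PySpark is installed; the helper apply_settings, the SETTINGS dictionary, and the specific values ("2g", "64", "local[2]") are hypothetical and chosen only for illustration.

```python
# Hypothetical property values; tune them for your own workload.
SETTINGS = {
    "spark.executor.memory": "2g",
    "spark.sql.shuffle.partitions": "64",
}


def apply_settings(builder, settings):
    """Apply each key/value pair through the builder's config() method."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


def build_configured_session():
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("ConfiguredApp").master("local[2]")
    return apply_settings(builder, SETTINGS).getOrCreate()
```

Keeping the properties in a dictionary makes it easy to load them from a file or environment later, at the cost of one small helper.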
4. Examples
Example 1: Creating a Basic SparkSession
PySpark:
Output:
Example 2: Configuring Spark Properties
PySpark:
Output:
Example 3: Using an Existing SparkSession
PySpark:
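One way to observe the reuse behavior is to call the builder twice and compare the results. This is a sketch assuming classic (non-Connect) PySpark; the function name and the app names "FirstApp" and "SecondApp" are placeholders.

```python
def same_session_twice():
    """Build twice and report whether both calls returned the same object."""
    # Imported lazily so this module can be loaded without starting Spark.
    from pyspark.sql import SparkSession

    first = (
        SparkSession.builder
        .appName("FirstApp")
        .master("local[1]")
        .getOrCreate()
    )
    # No session is created here if one already exists in this process.
    second = SparkSession.builder.appName("SecondApp").getOrCreate()
    return first is second
```

Under the assumption above, same_session_twice() is expected to return True, since the second call finds the session created by the first.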
Output:
Example 4: Setting the Master URL
PySpark:
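A sketch of setting the master URL for local mode, assuming PySpark is installed. The helper local_master and the app name "MasterUrlExample" are hypothetical; for a cluster you would pass a value such as yarn instead of a local[...] string.

```python
def local_master(threads=None):
    """Build a local-mode master URL.

    None means "use every available core" (local[*]).
    """
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return f"local[{threads}]"


def create_local_session(threads=None):
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("MasterUrlExample")    # placeholder name
        .master(local_master(threads))  # e.g. local[4] or local[*]
        .getOrCreate()
    )
```

For example, create_local_session(4) requests a local session with four worker threads.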
Output:
Example 5: Creating a SparkSession with Hive Support
PySpark:
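A minimal sketch of enabling Hive support through the builder, assuming PySpark is installed and the Hive classes are available on the classpath. The function name and the app name "HiveExample" are placeholders.

```python
def create_hive_session(app_name="HiveExample"):
    """Return a SparkSession with Hive support requested on the builder."""
    # Imported lazily so this module can be loaded without starting Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # must be called before getOrCreate()
        .getOrCreate()
    )
```

If a session without Hive support already exists in the process, getOrCreate() returns that session, so this call is best made before any other code creates one.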
Output:
Example 6: Creating a SparkSession with a Specific Master URL
PySpark:
Output:
Example 7: Creating a SparkSession with Custom Logging
PySpark:
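One common way to customize logging is to set the log level on the session's SparkContext after it is created. The sketch below assumes PySpark is installed; normalize_log_level, create_quiet_session, and the app name "LoggingExample" are hypothetical names introduced for illustration.

```python
# Log levels accepted by SparkContext.setLogLevel.
VALID_LOG_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def normalize_log_level(level):
    """Upper-case a level name and reject anything Spark would not accept."""
    level = level.strip().upper()
    if level not in VALID_LOG_LEVELS:
        raise ValueError(f"unsupported log level: {level}")
    return level


def create_quiet_session(level="WARN"):
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("LoggingExample")  # placeholder name
        .master("local[*]")
        .getOrCreate()
    )
    # Applies to log output produced after this call.
    spark.sparkContext.setLogLevel(normalize_log_level(level))
    return spark
```

Validating the level first gives a clear Python error for a typo instead of a failure raised from the JVM side.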
Output:
5. Common Use Cases
- Creating a SparkSession to work with structured data (DataFrames and Datasets).
- Configuring Spark properties for specific applications (e.g., memory, parallelism).
- Ensuring a single SparkSession instance in an application.
6. Performance Considerations
- Configuration: Properly configuring Spark properties (e.g., memory, parallelism) is essential for performance.
- Singleton: Reusing an existing SparkSession avoids the overhead of creating a new one.
7. Key Takeaways
- Purpose: The SparkSession.builder.getOrCreate() method is used to create or retrieve a SparkSession object.
- Singleton: getOrCreate() reuses the SparkSession that already exists in the JVM (or Python process) instead of creating a second one.
- Configuration: Allows you to configure Spark properties (e.g., app name, master URL, memory settings).
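- Performance: The first getOrCreate() call is not free, because it starts the underlying SparkContext; later calls reuse the existing session and are cheap. Configuring the session properly remains essential for performance.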