The SparkSession.builder.getOrCreate() method is used to create a SparkSession object, which is the entry point to programming with Spark. A SparkSession provides a unified interface to work with structured data (DataFrames and Datasets) and allows you to configure various Spark properties. If a SparkSession already exists, getOrCreate() returns the existing one; otherwise, it creates a new one.


1. Syntax

PySpark:

spark = SparkSession.builder \
                    .appName("AppName") \
                    .config("key", "value") \
                    .getOrCreate()

2. Key Features

  • Entry Point: The SparkSession is the entry point to Spark functionality.
  • Singleton: getOrCreate() returns the active session if one already exists, so an application normally holds a single SparkSession per JVM (or Python process); additional isolated sessions can be created explicitly with newSession().
  • Configuration: Allows you to configure Spark properties (e.g., app name, master URL, memory settings).
  • Unified API: Provides access to DataFrames, Datasets, SQL, and streaming.

3. Parameters

  • appName: Sets the name of the Spark application.
  • config: Allows you to set Spark configuration properties (e.g., spark.executor.memory, spark.sql.shuffle.partitions).
  • master: Specifies the master URL (e.g., local, yarn, k8s).

4. Examples

Example 1: Creating a Basic SparkSession

PySpark:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
                    .appName("MyApp") \
                    .getOrCreate()

# Print the SparkSession object
print(spark)

Output:

<pyspark.sql.session.SparkSession object at 0x7f8b1c2b3d90>

Example 2: Configuring Spark Properties

PySpark:

# Create a SparkSession with custom configurations
spark = SparkSession.builder \
                    .appName("MyApp") \
                    .config("spark.executor.memory", "2g") \
                    .config("spark.sql.shuffle.partitions", "100") \
                    .getOrCreate()

# Print the SparkSession object
print(spark)

Output:

<pyspark.sql.session.SparkSession object at 0x7f8b1c2b3d90>

Example 3: Using an Existing SparkSession

PySpark:

# Create a SparkSession
spark1 = SparkSession.builder \
                     .appName("MyApp") \
                     .getOrCreate()

# Try to create another SparkSession; because an active session already
# exists, getOrCreate() returns it and the new appName has no effect
spark2 = SparkSession.builder \
                     .appName("AnotherApp") \
                     .getOrCreate()

# Both names refer to the same SparkSession object, so an identity
# check is the idiomatic test
print(spark1 is spark2)  # Output: True

Output:

True

Example 4: Setting the Master URL

PySpark:

# Create a SparkSession with a local master URL
spark = SparkSession.builder \
                    .appName("MyApp") \
                    .master("local[*]") \
                    .getOrCreate()

# Print the SparkSession object
print(spark)

Output:

<pyspark.sql.session.SparkSession object at 0x7f8b1c2b3d90>

Example 5: Creating a SparkSession with Hive Support

PySpark:

# Create a SparkSession with Hive support
spark = SparkSession.builder \
                    .appName("MyApp") \
                    .enableHiveSupport() \
                    .getOrCreate()

# Print the SparkSession object
print(spark)

Output:

<pyspark.sql.session.SparkSession object at 0x7f8b1c2b3d90>

Example 6: Running on a YARN Cluster

PySpark:

# Create a SparkSession that submits to a YARN cluster
# (requires a Hadoop/YARN environment, e.g. HADOOP_CONF_DIR set)
spark = SparkSession.builder \
                    .appName("MyApp") \
                    .master("yarn") \
                    .getOrCreate()

# Print the SparkSession object
print(spark)

Output:

<pyspark.sql.session.SparkSession object at 0x7f8b1c2b3d90>

Example 7: Creating a SparkSession with Custom Logging

PySpark:

# Create a SparkSession with custom logging
spark = SparkSession.builder \
                    .appName("MyApp") \
                    .config("spark.logConf", "true") \
                    .getOrCreate()

# Print the SparkSession object
print(spark)

Output:

<pyspark.sql.session.SparkSession object at 0x7f8b1c2b3d90>

5. Common Use Cases

  • Creating a SparkSession to work with structured data (DataFrames and Datasets).
  • Configuring Spark properties for specific applications (e.g., memory, parallelism).
  • Ensuring a single SparkSession instance in an application.

6. Performance Considerations

  • Configuration: Properly tuning Spark properties (e.g., memory, parallelism) is essential for performance.
  • Singleton: Reusing an existing SparkSession avoids the cost of starting a new SparkContext.
  • Static configs: Static properties (e.g., spark.executor.memory) take effect only when the SparkContext is first created; passing them to getOrCreate() against an existing session has no effect.

7. Key Takeaways

  1. Purpose: The SparkSession.builder.getOrCreate() method is used to create or retrieve a SparkSession object.
  2. Singleton: getOrCreate() returns the active session when one exists, so applications typically hold a single SparkSession per JVM (or Python process).
  3. Configuration: Allows you to configure Spark properties (e.g., app name, master URL, memory settings).
  4. Performance: Retrieving an existing SparkSession is cheap, but creating the first one starts a SparkContext, which is relatively expensive; create it once, configure it up front, and reuse it.