The cache() function in Spark is used to persist a DataFrame or Dataset in memory (or on disk, depending on the storage level) for faster access in subsequent operations. This is particularly useful when you need to reuse a DataFrame multiple times in a workflow, as it avoids recomputing the entire lineage of transformations.

1. Syntax

PySpark:

df.cache()

Spark SQL:

  • The Spark SQL equivalent caches a table or view:
    CACHE TABLE table_name;
    Note that CACHE TABLE is eager by default; use CACHE LAZY TABLE to defer materialization until the first use.
    

2. Key Features

  • Default Storage: For DataFrames and Datasets, cache() uses the MEMORY_AND_DISK level, keeping data in memory and spilling to disk when memory runs out (RDD.cache() uses MEMORY_ONLY).
  • Lazy Evaluation: The DataFrame is not cached immediately; it is materialized into the cache when the first action (e.g., count(), show()) runs.
  • Storage Levels: You can choose a different storage level (e.g., memory-only, disk-only) using persist().

3. Storage Levels

Spark provides several storage levels for caching:

  • MEMORY_ONLY: Stores data in memory only (default for RDD.cache()).
  • MEMORY_AND_DISK: Stores data in memory, but spills to disk if memory is insufficient (default for DataFrame and Dataset cache()).
  • MEMORY_ONLY_SER: Stores serialized data in memory (Scala/Java API; PySpark always stores data in serialized form).
  • MEMORY_AND_DISK_SER: Stores serialized data in memory and spills to disk if needed (Scala/Java API).
  • DISK_ONLY: Stores data on disk only.

To use a specific storage level, use the persist() function instead of cache():

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)

4. Examples

Example 1: Caching a DataFrame

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheExample").getOrCreate()

# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Cache the DataFrame
df.cache()

# Perform an action to trigger caching
df.count()

Spark SQL:

-- Assumes df was registered with df.createOrReplaceTempView("people")
CACHE TABLE people;

Output:

  • The DataFrame is cached in memory, and subsequent operations on df will be faster.

Example 2: Checking if a DataFrame is Cached

PySpark:

# Check if the DataFrame is cached
print(df.is_cached)  # Output: True

Example 3: Unpersisting a DataFrame

PySpark:

# Unpersist (remove from cache)
df.unpersist()

# Check if the DataFrame is cached
print(df.is_cached)  # Output: False

Example 4: Using persist() with a Specific Storage Level

PySpark:

from pyspark import StorageLevel

# Persist the DataFrame in memory and on disk
df.persist(StorageLevel.MEMORY_AND_DISK)

# Perform an action to trigger persistence
df.count()

Example 5: Caching a Filtered DataFrame

PySpark:

# Filter and cache the DataFrame
filtered_df = df.filter("Age > 30")
filtered_df.cache()

# Perform an action to trigger caching
filtered_df.count()

Example 6: Caching a Joined DataFrame

PySpark:

# Create employee and department DataFrames
employees_data = [("Anand", 101), ("Bala", 102), ("Kavitha", 103), ("Raj", 101)]
employees_df = spark.createDataFrame(employees_data, ["Name", "DeptID"])

departments_data = [(101, "Sales"), (102, "HR"), (103, "Finance")]
departments_df = spark.createDataFrame(departments_data, ["DeptID", "DeptName"])

# Join on the shared DeptID key and cache the result
joined_df = employees_df.join(departments_df, on="DeptID")
joined_df.cache()

# Perform an action to trigger caching
joined_df.count()

Example 7: Caching a Large DataFrame with Disk Spill

PySpark:

from pyspark import StorageLevel

# Persist a large DataFrame with disk spill
large_df = spark.range(1000000)  # Create a large DataFrame
large_df.persist(StorageLevel.MEMORY_AND_DISK)

# Perform an action to trigger persistence
large_df.count()

5. Common Use Cases

  • Reusing intermediate results in multi-step workflows.
  • Speeding up iterative algorithms (e.g., training machine learning models).
  • Avoiding recomputation in complex transformations.

6. Performance Considerations

  • Memory Usage: Caching consumes executor memory. Make sure there is enough room for the dataset; the Storage tab of the Spark UI shows what is actually cached and how much space it takes.
  • Disk Spill: If memory is insufficient, data may spill to disk (with MEMORY_AND_DISK levels), which is slower than memory but still avoids recomputation.
  • Unpersist: Always unpersist DataFrames when they are no longer needed to free up resources.

7. Key Takeaways

  1. Purpose: cache() persists a DataFrame or Dataset so later actions can reuse it instead of recomputing its lineage of transformations.
  2. Lazy Evaluation: Caching is triggered only when the first action is performed.
  3. Storage Levels: Use persist() to specify a custom storage level (e.g., MEMORY_ONLY, MEMORY_AND_DISK).
  4. Performance: Caching can significantly improve performance but consumes memory. Use it judiciously, and unpersist when done.