Spark: cache function
The `cache()` function in Spark is used to persist a DataFrame or Dataset in memory (or on disk, depending on the storage level) for faster access in subsequent operations. This is particularly useful when you need to reuse a DataFrame multiple times in a workflow, as it avoids recomputing the entire lineage of transformations.
1. Syntax
PySpark:
Spark SQL:
- There is no direct equivalent in Spark SQL, but you can cache a table using:
2. Key Features
- In-Memory Storage: By default, `cache()` stores the DataFrame in memory (for DataFrames and Datasets the default storage level is MEMORY_AND_DISK, so partitions that do not fit in memory spill to disk).
- Lazy Evaluation: The DataFrame is not cached immediately. It is cached when an action (e.g., `count()`, `show()`) is triggered.
- Storage Levels: You can specify different storage levels (e.g., memory-only, memory-and-disk) using `persist()`.
3. Storage Levels
Spark provides several storage levels for caching:
- `MEMORY_ONLY`: Stores data in memory only (default for `RDD.cache()`).
- `MEMORY_AND_DISK`: Stores data in memory, but spills to disk if memory is insufficient (default for `DataFrame.cache()`).
- `MEMORY_ONLY_SER`: Stores serialized data in memory (Java/Scala only; Python data is always serialized).
- `MEMORY_AND_DISK_SER`: Stores serialized data in memory and spills to disk if needed (Java/Scala only).
- `DISK_ONLY`: Stores data on disk only.
To use a specific storage level, use the `persist()` function instead of `cache()`:
4. Examples
Example 1: Caching a DataFrame
PySpark:
Spark SQL:
- The DataFrame is cached in memory, and subsequent operations on `df` will be faster.
Example 2: Checking if a DataFrame is Cached
PySpark:
Example 3: Unpersisting a DataFrame
PySpark:
Example 4: Using `persist()` with a Specific Storage Level
PySpark:
Example 5: Caching a Filtered DataFrame
PySpark:
Example 6: Caching a Joined DataFrame
PySpark:
Example 7: Caching a Large DataFrame with Disk Spill
PySpark:
5. Common Use Cases
- Reusing intermediate results in multi-step workflows.
- Speeding up iterative algorithms (e.g., training machine learning models).
- Avoiding recomputation in complex transformations.
6. Performance Considerations
- Memory Usage: Caching consumes memory. Ensure you have sufficient memory for large datasets.
- Disk Spill: If memory is insufficient, data may spill to disk, which is slower.
- Unpersist: Always unpersist DataFrames when they are no longer needed to free up resources.
7. Key Takeaways
- Purpose: The `cache()` function is used to persist a DataFrame or Dataset in memory for faster access.
- Lazy Evaluation: Caching is triggered only when an action is performed.
- Storage Levels: Use `persist()` to specify custom storage levels (e.g., memory-only, memory-and-disk).
- Performance: Caching can significantly improve performance but consumes memory. Use it judiciously.