The `cache()` function in Spark is used to persist a DataFrame or Dataset in memory (or on disk, depending on the storage level) for faster access in subsequent operations. This is particularly useful when you need to reuse a DataFrame multiple times in a workflow, as it avoids recomputing the entire lineage of transformations.
`cache()` stores the DataFrame in memory. Caching is lazy: nothing is actually stored until an action (e.g., `count()`, `show()`) is triggered. To choose a different storage level, use `persist()`.
The available storage levels include:

- `MEMORY_ONLY`: stores data in memory only (the default for `cache()` on RDDs; note that `cache()` on a DataFrame or Dataset defaults to `MEMORY_AND_DISK`).
- `MEMORY_AND_DISK`: stores data in memory, but spills to disk if memory is insufficient.
- `MEMORY_ONLY_SER`: stores serialized data in memory.
- `MEMORY_AND_DISK_SER`: stores serialized data in memory and spills to disk if needed.
- `DISK_ONLY`: stores data on disk only.

Use the `persist()` function instead of `cache()` when you need one of these custom storage levels:
Subsequent actions on `df` will then be faster, since Spark reads the persisted data instead of recomputing the full lineage.

In summary, the `cache()` function is used to persist a DataFrame or Dataset in memory for faster access, while calling `persist()` with a specific storage level lets you control where the data lives (e.g., memory-only or memory-and-disk).