map, filter, join) to an RDD, DataFrame, or Dataset, Spark doesn’t perform the computation right away. Instead, it records the operation in a directed acyclic graph (DAG) of transformations, also known as the lineage graph, which tracks the operations that still need to be performed.
Only when you call an action (e.g., collect, count, saveAsTextFile), which returns a result to the driver program or writes output to external storage, does Spark execute the transformations. At this point, Spark turns the DAG into an execution plan, pipelining consecutive narrow transformations such as map and filter into a single stage.
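The record-now, run-later behavior described above can be sketched in plain Python. This is a toy model, not Spark's implementation: the LazyDataset class, its lineage tuple, and the double helper are all hypothetical names introduced only for illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # ordered tuple of (kind, function) steps

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._data, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._data, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: only now replay the recorded steps over the data.
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []

def double(x):
    calls.append(x)  # side effect so we can observe when work happens
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = ds.collect()             # the action triggers execution
print(calls_before_action, result)  # prints: 0 [6, 8]
```

Building the pipeline costs almost nothing here because each transformation only appends to the recorded lineage; the work happens inside collect.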
When an action needs only part of the result (e.g., take(10)), Spark computes only the necessary parts of the data rather than the full dataset, which can save significant resources.
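The saving from a partial action can be illustrated with Python's own lazy iterators. This is only an analogy under stated assumptions: the take helper and the expensive function are hypothetical, and Spark works at partition granularity rather than element by element.

```python
from itertools import islice

evaluated = []

def expensive(x):
    evaluated.append(x)  # record each element that is actually computed
    return x * x

def take(iterable, n):
    # Hypothetical helper mimicking take(n): stop after n results.
    return list(islice(iterable, n))

# A lazy pipeline over one million inputs; map() returns an iterator in Python 3.
squares = map(expensive, range(1_000_000))
first_ten = take(squares, 10)
print(first_ten, len(evaluated))
# prints: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81] 10
```

Only ten of the one million inputs are ever passed to expensive, because the consumer stops pulling values once it has what it asked for.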
map and filter are transformations. They don’t execute until collect (an action) is called. Spark then pipelines map and filter so that both are applied in a single pass over the data.
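To make the single-pass idea concrete, the following sketch chains Python's built-in lazy map and filter and logs the order in which the two functions run. It is an analogy rather than Spark code: add_one, is_even, and the trace list are assumptions made for the example, and list() plays the role of the action.

```python
trace = []

def add_one(x):
    trace.append(f"map({x})")
    return x + 1

def is_even(x):
    trace.append(f"filter({x})")
    return x % 2 == 0

# Both steps are lazy iterators; nothing runs until list() consumes them.
pipeline = filter(is_even, map(add_one, [1, 2, 3]))
result = list(pipeline)
print(result)  # prints: [2, 4]
print(trace)
# prints: ['map(1)', 'filter(2)', 'map(2)', 'filter(3)', 'map(3)', 'filter(4)']
```

The trace alternates between the two functions, so each element flows through both steps before the next element is read. No intermediate collection of mapped values is ever built, which is the effect pipelining has within a Spark stage.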