Apache Spark Core Concepts
Resilient Distributed Dataset (RDD): The fundamental data structure of Spark; an immutable, fault-tolerant collection of elements partitioned across the cluster and processed in parallel.
DataFrame: A distributed collection of data organized into named columns, similar to a relational table.
Dataset: A type-safe, object-oriented programming interface for Spark, available in Scala and Java (see the Dataset sketch below).
SparkSession: The entry point to programming with Spark; used to create DataFrames and Datasets and to configure the application.
Driver Program: The process that runs the application's main() function, builds the execution plan, and schedules tasks on executors.
Executor: Worker-node processes that run tasks and keep data in memory or on disk.
Cluster Manager: Manages resources across the cluster (e.g., YARN, Mesos, Kubernetes).
Task: The smallest unit of work, sent to an executor; one task processes one partition.
Job: A parallel computation triggered by an action, consisting of one or more stages of tasks.
Stage: A set of tasks that can be executed together without moving data; stage boundaries are created at shuffles.
DAG (Directed Acyclic Graph): The logical execution plan of a Spark job, built from the chain of transformations.
Shuffle: Redistributes data across partitions (e.g., for joins or aggregations by key); often expensive because it involves disk and network I/O (see the shuffle sketch below).
Partition: A chunk of data split across the cluster for parallel processing; one task works on one partition.
Lazy Evaluation: Spark delays computation until an action is called, which lets it optimize the whole DAG before running anything.
Transformation: An operation that produces a new RDD/DataFrame without triggering execution (e.g., map, filter).
Action: An operation that triggers computation and returns results to the driver or writes them out (e.g., count, collect); illustrated in the sketches below.
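
The following is a minimal Scala sketch of how several of these pieces fit together: a SparkSession as the entry point, a DataFrame of named columns, lazy transformations that only extend the DAG, and an action that triggers a job. The application name, local master, and sample data are assumptions made up for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession: the single entry point for DataFrame and Dataset programming.
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]") // local mode for illustration; a cluster manager (YARN, Kubernetes) would normally allocate resources
      .getOrCreate()
    import spark.implicits._

    // DataFrame: a distributed collection organized into named columns.
    val sales = Seq(("books", 12.0), ("games", 30.0), ("books", 8.5))
      .toDF("category", "amount")

    // Transformations (filter, groupBy, agg) are lazy: they only build up the DAG.
    val perCategory = sales
      .filter($"amount" > 10.0)
      .groupBy($"category")
      .agg(sum($"amount").as("total"))

    // An action (collect) triggers the job: Spark splits the DAG into stages and tasks.
    perCategory.collect().foreach(println)

    spark.stop()
  }
}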
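
A second sketch, under the same assumptions, shows the type-safe Dataset API; the Order case class and its fields are hypothetical and exist only to illustrate a compile-time schema.

import org.apache.spark.sql.{Dataset, SparkSession}

// A case class gives the Dataset a compile-time schema.
case class Order(id: Long, category: String, amount: Double)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Dataset[Order]: like a DataFrame, but typed, so field names and types are checked at compile time.
    val orders: Dataset[Order] = Seq(
      Order(1L, "books", 12.0),
      Order(2L, "games", 30.0)
    ).toDS()

    // The lambda operates on Order objects rather than untyped Rows.
    val large = orders.filter(order => order.amount > 20.0)

    large.show() // show() is an action and triggers execution
    spark.stop()
  }
}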
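
A final sketch illustrates partitions, a narrow transformation, and a shuffle at the RDD level; the partition count, key function, and data range are chosen only for illustration.

import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-sketch")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD split into 4 partitions; each partition is processed by its own task.
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)
    println(s"partitions: ${numbers.getNumPartitions}")

    // map is a narrow transformation: no data moves between partitions.
    val keyed = numbers.map(n => (n % 10, n))

    // reduceByKey needs a shuffle: records with the same key are redistributed
    // across partitions, which marks a stage boundary in the DAG.
    val sums = keyed.reduceByKey(_ + _)

    // count() is an action: it triggers one job with two stages separated by the shuffle.
    println(s"distinct keys: ${sums.count()}")

    spark.stop()
  }
}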