Resilient Distributed Dataset (RDD): Spark's fundamental data abstraction; an immutable, fault-tolerant collection of elements partitioned across the cluster.
DataFrame: Distributed collection of data organized into named columns.
Dataset: A typed, object-oriented programming interface that combines the type safety of RDDs with the optimizations of DataFrames (available in Scala and Java).
SparkSession: The unified entry point to programming with Spark across the DataFrame, Dataset, and SQL APIs (see the first sketch after this list).
Driver Program: The process running the main() function of the application.
Executor: Worker node processes that run tasks.
Cluster Manager: Manages resources across the cluster (e.g., YARN, Mesos, Kubernetes).
Task: A unit of work sent to an executor.
Job: A parallel computation consisting of multiple tasks, spawned in response to a Spark action (e.g., count, save).
Stage: A set of tasks that can be executed together without data movement; a job is split into stages at shuffle boundaries.
DAG (Directed Acyclic Graph): Logical execution plan of a Spark job.
Shuffle: Redistributes data across partitions, often expensive in terms of performance.
Partition: A chunk of a distributed dataset; data is split into partitions so it can be processed in parallel across the cluster (see the third sketch after this list).
Lazy Evaluation: Spark delays computation until an action is called.
Transformation: Operations that produce a new RDD/DataFrame (e.g., map, filter).
Action: Operations that trigger computation and return results (e.g., count, collect); see the second sketch after this list.
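
The first sketch below is a minimal illustration of the SparkSession entry point and the three data abstractions (RDD, DataFrame, Dataset). It assumes a standalone Scala application run in local mode; the application name, the Person case class, and the sample rows are hypothetical.

    import org.apache.spark.sql.SparkSession

    object CoreAbstractionsSketch {
      // Hypothetical record type used for the typed Dataset example.
      case class Person(name: String, age: Int)

      def main(args: Array[String]): Unit = {
        // SparkSession: the entry point; local[*] uses all local cores.
        val spark = SparkSession.builder()
          .appName("core-abstractions-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // RDD: low-level, immutable, partitioned collection of objects.
        val rdd = spark.sparkContext.parallelize(Seq(("alice", 34), ("bob", 45)))

        // DataFrame: the same data organized into named columns (a Dataset[Row]).
        val df = rdd.toDF("name", "age")
        df.printSchema()

        // Dataset: the same data as typed objects, checked at compile time.
        val people = df.as[Person]
        people.filter(_.age > 40).show()

        spark.stop()
      }
    }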
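
The second sketch contrasts lazy transformations with actions that trigger a job. It assumes the same local-mode setup; the numeric range is arbitrary.

    import org.apache.spark.sql.SparkSession

    object LazyEvaluationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("lazy-evaluation-sketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        val numbers = sc.parallelize(1 to 1000000)

        // Transformations only record lineage in the DAG; nothing executes here.
        val evens   = numbers.filter(_ % 2 == 0)
        val squares = evens.map(n => n.toLong * n)

        // Actions trigger a job, which Spark breaks into tasks on the executors.
        val total = squares.count()   // returns the element count to the driver
        val first = squares.take(5)   // returns a small sample to the driver

        println(s"count = $total, first = ${first.mkString(", ")}")
        spark.stop()
      }
    }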
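
The third sketch shows partitioning and the shuffle caused by a wide transformation (reduceByKey), again in local mode with made-up key-value pairs. toDebugString prints the RDD lineage, where the indentation marks the stage boundary introduced by the shuffle.

    import org.apache.spark.sql.SparkSession

    object ShufflePartitionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-sketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Spread the pairs over 4 partitions for parallel processing.
        val pairs = sc.parallelize(
          Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4), numSlices = 4)
        println(s"partitions: ${pairs.getNumPartitions}")

        // reduceByKey is a wide transformation: it shuffles records so that all
        // values for a key land in the same partition before being summed.
        val sums = pairs.reduceByKey(_ + _)

        // The lineage shows two stages separated by the shuffle boundary.
        println(sums.toDebugString)
        println(sums.collect().mkString(", "))

        spark.stop()
      }
    }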