A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. Operations available on Datasets are divided into transformations and actions. Transformations produce new Datasets; actions trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, and writing data out to file systems.

Datasets are "lazy": computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner.
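A minimal sketch of the transformation/action split and lazy evaluation (the SparkSession setup and the toy numeric data are illustrative assumptions, not from the original):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object LazinessDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession purely for demonstration.
    val spark = SparkSession.builder()
      .appName("laziness-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Transformations only build up a logical plan; nothing executes here.
    val ds: Dataset[Long] = spark.range(1, 1000).as[Long]
      .map(_ * 2)          // transformation
      .filter(_ % 3 == 0)  // transformation

    // An action triggers optimization of the plan and actual execution.
    val n = ds.count()
    println(s"count = $n")

    spark.stop()
  }
}
```

Until `count()` runs, no data is read and no work is distributed; the two transformations merely extend the logical plan.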
Q: How do I create a Dataset from my own types?
A: In Scala, define a case class (the analogue of a Python namedtuple) representing your data structure. Spark can then infer the schema from the case class.

Q: Can Datasets be used with Spark SQL?
A: Yes. A Dataset can be registered as a temporary view and queried with SQL; the query result comes back as a DataFrame (that is, a Dataset of Row).
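A short sketch of both points — schema inference from a case class, and querying the Dataset through Spark SQL (the `Person` class, sample rows, and view name are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative case class; Spark infers the schema (name: String, age: Int) from it.
case class Person(name: String, age: Int)

object DatasetSqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sql-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The schema is inferred from the case class fields.
    val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()

    // Datasets interoperate with Spark SQL: register a view, then query it.
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name FROM people WHERE age > 40")
    adults.show()

    spark.stop()
  }
}
```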
Q: Can Datasets be used with PySpark?
A: No. The typed Dataset API exists only in Scala and Java; because Python is dynamically typed, PySpark exposes only the DataFrame API.
Q: Are Datasets faster than DataFrames?
A: Not inherently. A DataFrame is simply a Dataset of Row, and both run through the same query optimizer and execution engine. Typed transformations that take arbitrary lambdas (such as map) can even be slower than the equivalent relational operations, because the optimizer cannot see inside the lambda.
Q: Can I convert between Dataset and DataFrame?
A: Yes. Call toDF() on a Dataset to obtain its untyped DataFrame view, and as[T] on a DataFrame (given an Encoder for T) to recover a typed Dataset.
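A hedged sketch of round-tripping between the two views (the `Point` case class and sample data are illustrative assumptions):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Illustrative case class used as the Dataset's element type.
case class Point(x: Double, y: Double)

object ConvertDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("convert-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Point] = Seq(Point(1.0, 2.0), Point(3.0, 4.0)).toDS()

    // Dataset -> DataFrame: drop the static type, keep the data and schema.
    val df: DataFrame = ds.toDF()

    // DataFrame -> Dataset: supply the target type; the Encoder[Point]
    // comes from spark.implicits._ and column names must match the fields.
    val back: Dataset[Point] = df.as[Point]
    back.show()

    spark.stop()
  }
}
```

Both directions are cheap: they change the element type seen by the compiler, not the underlying data or plan.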
Q: Are the operations the same as for DataFrames?
A: Largely, yes. Because a DataFrame is a Dataset of Row, the untyped relational operations (select, groupBy, join, and so on) are shared; Datasets additionally provide typed transformations such as map, filter, and flatMap over the domain objects themselves.