Apache Spark Core
Introduction to Apache Spark Core
Spark SQL and the DataFrame API
- SparkSession: sql, table
- DataFrame
- Transformations: select, filter, groupBy, join, union, orderBy
- Actions: show, collect, count, first, take
- Other methods: printSchema, schema, createOrReplaceTempView
Spark SQL is a module for structured data processing with multiple interfaces.
We can interact with Spark SQL in two ways:
- Executing SQL queries
- Working with the DataFrame API
Query execution: We can express the same query using any interface; the Spark SQL engine generates the same query plan, which is used to optimize and execute the query on our Spark cluster.
Inside the Query Execution Engine: Query Plan ==> Optimized Query Plan ==> RDDs ==> Execution
SparkSession
- The SparkSession class is the single entry point to all functionality in Spark using the DataFrame API.
- In Databricks notebooks, the SparkSession is available as the variable `spark`.
- In other environments, we can create a SparkSession using the builder API.
```python
products_df = spark.table("products")
```

- The `table` method returns a DataFrame representing the data in the products table.
SparkSession methods:
- `sql`: Executes a SQL query and returns the result as a DataFrame.
- `table`: Returns a DataFrame representing the data in the specified table.
- `read`: Returns a DataFrameReader object that can be used to read data from various sources (e.g., CSV, Parquet, JSON).
- `createDataFrame`: Creates a DataFrame from a local collection of data.
- `range`: Creates a DataFrame with a range of numbers.
- `createOrReplaceTempView` (a DataFrame method, not a SparkSession method): Creates or replaces a temporary view with the specified name.
SparkSession method to run SQL:
DataFrames
- A DataFrame is a distributed collection of data organized into named columns. It is similar to a table in a relational database or a DataFrame in R or Python.
- DataFrames are immutable, meaning that once created, they cannot be changed. Instead, we create new DataFrames by applying transformations to existing ones.
- DataFrames are distributed, meaning that they are divided into partitions that can be processed in parallel across a cluster of machines.
- DataFrames can be created from various sources, including structured data files (e.g., CSV, Parquet), tables in Hive, and existing RDDs.
- DataFrames can be manipulated using a variety of operations, including filtering, grouping, and aggregating data.
- DataFrames can be queried using SQL syntax, allowing users to leverage their existing SQL knowledge.
- DataFrames can be cached in memory for faster access, improving performance for iterative algorithms and interactive queries.
- DataFrames can be converted to and from RDDs, allowing users to leverage the full power of Spark’s RDD API when needed.
- The code above creates a DataFrame from the products table, filters the data to include only products with a price greater than 100, selects specific columns, and orders the results by price.
- `df.show()` displays the first 20 rows of the DataFrame in a tabular format.
- `df.schema` returns the schema of the DataFrame, which describes the structure of the data, including the column names and data types (in PySpark, `schema` is a property, not a method).
- `df.printSchema()` prints the schema of the DataFrame in a tree format.
Transformations operate on and return DataFrames, allowing us to chain transformation methods together to construct new DataFrames. However, these operations can’t execute on their own, as transformation methods are lazily evaluated. This means that Spark will not execute the transformations until an action is called. This allows Spark to optimize the execution plan and minimize data shuffling.
Actions: DataFrame actions are methods that trigger computation. An action is needed to trigger the execution of any DataFrame transformations.
- Actions return a value to the driver program or write data to an external storage system.
- Examples of actions include `show`, `collect`, `count`, `first`, `head`, `describe`, `summary`, and `take`.
- show: Displays the first n rows of the DataFrame in a tabular format.
- collect: Returns all the rows of the DataFrame as a list of Row objects.
- count: Returns the number of rows in the DataFrame.
- first/head: Returns the first row of the DataFrame as a Row object.
- describe: Returns a DataFrame that contains basic statistics for numeric columns in the DataFrame.
- summary: Returns a DataFrame that contains basic statistics for all columns in the DataFrame.
- take: Returns the first n rows of the DataFrame as a list of Row objects.
Convert between DataFrames and SQL
```python
df.createOrReplaceTempView("products")
```

- The `createOrReplaceTempView` method creates a temporary view of the DataFrame that can be queried using SQL.