Apache Spark Core
Introduction to Apache Spark Core
Spark SQL and the DataFrame API
- SparkSession: sql, table
- DataFrame
- Transformations: select, filter, groupBy, join, union, orderBy
- Actions: show, collect, count, first, take
- Other methods: printSchema, schema, createOrReplaceTempView
Spark SQL is a module for structured data processing with multiple interfaces.
We can interact with Spark SQL in two ways:
- Executing SQL queries
- Working with the DataFrame API
Query execution: We can express the same query using any interface; the Spark SQL engine generates the same query plan, which is used to optimize and execute the query on our Spark cluster.
Inside the Query Execution Engine: Query Plan ==> Optimized Query Plan ==> RDDs ==> Execution
SparkSession
- The SparkSession class is the single entry point to all functionality in Spark using the DataFrame API.
- In Databricks notebooks, the SparkSession is available as the variable `spark`.
- In other environments, we can create a SparkSession using the builder API.
```python
products_df = spark.table("products")
```

- The `table` method returns a DataFrame representing the data in the products table.
SparkSession methods:
- `sql`: Executes a SQL query and returns the result as a DataFrame.
- `table`: Returns a DataFrame representing the data in the specified table.
- `read`: Returns a DataFrameReader object that can be used to read data from various sources (e.g., CSV, Parquet, JSON).
- `createDataFrame`: Creates a DataFrame from a local collection of data.
- `range`: Creates a DataFrame with a range of numbers.
- `createOrReplaceTempView` (a DataFrame method, not a SparkSession method): Creates or replaces a temporary view with the specified name.
SparkSession method to run SQL:
DataFrames
- A DataFrame is a distributed collection of data organized into named columns. It is similar to a table in a relational database or a DataFrame in R or Python.
- DataFrames are immutable, meaning that once created, they cannot be changed. Instead, we create new DataFrames by applying transformations to existing ones.
- DataFrames are distributed, meaning that they are divided into partitions that can be processed in parallel across a cluster of machines.
- DataFrames can be created from various sources, including structured data files (e.g., CSV, Parquet), tables in Hive, and existing RDDs.
- DataFrames can be manipulated using a variety of operations, including filtering, grouping, and aggregating data.
- DataFrames can be queried using SQL syntax, allowing users to leverage their existing SQL knowledge.
- DataFrames can be cached in memory for faster access, improving performance for iterative algorithms and interactive queries.
- DataFrames can be converted to and from RDDs, allowing users to leverage the full power of Spark’s RDD API when needed.
- The code above creates a DataFrame from the products table, filters the data to include only products with a price greater than 100, selects specific columns, and orders the results by price.
- `df.show()` displays the first 20 rows of the DataFrame in a tabular format.
- `df.schema` returns the schema of the DataFrame, which describes the structure of the data, including the column names and data types (in PySpark, `schema` is a property, not a method).
- `df.printSchema()` prints the schema of the DataFrame in a tree format.
Transformations operate on and return DataFrames, allowing us to chain transformation methods together to construct new DataFrames. However, these operations can’t execute on their own, as transformation methods are lazily evaluated. This means that Spark will not execute the transformations until an action is called. This allows Spark to optimize the execution plan and minimize data shuffling.
Actions: DataFrame actions are methods that trigger computation. An action is needed to trigger the execution of any DataFrame transformations.
- Actions return a value to the driver program or write data to an external storage system.
- Examples of actions include `show`, `collect`, `count`, `first`, `head`, `describe`, `summary`, and `take`.
- show: Displays the first n rows of the DataFrame in a tabular format.
- collect: Returns all the rows of the DataFrame as a list of Row objects.
- count: Returns the number of rows in the DataFrame.
- first/head: Returns the first row of the DataFrame as a Row object.
- describe: Returns a DataFrame that contains basic statistics for numeric columns in the DataFrame.
- summary: Returns a DataFrame that contains basic statistics for all columns in the DataFrame.
- take: Returns the first n rows of the DataFrame as a list of Row objects.
Convert between DataFrames and SQL
```python
df.createOrReplaceTempView("products")
```

- The `createOrReplaceTempView` method creates a temporary view of the DataFrame that can be queried using SQL.