Introduction to Apache Spark Core
Spark SQL and the DataFrame API
- SparkSession: sql, table
- DataFrame
- Transformations: select, filter, groupBy, join, union, orderBy
- Actions: show, collect, count, first, take
- Other methods: printSchema, schema, createOrReplaceTempView
- Executing SQL queries
- Working with the DataFrame API
SparkSession
- The SparkSession class is the single entry point to all functionality in Spark using the DataFrame API.
- In Databricks notebooks, the SparkSession is available as the variable `spark`.
- In other environments, we create the SparkSession ourselves, as sketched below.
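A minimal sketch of building a session with the standard builder API (the application name is an illustrative assumption):

```python
from pyspark.sql import SparkSession

# getOrCreate() returns the existing session if one is already running,
# which makes this safe to run in shells and notebooks as well.
spark = (SparkSession.builder
         .appName("example-app")  # illustrative name
         .getOrCreate())
```

With a session available, a table can be read into a DataFrame: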
```python
products_df = spark.table("products")
```
- The `table` method returns a DataFrame representing the data in the products table.
- Other commonly used SparkSession methods:
- `sql`: Executes a SQL query and returns the result as a DataFrame.
- `table`: Returns a DataFrame representing the data in the specified table.
- `read`: Returns a DataFrameReader object that can be used to read data from various sources (e.g., CSV, Parquet, JSON).
- `createDataFrame`: Creates a DataFrame from a local collection of data.
- `range`: Creates a DataFrame with a range of numbers.
- `createOrReplaceTempView` (a DataFrame method, covered below): Creates or replaces a temporary view with the specified name.
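A quick sketch of these entry points (the table name, file path, and sample rows are assumptions):

```python
# SQL query -> DataFrame
df1 = spark.sql("SELECT * FROM products")

# Table -> DataFrame
df2 = spark.table("products")

# File -> DataFrame, via a DataFrameReader
df3 = spark.read.option("header", "true").csv("/tmp/products.csv")

# Local collection -> DataFrame
df4 = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["id", "name"])

# Range of numbers -> single-column DataFrame of ids 0..9
df5 = spark.range(10)
```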
DataFrame
- A DataFrame is a distributed collection of data organized into named columns. It is similar to a table in a relational database or a DataFrame in R or Python.
- DataFrames are immutable, meaning that once created, they cannot be changed. Instead, we create new DataFrames by applying transformations to existing ones.
- DataFrames are distributed, meaning that they are divided into partitions that can be processed in parallel across a cluster of machines.
- DataFrames can be created from various sources, including structured data files (e.g., CSV, Parquet), tables in Hive, and existing RDDs.
- DataFrames can be manipulated using a variety of operations, including filtering, grouping, and aggregating data.
- DataFrames can be queried using SQL syntax, allowing users to leverage their existing SQL knowledge.
- DataFrames can be cached in memory for faster access, improving performance for iterative algorithms and interactive queries.
- DataFrames can be converted to and from RDDs, allowing users to leverage the full power of Spark's RDD API when needed.
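Transformations such as filter, select, and orderBy chain together into a single query. A minimal sketch (the selected column names are assumptions):

```python
from pyspark.sql.functions import col

df = (spark.table("products")
      .filter(col("price") > 100)  # keep only products priced above 100
      .select("name", "price")     # illustrative columns
      .orderBy("price"))
```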
- The above code creates a DataFrame from the products table, filters the data to include only products with a price greater than 100, selects specific columns, and orders the results by price.
- `df.show()` displays the first 20 rows of the DataFrame in a tabular format.
- `df.schema` returns the schema of the DataFrame, which describes the structure of the data, including the column names and data types. (In Python, `schema` is a property, not a method.)
- `df.printSchema()` prints the schema of the DataFrame in a tree format.
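For instance, on the DataFrame built above:

```python
df.show(5)        # first 5 rows, rendered as a text table
print(df.schema)  # the StructType describing column names and types
df.printSchema()  # the same structure rendered as a tree
```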
Actions
- Actions return a value to the driver program or write data to an external storage system.
- Examples of actions include `show`, `collect`, `count`, `first`, `head`, `describe`, `summary`, and `take`.
- `show`: Displays the first n rows of the DataFrame in a tabular format.
- `collect`: Returns all the rows of the DataFrame as a list of Row objects.
- `count`: Returns the number of rows in the DataFrame.
- `first`/`head`: Returns the first row of the DataFrame as a Row object.
- `describe`: Returns a DataFrame that contains basic statistics for numeric columns in the DataFrame.
- `summary`: Returns a DataFrame that contains basic statistics for all columns in the DataFrame.
- `take`: Returns the first n rows of the DataFrame as a list of Row objects.
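A few of these actions applied to the DataFrame from earlier (the results depend on the data, so the comments only indicate the shape of each return value):

```python
df.count()                   # int: number of rows
df.first()                   # Row: the first row
df.take(3)                   # list of up to 3 Row objects
rows = df.collect()          # list of all Row objects; avoid on large DataFrames
df.describe("price").show()  # count, mean, stddev, min, max for price
df.summary().show()          # adds percentiles, for all columns
```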
Convert between DataFrames and SQL
```python
df.createOrReplaceTempView("products")
```
- The `createOrReplaceTempView` method creates a temporary view of the DataFrame that can be queried using SQL. The view is scoped to the current SparkSession.
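Putting the two directions together (the column names are the illustrative ones used above):

```python
# DataFrame -> SQL: register the DataFrame as a view, then query it
df.createOrReplaceTempView("products")
result_df = spark.sql("SELECT name, price FROM products WHERE price > 100")

# SQL -> DataFrame: spark.sql itself returns a DataFrame
result_df.show()
```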