Objectives
- Read from CSV files
- Read from JSON files
- Write DataFrame to files
- Write DataFrame to tables
- Write DataFrame to Delta tables
- DataFrameReader: read, csv, json, parquet, table, format, load, option, schema
- DataFrameWriter: write, csv, json, parquet, table, format, save, option, mode, saveAsTable
- StructType: toDDL
- Types: StructType, StructField, StringType, IntegerType, DoubleType, DateType, TimestampType, BooleanType
DataFrameReader
- The DataFrameReader class is the interface used to load a DataFrame from an external data source; it reads data from various sources (e.g., CSV, JSON, Parquet, Delta) into a DataFrame.
- The DataFrameReader is available as the read attribute of the SparkSession object.
```python
spark.read.parquet("/mnt/tables/sales")
```
- The above command reads the Parquet files from the specified path and returns a DataFrame.
- The read attribute returns a DataFrameReader object that can be used to read data from various sources.
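For reference, shortcut methods such as parquet are equivalent to the generic format/load pattern; a minimal sketch using the same path as above, assuming an existing SparkSession named spark (as in a notebook):

```python
# Long-form equivalent of spark.read.parquet("/mnt/tables/sales"): format / load
sales_df = (spark.read
                .format("parquet")
                .load("/mnt/tables/sales"))
```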
Read from CSV files
- Read from CSV with the DataFrameReader class’ csv method.
- Define the schema explicitly with the StructType and StructField classes.
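The original code cell is not reproduced here; a minimal sketch with illustrative column names and a hypothetical path, assuming a notebook SparkSession named spark:

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType, DateType)

# Illustrative schema; the real field names and path are assumptions
user_defined_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_date", DateType(), True),
])

sales_csv_df = (spark.read
                    .option("header", "true")
                    .schema(user_defined_schema)
                    .csv("/mnt/tables/sales_csv"))
```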
- The above code reads the CSV file from the specified path and returns a DataFrame with the specified schema.
- The StructType class is used to define the schema of the DataFrame.
- The StructField class is used to define the fields of the schema.
- The StringType, IntegerType, DoubleType, and DateType classes are used to define the data types of the fields.
- The True parameter in the StructField constructor indicates that the field is nullable.
- This time no job is spun up to infer the schema, because we have defined the schema explicitly.
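The schema can also be supplied as a DDL-formatted string instead of a StructType; a sketch using the same illustrative column names and path as above:

```python
# DDL-formatted schema string (same illustrative fields as above)
ddl_schema = "order_id INT, customer STRING, amount DOUBLE, order_date DATE"

sales_csv_df = (spark.read
                    .option("header", "true")
                    .schema(ddl_schema)
                    .csv("/mnt/tables/sales_csv"))
```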
- The ddl_schema variable contains the DDL syntax for the schema.
- The schema parameter in the csv method is set to the ddl_schema variable.
Read from JSON files
- Read from JSON with the DataFrameReader class’ json method.
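The original code cell is not reproduced here; a minimal sketch with a hypothetical path, letting Spark infer the schema:

```python
# Hypothetical path; with no explicit schema, Spark infers one by scanning
# the JSON data, which triggers a Spark job
events_df = spark.read.json("/mnt/tables/events_json")
```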
- The above code reads the JSON file from the specified path and returns a DataFrame.
- This will spin up a Spark job because the schema is inferred from the data. If we explicitly define the schema, no job will be spun up.
StructType Scala method toDDL
- Use the StructType Scala method toDDL to convert a schema to DDL syntax. It returns the DDL-formatted string, which is useful when you want the DDL-formatted string for ingesting CSV or JSON but do not want to write out the StructType variant of the schema.
- This functionality is not available in Python, but you can call toDDL in Scala and then copy the DDL-formatted string into Python.
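For example, the string printed by toDDL in a Scala cell can be pasted into Python and passed straight to the schema method; the string and path below are illustrative assumptions:

```python
# DDL string copied from the output of schema.toDDL in a Scala cell
# (illustrative fields and path)
ddl_from_scala = "order_id INT, customer STRING, amount DOUBLE, order_date DATE"

events_df = (spark.read
                 .schema(ddl_from_scala)
                 .json("/mnt/tables/events_json"))
```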