Objectives
- Read from CSV files
- Read from JSON files
- Write DataFrame to files
- Write DataFrame to tables
- Write DataFrame to Delta tables
- DataFrameReader: read, csv, json, parquet, table, format, load, option, schema
- DataFrameWriter: write, csv, json, parquet, table, format, save, option, mode, saveAsTable
- StructType: toDDL
- Types: StructType, StructField, StringType, IntegerType, DoubleType, DateType, TimestampType, BooleanType
DataFrameReader
- The DataFrameReader class is used to read data from various sources (e.g., CSV, JSON, Parquet, Delta) into a DataFrame.
- It is the interface used to load a DataFrame from an external data source.
- The DataFrameReader class is available as the read attribute of the SparkSession object.
spark.read.parquet("/mnt/tables/sales")
- The above command reads the Parquet files from the specified path and returns a DataFrame.
- The read attribute returns a DataFrameReader object that can be used to read data from various sources.
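The shortcut call above and the generic format/load form from the objectives can be sketched side by side. This is a minimal sketch, assuming an existing SparkSession is passed in as `spark`; the function names are hypothetical, and the default path is the one from the example above.

```python
def read_sales_shortcut(spark, path="/mnt/tables/sales"):
    """Read Parquet files with the format-specific shortcut method."""
    # spark.read is a DataFrameReader; parquet() loads the path as a DataFrame.
    return spark.read.parquet(path)


def read_sales_generic(spark, path="/mnt/tables/sales"):
    """Read the same files with the generic format(...).load(...) form."""
    # format() names the data source, load() triggers the read definition.
    return spark.read.format("parquet").load(path)
```

Both functions should describe the same read; the generic form is mainly useful when the format name is only known at run time.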
Read from CSV files
- Read from CSV with the DataFrameReader class's csv method.
- Define the schema explicitly with the StructType and StructField classes.
- The above code reads the CSV file from the specified path and returns a DataFrame with the specified schema.
- The StructType class is used to define the schema of the DataFrame.
- The StructField class is used to define the fields of the schema.
- The StringType, IntegerType, DoubleType, and DateType classes are used to define the data types of the fields.
- The True parameter in the StructField constructor indicates that the field is nullable.
- This time Spark does not need to run a job to infer the schema, because we have defined the schema explicitly.
- The ddl_schema variable contains the DDL syntax for the schema.
- The schema parameter in the csv method is set to the ddl_schema variable.
Read from JSON files
- Read from JSON with the DataFrameReader class's json method.
- The above code reads the JSON file from the specified path and returns a DataFrame.
- This triggers a Spark job because the schema is inferred from the data, here with the inferSchema option set. If we define the schema explicitly, no job is triggered.
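The difference between an inferred and an explicit schema for JSON can be sketched like this. It is a minimal sketch, assuming `spark` is an existing SparkSession; the function names are hypothetical, and the path and DDL string are supplied by the caller.

```python
def read_json_inferred(spark, path):
    """Let Spark infer the schema; it scans the data, so a job runs."""
    return spark.read.json(path)


def read_json_with_schema(spark, path, ddl_schema):
    """Supply the schema up front so no inference pass is needed."""
    # schema() accepts a StructType or a DDL-formatted string.
    return spark.read.schema(ddl_schema).json(path)
```

A call such as `read_json_with_schema(spark, path, "id INT, name STRING")` would read the file with exactly those two columns (the column names here are made up for the example).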
Use the StructType Scala method toDDL to convert the schema to DDL syntax. It returns a DDL-formatted string, which is useful when you want a DDL-formatted string for ingesting CSV or JSON but don't want to write the StructType variant of the schema.
This functionality is not available in Python, but you can call the toDDL method in Scala and then copy the DDL-formatted string into Python.