Methods and classes covered in this section:

read: csv, json, parquet, table, format, load, option, schema
write: csv, json, parquet, table, format, save, option, mode, saveAsTable
Schema types: StructType, StructField, StringType, IntegerType, DoubleType, DateType, TimestampType, BooleanType
The DataFrameReader class is used to read data from various sources (e.g., CSV, JSON, Parquet, Delta) into a DataFrame. It is available as the read attribute of the SparkSession object, so an expression such as spark.read.parquet("/mnt/tables/sales") first obtains a DataFrameReader via the read attribute and then reads from the given source. CSV files are read with the csv method, and their schema can be defined explicitly using the StructType and StructField classes.
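As a minimal sketch of the reader API (the CSV path and the header option below are illustrative assumptions, not from the original text):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.read is a DataFrameReader; each source format has a dedicated method.
sales_df = spark.read.parquet("/mnt/tables/sales")

# The generic form uses format() and load(), with options set via option().
csv_df = (
    spark.read.format("csv")
    .option("header", "true")    # hypothetical option for this example
    .load("/mnt/raw/sales.csv")  # hypothetical path
)
```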
The StructType class is used to define the schema of the DataFrame, and the StructField class is used to define the individual fields of that schema. The StringType, IntegerType, DoubleType, and DateType classes specify the data types of the fields, and the True argument in the StructField constructor indicates that a field is nullable. The same schema can also be written in DDL syntax: the ddl_schema variable holds the DDL-formatted string, and the schema parameter of the csv method is set to ddl_schema.
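A minimal sketch of both schema styles, assuming the spark session from the previous snippet; the column names are illustrative, since the original text does not list them:

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType, DateType,
)

# Explicit schema: each StructField takes (name, type, nullable).
schema = StructType([
    StructField("order_id", IntegerType(), True),  # True => field is nullable
    StructField("customer", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_date", DateType(), True),
])

# The same schema expressed in DDL syntax.
ddl_schema = "order_id INT, customer STRING, amount DOUBLE, order_date DATE"

# Either form can be passed as the schema parameter of csv().
df = spark.read.csv("/mnt/raw/sales.csv", header=True, schema=ddl_schema)
```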
JSON files are read the same way with the json method. Schema inference is controlled by the inferSchema option, which makes Spark run a job up front to scan the data and derive the types; if we explicitly define the schema, no such job is spun up.
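For example, continuing the sketch above (the paths remain hypothetical):

```python
# With inferSchema, Spark runs a job up front to scan the data and derive types.
inferred = spark.read.option("inferSchema", "true").csv("/mnt/raw/sales.csv", header=True)

# With an explicit schema (StructType or DDL string), no inference job is spun up.
explicit = spark.read.json("/mnt/raw/sales.json", schema=ddl_schema)
```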
In Scala, the StructType class provides a toDDL method that converts a schema to DDL syntax, returning the DDL-formatted string. This is useful when you want the DDL-formatted string for ingesting CSV or JSON but don't want to write out the StructType variant of the schema. The toDDL method is not available in Python, but you can call it in Scala and then copy the DDL-formatted string into your Python code.
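A short Scala sketch of that round trip, reusing the same hypothetical columns as above:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("order_id", IntegerType, nullable = true),
  StructField("amount", DoubleType, nullable = true),
  StructField("order_date", DateType, nullable = true)
))

// toDDL returns the DDL-formatted string, e.g.
// "order_id INT, amount DOUBLE, order_date DATE" (quoting varies by Spark version),
// which can then be pasted into Python as the schema argument.
println(schema.toDDL)
```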