Create DataFrames
Different ways to create DataFrames in PySpark
1. From a List of Tuples
This method is great for small datasets where rows can be manually defined.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("CreateDataFrames").getOrCreate()

data = [("Arjun", 25, "Mumbai"),
        ("Meera", 30, "Delhi"),
        ("Ravi", 22, "Bangalore")]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()
df.printSchema()
Output:
+-----+---+---------+
| Name|Age| City|
+-----+---+---------+
|Arjun| 25| Mumbai|
|Meera| 30| Delhi|
| Ravi| 22|Bangalore|
+-----+---+---------+
root
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
|-- City: string (nullable = true)
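If you do not need full control over the types, createDataFrame also accepts a plain list of column names and lets Spark infer the types. A minimal sketch using the same data:

# Let Spark infer the column types from the data
df = spark.createDataFrame(data, ["Name", "Age", "City"])
df.printSchema()  # Age is inferred as long rather than integer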
2. From a List of Dictionaries
In this example we do not provide a schema; Spark infers it automatically from the keys and values of the dictionaries.
data = [{"Name": "Priya", "Age": 28, "City": "Chennai"},
        {"Name": "Ramesh", "Age": 35, "City": "Hyderabad"}]

df = spark.createDataFrame(data)
df.show()
df.printSchema()
Output:
Notice below that the column order in the DataFrame differs from the key order in the dictionaries above. When Spark infers the schema from dictionaries, it sorts the keys, so the columns appear in alphabetical order (Age, City, Name). Also note that Age is inferred as long rather than integer.
+---+---------+------+
|Age| City| Name|
+---+---------+------+
| 28| Chennai| Priya|
| 35|Hyderabad|Ramesh|
+---+---------+------+
root
|-- Age: long (nullable = true)
|-- City: string (nullable = true)
|-- Name: string (nullable = true)
Using explicit schema:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

# Create DataFrame with schema
df = spark.createDataFrame(data, schema=schema)
df.show()
df.printSchema()
Output:
Because we provided the schema explicitly when creating the DataFrame, the column order and the data types are as expected.
+------+---+---------+
| Name|Age| City|
+------+---+---------+
| Priya| 28| Chennai|
|Ramesh| 35|Hyderabad|
+------+---+---------+
root
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
|-- City: string (nullable = true)
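If you only need a particular column order and are happy with the inferred types, a select after creation also works; a minimal sketch using the dictionary data above:

# Reorder the inferred columns without defining a schema
df = spark.createDataFrame(data).select("Name", "Age", "City")
df.show()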
3. From an RDD
RDDs (Resilient Distributed Datasets) are the foundation of Spark, and you can convert them to DataFrames.
# Create an RDD from a list of tuples
data = [("Arjun", 25, "Mumbai"),
        ("Meera", 30, "Delhi"),
        ("Ravi", 22, "Bangalore")]
columns = ["Name", "Age", "City"]

rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)  # Convert to a DataFrame
df.show()
df.printSchema()
Output:
+-----+---+---------+
| Name|Age| City|
+-----+---+---------+
|Arjun| 25| Mumbai|
|Meera| 30| Delhi|
| Ravi| 22|Bangalore|
+-----+---+---------+
root
|-- Name: string (nullable = true)
|-- Age: long (nullable = true)
|-- City: string (nullable = true)
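Note that toDF infers the types, which is why Age comes out as long here. To control the types, you can pass the RDD together with an explicit schema to createDataFrame; a minimal sketch reusing the schema from section 1:

# Apply an explicit schema to the RDD
df = spark.createDataFrame(rdd, schema)
df.printSchema()  # Age is now integer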
4. From a Pandas DataFrame
If you already have a pandas DataFrame, you can convert it to a PySpark DataFrame.
import pandas as pd

# Create pandas DataFrame
pandas_df = pd.DataFrame({
    "Name": ["Anjali", "Vikram", "Sita"],
    "Age": [27, 32, 23],
    "City": ["Kolkata", "Jaipur", "Lucknow"]
})

# Convert to PySpark DataFrame
df = spark.createDataFrame(pandas_df)
df.show()
df.printSchema()
Output:
+------+---+-------+
| Name|Age| City|
+------+---+-------+
|Anjali| 27|Kolkata|
|Vikram| 32| Jaipur|
| Sita| 23|Lucknow|
+------+---+-------+
root
|-- Name: string (nullable = true)
|-- Age: long (nullable = true)
|-- City: string (nullable = true)
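For large pandas DataFrames, the conversion can be sped up with Apache Arrow. A minimal sketch for Spark 3.x, assuming the pyarrow package is installed:

# Enable Arrow-based columnar data transfer (requires pyarrow)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df = spark.createDataFrame(pandas_df)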
5. From a CSV File
This is useful for loading larger datasets stored in files.
# Load CSV file into DataFrame
df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)
df.show()
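Keep in mind that inferSchema=True makes Spark scan the file an extra time to guess the types. For large files you can supply an explicit schema instead; a minimal sketch, assuming the CSV has the same three columns used above:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# An explicit schema avoids the extra pass over the file
csv_schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

df = spark.read.csv("path/to/your/file.csv", header=True, schema=csv_schema)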
6. From a JSON File
Create a DataFrame directly from a JSON file.
# Load JSON file
df = spark.read.json("path/to/your/file.json")
df.show()
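By default Spark expects JSON Lines format (one JSON object per line). If the file contains a single multi-line JSON document or an array, enable the multiLine option; a minimal sketch:

# Read a file containing multi-line JSON
df = spark.read.option("multiLine", True).json("path/to/your/file.json")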
7. Programmatically with Row Objects
Row objects allow for more structured data creation.
from pyspark.sql import Row

# Create Row objects
data = [Row(Name="Suresh", Age=26, City="Thiruvananthapuram"),
        Row(Name="Lakshmi", Age=31, City="Patna")]

# Create DataFrame
df = spark.createDataFrame(data)
df.show()
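Row can also serve as a reusable record template: calling Row with field names returns a factory that you then call with values. A minimal sketch:

# Define a reusable Row template with fixed field names
Person = Row("Name", "Age", "City")

data = [Person("Suresh", 26, "Thiruvananthapuram"),
        Person("Lakshmi", 31, "Patna")]
df = spark.createDataFrame(data)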
8. Using Range Function
Use spark.range to create a DataFrame with a sequence of numbers. The end of the range is exclusive, so range(1, 5) produces 1 through 4.

# Create a DataFrame with numbers from 1 to 4
df = spark.range(1, 5).toDF("Numbers")
df.show()
Output:
+-------+
|Numbers|
+-------+
| 1|
| 2|
| 3|
| 4|
+-------+
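range also accepts a step argument and, without toDF, produces a single column named id; a minimal sketch:

# Even numbers from 0 to 8; the default column name is "id"
df = spark.range(0, 10, 2)
df.show()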
9. Using SQL Query on Existing Data
You can create a DataFrame by running an SQL query on an existing table or view.
# Register a temporary view
df.createOrReplaceTempView("people")

# Query the view
result_df = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")
result_df.show()
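The same result can be obtained directly with the DataFrame API, without registering a view; a minimal equivalent sketch:

from pyspark.sql.functions import col

# Equivalent filter and projection using the DataFrame API
result_df = df.filter(col("Age") > 25).select("Name", "Age")
result_df.show()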