Create DataFrames

Different ways to create DataFrames in PySpark

1. From a List of Tuples

This method is great for small datasets where rows can be manually defined.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("CreateDataFrames").getOrCreate()

data = [("Arjun", 25, "Mumbai"),
        ("Meera", 30, "Delhi"),
        ("Ravi", 22, "Bangalore")]

schema = StructType([
    StructField("Name", StringType(), True),  
    StructField("Age", IntegerType(), True), 
    StructField("City", StringType(), True) 
])

df = spark.createDataFrame(data, schema)
df.show()
df.printSchema()

Output:

+-----+---+---------+
| Name|Age|     City|
+-----+---+---------+
|Arjun| 25|   Mumbai|
|Meera| 30|    Delhi|
| Ravi| 22|Bangalore|
+-----+---+---------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)

2. From a List of Dictionaries

In this example we do not provide a schema; Spark infers it automatically from the keys and values of the dictionaries.


data = [{"Name": "Priya", "Age": 28, "City": "Chennai"},
        {"Name": "Ramesh", "Age": 35, "City": "Hyderabad"}]

df = spark.createDataFrame(data)
df.show()
df.printSchema()

Output:

Notice below that the column order in the DataFrame differs from the key order in the dictionaries above. When Spark infers the schema from dictionaries, it sorts the field names alphabetically (Age, City, Name), and Age is inferred as a long rather than an integer.

+---+---------+------+
|Age|     City|  Name|
+---+---------+------+
| 28|  Chennai| Priya|
| 35|Hyderabad|Ramesh|
+---+---------+------+

root
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)
 |-- Name: string (nullable = true)

Using explicit schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

# Create DataFrame with schema
df = spark.createDataFrame(data, schema=schema)
df.show()
df.printSchema()

Output:

Because the schema was provided explicitly when creating the DataFrame, the column order and data types come out as expected (Age is an integer rather than a long).

+------+---+---------+
|  Name|Age|     City|
+------+---+---------+
| Priya| 28|  Chennai|
|Ramesh| 35|Hyderabad|
+------+---+---------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)

3. From an RDD

RDDs (Resilient Distributed Datasets) are the foundation of Spark, and you can convert them to DataFrames.

# Create an RDD from a Python list
data = [("Arjun", 25, "Mumbai"),
        ("Meera", 30, "Delhi"),
        ("Ravi", 22, "Bangalore")]
columns = ["Name", "Age", "City"]
rdd = spark.sparkContext.parallelize(data)

df = rdd.toDF(columns) # Convert to a DataFrame
df.show()
df.printSchema()

Output:

+-----+---+---------+
| Name|Age|     City|
+-----+---+---------+
|Arjun| 25|   Mumbai|
|Meera| 30|    Delhi|
| Ravi| 22|Bangalore|
+-----+---+---------+

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)
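
Note that toDF() infers Age as a long. If you want to control the column types, you can pass the RDD together with an explicit schema to createDataFrame(). A minimal sketch, reusing the schema from section 1:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

# Pass the RDD and the schema together; Age is now an integer instead of a long
df = spark.createDataFrame(rdd, schema)
df.printSchema()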

4. From a Pandas DataFrame

If you already have a pandas DataFrame, you can convert it to a PySpark DataFrame.

import pandas as pd

# Create pandas DataFrame
pandas_df = pd.DataFrame({
    "Name": ["Anjali", "Vikram", "Sita"],
    "Age": [27, 32, 23],
    "City": ["Kolkata", "Jaipur", "Lucknow"]
})

# Convert to PySpark DataFrame
df = spark.createDataFrame(pandas_df)
df.show()
df.printSchema()

Output:

+------+---+-------+
|  Name|Age|   City|
+------+---+-------+
|Anjali| 27|Kolkata|
|Vikram| 32| Jaipur|
|  Sita| 23|Lucknow|
+------+---+-------+

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)
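
For larger pandas DataFrames, the conversion can be sped up with Apache Arrow. A minimal sketch, assuming Spark 3.x (where the setting below is named) and that the pyarrow package is installed:

# Enable Arrow-based conversion between pandas and Spark (requires pyarrow)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame(pandas_df)

# The reverse conversion (Spark -> pandas) also benefits from Arrow
pandas_again = df.toPandas()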

5. From a CSV File

This is useful for loading larger datasets stored in files.

# Load CSV file into DataFrame
df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)
df.show()
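
With inferSchema=True, Spark makes an extra pass over the file to work out the column types. For large files you can avoid that pass by supplying the schema yourself. A minimal sketch, assuming the CSV has the same Name, Age, City columns used above (the path is a placeholder):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

# Read the CSV with an explicit schema instead of inferring it
df = spark.read.csv("path/to/your/file.csv", header=True, schema=schema)
df.printSchema()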

6. From a JSON File

Create a DataFrame directly from a JSON file.

# Load JSON file
df = spark.read.json("path/to/your/file.json")
df.show()
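
By default, spark.read.json() expects one JSON object per line (JSON Lines format). If the file instead contains a single JSON document spanning multiple lines, the multiLine option can be used. A sketch with a placeholder path:

# Read a file containing one JSON document that spans multiple lines
df = spark.read.json("path/to/your/file.json", multiLine=True)
df.show()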

7. Programmatically with Row Objects

Row objects let you define rows with named fields, which gives a more structured way to build the data.

from pyspark.sql import Row

# Create Row objects
data = [Row(Name="Suresh", Age=26, City="Thiruvananthapuram"),
        Row(Name="Lakshmi", Age=31, City="Patna")]

# Create DataFrame
df = spark.createDataFrame(data)
df.show()
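
A Row can also be used as a reusable record template: calling Row with field names returns a constructor whose positional arguments fill those fields. A minimal sketch with the same columns:

from pyspark.sql import Row

# Define a reusable Row "template" with named fields
Person = Row("Name", "Age", "City")

data = [Person("Suresh", 26, "Thiruvananthapuram"),
        Person("Lakshmi", 31, "Patna")]

df = spark.createDataFrame(data)
df.show()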

8. Using Range Function

Use spark.range() to create a DataFrame with a sequence of numbers. The end value is exclusive, so range(1, 5) produces 1 through 4.

# Create a DataFrame with numbers from 1 to 4
df = spark.range(1, 5).toDF("Numbers")
df.show()

Output:

+-------+
|Numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
+-------+
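
spark.range() also accepts a step argument. For example, a short sketch producing the even numbers 0 through 8:

# start=0, end=10 (exclusive), step=2
df_even = spark.range(0, 10, 2).toDF("Numbers")
df_even.show()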

9. Using SQL Query on Existing Data

You can create a DataFrame by running an SQL query on an existing table or view.

# Register an existing DataFrame (one with Name and Age columns,
# e.g. the DataFrame from section 1) as a temporary view
df.createOrReplaceTempView("people")

# Query the view
result_df = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")
result_df.show()
