Create DataFrames in Spark
1. From a List of Tuples
This method is great for small datasets where rows can be manually defined.
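A minimal sketch with illustrative row data; each tuple becomes one row, and the column names are supplied separately:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create-df").getOrCreate()

# Each tuple is one row; the list after it names the columns.
data = [("Alice", 30), ("Bob", 25)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
```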
2. From a List of Dictionaries
In this example we do not provide any schema details; Spark automatically infers the schema from the keys and values in each dictionary.
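A sketch with illustrative data; note that some PySpark versions emit a deprecation warning here, suggesting Row objects instead of dictionaries:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keys become column names; value types are inferred per column.
data = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
df = spark.createDataFrame(data)
df.show()
df.printSchema()
```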
You may notice that the column order in the resulting DataFrame differs from the key order in the dictionaries above. This is because, when Spark infers the schema from dictionaries, it does not guarantee that the original key order is preserved.
Using an explicit schema:
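The same illustrative data, this time with a schema built from StructType and StructField:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Field order and types are fixed by the schema, not inferred.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
data = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
df = spark.createDataFrame(data, schema=schema)
df.show()
df.printSchema()
```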
Because the schema was provided explicitly when creating the DataFrame, the column order and the data type of each column are exactly as specified.
3. From an RDD
RDDs (Resilient Distributed Datasets) are the foundation of Spark, and you can convert them to DataFrames.
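A minimal sketch that parallelizes a list of illustrative tuples into an RDD and converts it; rdd.toDF(["name", "age"]) would work equally well:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Distribute a local list as an RDD, then attach column names.
rdd = spark.sparkContext.parallelize([("Alice", 30), ("Bob", 25)])
df = spark.createDataFrame(rdd, ["name", "age"])
df.show()
```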
4. From a Pandas DataFrame
If you already have a pandas DataFrame, you can convert it to a PySpark DataFrame.
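A sketch converting an illustrative pandas DataFrame; the Spark schema is derived from the pandas dtypes:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# Column names and types carry over from the pandas DataFrame.
df = spark.createDataFrame(pdf)
df.show()
```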
5. From a CSV File
This is useful for loading larger datasets stored in files.
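A sketch, assuming a hypothetical people.csv whose first line is a header row:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header=True takes column names from the first line;
# inferSchema=True samples the data to choose column types.
df = spark.read.csv("path/to/people.csv", header=True, inferSchema=True)
df.show()
```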
6. From a JSON File
Create a DataFrame directly from a JSON file.
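A sketch, assuming a hypothetical people.json in JSON Lines format (one object per line); for a single multi-line JSON document, pass multiLine=True:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark infers the schema from the JSON objects.
df = spark.read.json("path/to/people.json")
df.show()
```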
7. Programmatically with Row Objects
Row objects give each field an explicit name, which makes row creation more structured and self-describing.
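A minimal sketch with two illustrative Row objects:

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Field names come from the Row keyword arguments.
rows = [Row(name="Alice", age=30), Row(name="Bob", age=25)]
df = spark.createDataFrame(rows)
df.show()
```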
8. Using Range Function
Use range to create a DataFrame with a sequence of numbers.
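For example, a single-column DataFrame of ids 0 through 9:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Produces one column named "id" with values 0..9;
# start, step, and numPartitions can also be specified.
df = spark.range(10)
df.show()
```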
9. Using SQL Query on Existing Data
You can create a DataFrame by running an SQL query on an existing table or view.
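A sketch that registers an illustrative DataFrame as a temporary view and queries it; the result of spark.sql is itself a DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
base.createOrReplaceTempView("people")

# Any SQL result comes back as a new DataFrame.
df = spark.sql("SELECT name FROM people WHERE age > 26")
df.show()
```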