show() Function in Spark

The show() method prints the contents of a DataFrame or Dataset to the console in a tabular format. It is one of the most commonly used actions in Spark for debugging and inspecting data. By default, show() displays the first 20 rows and truncates strings longer than 20 characters; you can change the number of rows, the truncation behavior, and the layout.


1. Syntax

PySpark:

df.show(n=20, truncate=True, vertical=False)

Spark SQL:

  • There is no direct equivalent in Spark SQL, but you can use SELECT * FROM table_name LIMIT n to achieve similar results.

2. Parameters

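  • n: The number of rows to display (default is 20).
  • truncate: If True (the default), strings longer than 20 characters are truncated. If an integer greater than 1 is provided, strings are truncated to that length and cells are right-aligned. If False, full strings are shown.
  • vertical: If True, prints each row as a block with one line per column instead of a table (default is False).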
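The truncation rule can be mimicked in plain Python to make it concrete. This is only an illustrative sketch of the observed behavior, not Spark's own code; the helper name `truncate_cell` is hypothetical.

```python
def truncate_cell(value: str, truncate=True) -> str:
    """Mimic how show() shortens a single cell (illustrative only)."""
    # truncate=True means a 20-character limit; False (or 0) means no limit
    if truncate is True:
        limit = 20
    elif truncate is False:
        limit = 0
    else:
        limit = int(truncate)

    if limit <= 0 or len(value) <= limit:
        return value
    # Limits below 4 leave no room for an ellipsis, so the string is just cut
    if limit < 4:
        return value[:limit]
    # Otherwise keep the first (limit - 3) characters and append "..."
    return value[: limit - 3] + "..."


print(truncate_cell("This is a very long string that will be truncated"))
print(truncate_cell("Another long string that will be truncated", 10))
```

The ellipsis counts toward the limit, so with truncate=10 only the first 7 characters of a long string remain visible.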

3. Key Features

  • Action: show() is an action, meaning it triggers the execution of the Spark job.
  • Tabular Format: Displays the DataFrame in a readable tabular format.
  • Customizable: You can control the number of rows and truncation of long strings.

4. Examples

Example 1: Displaying the First 20 Rows

PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ShowExample").getOrCreate()

# Create DataFrame
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Display up to 20 rows (the default); this DataFrame has only 4
df.show()

Output:

+-------+---+
|   Name|Age|
+-------+---+
|  Anand| 25|
|   Bala| 30|
|Kavitha| 28|
|    Raj| 35|
+-------+---+

Example 2: Displaying a Specific Number of Rows

PySpark:

# Display the first 2 rows
df.show(2)

Output:

+-----+---+
| Name|Age|
+-----+---+
|Anand| 25|
| Bala| 30|
+-----+---+
only showing top 2 rows

Example 3: Disabling Truncation

PySpark:

# Create DataFrame with long strings
data = [("Anand", "This is a very long string that will be truncated"), 
        ("Bala", "Another long string that will be truncated")]
columns = ["Name", "Description"]

df = spark.createDataFrame(data, columns)

# Display without truncation
df.show(truncate=False)

Output:

+-----+-------------------------------------------------+
|Name |Description                                      |
+-----+-------------------------------------------------+
|Anand|This is a very long string that will be truncated|
|Bala |Another long string that will be truncated       |
+-----+-------------------------------------------------+

Example 4: Displaying All Columns Without Truncation

PySpark:

# truncate=False applies to every column, so this is the same call as in Example 3
df.show(truncate=False)

Output:

+-----+-------------------------------------------------+
|Name |Description                                      |
+-----+-------------------------------------------------+
|Anand|This is a very long string that will be truncated|
|Bala |Another long string that will be truncated       |
+-----+-------------------------------------------------+

Example 5: Displaying a Subset of Columns

PySpark:

# Display only the 'Name' column (df is the Name/Description DataFrame from Example 3)
df.select("Name").show()

Output:

+-----+
| Name|
+-----+
|Anand|
| Bala|
+-----+

Example 6: Displaying Data with Custom Truncation Length

PySpark:

# Display with custom truncation length (e.g., 10 characters)
df.show(truncate=10)

Output:

+-----+-----------+
| Name|Description|
+-----+-----------+
|Anand| This is...|
| Bala| Another...|
+-----+-----------+

Example 7: Displaying Data with Vertical Format

PySpark:

# Display data in vertical format (the default 20-character truncation still applies)
df.show(vertical=True)

Output:

-RECORD 0---------------------------
 Name        | Anand
 Description | This is a very lo...
-RECORD 1---------------------------
 Name        | Bala
 Description | Another long stri...

5. Common Use Cases

  • Debugging and inspecting data during development.
  • Displaying sample data for analysis or reporting.
  • Verifying the results of transformations or aggregations.

6. Performance Considerations

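  • Execution Overhead: Every call to show() runs a Spark job, and the result is not kept between calls. Spark only has to fetch the rows it prints, so a simple pipeline may read just a few partitions, but upstream joins, aggregations, or sorts can still process the full dataset. Use it carefully on large datasets.
  • Truncation: By default, strings longer than 20 characters are truncated. Use truncate=False to display full strings.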

7. Key Takeaways

  1. Purpose: The show() function is used to display the contents of a DataFrame or Dataset in a tabular format.
  2. Action: It is an action that triggers the execution of the Spark job.
  3. Spark SQL: There is no direct equivalent, but SELECT * FROM table_name LIMIT n achieves similar results.