show() Function in Spark

The `show()` function in Spark is used to display the contents of a DataFrame or Dataset in a tabular format. It is one of the most commonly used actions in Spark for debugging and inspecting data. By default, `show()` displays the first 20 rows of the DataFrame, but you can customize the number of rows and whether to truncate long strings.
1. Syntax
PySpark:
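For a DataFrame `df`, the signature is:

```python
df.show(n=20, truncate=True, vertical=False)
```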
Spark SQL:
- There is no direct equivalent in Spark SQL, but you can use `SELECT * FROM table_name LIMIT n` to achieve similar results.
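For instance, a rough equivalent through the SQL interface might look like the sketch below, where `people` is a hypothetical temporary view:

```python
# Assumes a view was registered first, e.g. df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people LIMIT 5").show()
```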
2. Parameters
- n: The number of rows to display (default is 20).
- truncate: If True, truncates long strings to 20 characters. If an integer is provided, truncates strings to that length.
- vertical: If True, prints each row vertically (one column value per line) instead of in tabular format; see Example 7 below.
3. Key Features
- Action: `show()` is an action, meaning it triggers the execution of the Spark job.
- Tabular Format: Displays the DataFrame in a readable tabular format.
- Customizable: You can control the number of rows and truncation of long strings.
4. Examples
Example 1: Displaying the First 20 Rows
PySpark:
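All of the examples below use a small, made-up DataFrame; the session setup, column names, and values are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("show-demo").getOrCreate()

# Hypothetical sample data shared by the examples that follow
data = [
    ("Alice", 30, "New York"),
    ("Bob", 25, "Los Angeles"),
    ("Charlie", 35, "Chicago"),
]
df = spark.createDataFrame(data, ["name", "age", "city"])

# With no arguments, show() prints up to the first 20 rows (all 3 here)
df.show()
```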
Output:
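```
+-------+---+-----------+
|   name|age|       city|
+-------+---+-----------+
|  Alice| 30|   New York|
|    Bob| 25|Los Angeles|
|Charlie| 35|    Chicago|
+-------+---+-----------+
```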
Example 2: Displaying a Specific Number of Rows
PySpark:
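Continuing with the same hypothetical `df`, pass an integer to limit how many rows are printed:

```python
# Only the first 2 rows are displayed
df.show(2)
```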
Output:
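```
+-----+---+-----------+
| name|age|       city|
+-----+---+-----------+
|Alice| 30|   New York|
|  Bob| 25|Los Angeles|
+-----+---+-----------+
only showing top 2 rows
```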
Example 3: Disabling Truncation
PySpark:
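Truncation only matters for long values, so this sketch uses another made-up DataFrame with a long string column:

```python
bio_data = [
    ("Alice", "Enjoys hiking, photography, and long road trips"),
    ("Bob", "Writes about distributed systems"),
]
bio_df = spark.createDataFrame(bio_data, ["name", "bio"])

# The default show() would cut each bio to 20 characters;
# truncate=False prints the full, left-aligned strings instead
bio_df.show(truncate=False)
```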
Output:
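```
+-----+-----------------------------------------------+
|name |bio                                            |
+-----+-----------------------------------------------+
|Alice|Enjoys hiking, photography, and long road trips|
|Bob  |Writes about distributed systems               |
+-----+-----------------------------------------------+
```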
Example 4: Displaying All Columns Without Truncation
PySpark:
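One reasonable reading of this example, using the same hypothetical `df`: combine an explicit row count with `truncate=False` so every row and every full value is printed:

```python
# df.count() runs a separate job just to fetch the row count
df.show(n=df.count(), truncate=False)
```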
Output:
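```
+-------+---+-----------+
|name   |age|city       |
+-------+---+-----------+
|Alice  |30 |New York   |
|Bob    |25 |Los Angeles|
|Charlie|35 |Chicago    |
+-------+---+-----------+
```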
Example 5: Displaying a Subset of Columns
PySpark:
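`show()` always prints every column, so a subset is chosen with `select()` before showing (same hypothetical `df`):

```python
# Narrow to the desired columns first, then display
df.select("name", "age").show()
```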
Output:
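```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+
```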
Example 6: Displaying Data with Custom Truncation Length
PySpark:
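Passing an integer to `truncate` sets the cutoff length; again with the hypothetical `df`:

```python
# Values longer than 10 characters are cut to 7 characters plus "..."
df.show(truncate=10)
```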
Output:
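```
+-------+---+----------+
|   name|age|      city|
+-------+---+----------+
|  Alice| 30|  New York|
|    Bob| 25|Los Ang...|
|Charlie| 35|   Chicago|
+-------+---+----------+
```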
Example 7: Displaying Data with Vertical Format
PySpark:
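Vertical mode prints one column value per line, which is easier to read for wide rows (same hypothetical `df`):

```python
# One record block per row instead of a table
df.show(n=2, vertical=True)
```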
Output:
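Illustrative layout; exact padding can vary slightly between Spark versions:

```
-RECORD 0-----------
 name | Alice
 age  | 30
 city | New York
-RECORD 1-----------
 name | Bob
 age  | 25
 city | Los Angeles
only showing top 2 rows
```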
5. Common Use Cases
- Debugging and inspecting data during development.
- Displaying sample data for analysis or reporting.
- Verifying the results of transformations or aggregations.
6. Performance Considerations
- Execution Overhead: `show()` triggers the execution of the entire DataFrame lineage, so use it carefully for large datasets.
- Truncation: By default, long strings are truncated to 20 characters. Use `truncate=False` to display full strings.
7. Key Takeaways
- Purpose: The `show()` function is used to display the contents of a DataFrame or Dataset in a tabular format.
- Action: It is an action that triggers the execution of the Spark job.
- In Spark SQL, similar functionality can be achieved using `SELECT * FROM table_name LIMIT n`.