The lit() function in Spark is used to create a new column with a constant or literal value. It is part of the pyspark.sql.functions module and is particularly useful when you need to add a column with a fixed value to a DataFrame. It is often used in combination with other transformations, such as withColumn().
1. Syntax
PySpark:
from pyspark.sql.functions import lit
lit(value)
2. Parameters
- value: The constant value to be added as a new column. This can be a string, number, boolean, or any other literal value.
3. Return Type
- Returns a Column object representing the constant value.
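For example, the returned Column behaves like any other column expression; it can be aliased, cast, or combined with other columns (a minimal sketch):
PySpark:
from pyspark.sql.functions import lit
c = lit(100)
print(type(c))                # <class 'pyspark.sql.column.Column'>
bonus_col = c.alias("Bonus")  # the Column can be aliased, cast, or used in expressions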
4. Examples
Example 1: Adding a Constant Column to a DataFrame
PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName("LitExample").getOrCreate()
data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df_with_country = df.withColumn("Country", lit("India"))
df_with_country.show()
Spark SQL:
SELECT *, 'India' AS Country
FROM people;
Output:
+-------+---+-------+
|   Name|Age|Country|
+-------+---+-------+
|  Anand| 25|  India|
|   Bala| 30|  India|
|Kavitha| 28|  India|
|    Raj| 35|  India|
+-------+---+-------+
Example 2: Adding a Numeric Constant Column
PySpark:
# Add a new column 'Bonus' with a constant value 1000
df_with_bonus = df.withColumn("Bonus", lit(1000))
df_with_bonus.show()
Spark SQL:
SELECT *, 1000 AS Bonus
FROM people;
Output:
+-------+---+-----+
|   Name|Age|Bonus|
+-------+---+-----+
|  Anand| 25| 1000|
|   Bala| 30| 1000|
|Kavitha| 28| 1000|
|    Raj| 35| 1000|
+-------+---+-----+
Example 3: Using lit() in an Expression
PySpark:
from pyspark.sql.functions import col
# Add a new column 'TotalSalary' as the product of 'Age' and a constant value 1000
df_with_total_salary = df.withColumn("TotalSalary", col("Age") * lit(1000))
df_with_total_salary.show()
Spark SQL:
SELECT *, Age * 1000 AS TotalSalary
FROM people;
Output:
+-------+---+-----------+
|   Name|Age|TotalSalary|
+-------+---+-----------+
|  Anand| 25|      25000|
|   Bala| 30|      30000|
|Kavitha| 28|      28000|
|    Raj| 35|      35000|
+-------+---+-----------+
Example 4: Adding a Boolean Constant Column
PySpark:
# Add a new column 'IsActive' with a constant value True
df_with_active = df.withColumn("IsActive", lit(True))
df_with_active.show()
Spark SQL:
SELECT *, TRUE AS IsActive
FROM people;
Output:
+-------+---+--------+
| Name|Age|IsActive|
+-------+---+--------+
| Anand| 25| true|
| Bala| 30| true|
|Kavitha| 28| true|
| Raj| 35| true|
+-------+---+--------+
Example 5: Using lit() with Null Values
PySpark:
# Add a new column 'Manager' with a constant value None (null)
df_with_manager = df.withColumn("Manager", lit(None).cast("string"))
df_with_manager.show()
Spark SQL:
SELECT *, NULL AS Manager
FROM people;
Output:
+-------+---+-------+
| Name|Age|Manager|
+-------+---+-------+
| Anand| 25| null|
| Bala| 30| null|
|Kavitha| 28| null|
| Raj| 35| null|
+-------+---+-------+
Example 6: Using lit() with Conditional Logic
PySpark:
from pyspark.sql.functions import when, col
# Add a new column 'Status': 'Active' for employees older than 30, otherwise 'Inactive'
df_with_status = df.withColumn(
    "Status",
    when(col("Age") > 30, lit("Active")).otherwise(lit("Inactive"))
)
df_with_status.show()
Spark SQL:
SELECT *,
       CASE
           WHEN Age > 30 THEN 'Active'
           ELSE 'Inactive'
       END AS Status
FROM people;
Output:
+-------+---+--------+
|   Name|Age|  Status|
+-------+---+--------+
|  Anand| 25|Inactive|
|   Bala| 30|Inactive|
|Kavitha| 28|Inactive|
|    Raj| 35|  Active|
+-------+---+--------+
Example 7: Using lit() with String Concatenation
PySpark:
from pyspark.sql.functions import concat
# Add a new column 'FullName' by concatenating 'Name' with a constant value ' (Employee)'
df_with_full_name = df.withColumn("FullName", concat(col("Name"), lit(" (Employee)")))
df_with_full_name.show()
Spark SQL:
SELECT *, CONCAT(Name, ' (Employee)') AS FullName
FROM people;
Output:
+-------+---+-------------------+
| Name|Age| FullName|
+-------+---+-------------------+
| Anand| 25| Anand (Employee)|
| Bala| 30| Bala (Employee)|
|Kavitha| 28|Kavitha (Employee)|
| Raj| 35| Raj (Employee)|
+-------+---+-------------------+
Example 8: Using lit() with Date and Timestamp Values
PySpark:
# Add a new column 'HireDate' with a constant date value
df_with_hire_date = df.withColumn("HireDate", lit("2023-01-01").cast("date"))
df_with_hire_date.show()
Spark SQL:
SELECT *, CAST('2023-01-01' AS DATE) AS HireDate
FROM people;
Output:
+-------+---+----------+
| Name|Age| HireDate|
+-------+---+----------+
| Anand| 25|2023-01-01|
| Bala| 30|2023-01-01|
|Kavitha| 28|2023-01-01|
| Raj| 35|2023-01-01|
+-------+---+----------+
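A timestamp literal can be added in the same way (a sketch; the column name HireTimestamp is illustrative):
PySpark:
# Add a new column 'HireTimestamp' with a constant timestamp value
df_with_hire_ts = df.withColumn("HireTimestamp", lit("2023-01-01 09:00:00").cast("timestamp"))
df_with_hire_ts.show(truncate=False)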
5. Common Use Cases
- Adding metadata columns (e.g., country, status, created_date); see the sketch after this list.
- Creating derived columns with fixed values (e.g., bonuses, default values).
- Using constant values in complex expressions or transformations.
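For instance, a minimal sketch of the first use case (the column names Country, Status, and CreatedDate are illustrative; current_date() fills the load date, which is not a literal but is often added alongside them):
PySpark:
from pyspark.sql.functions import lit, current_date
df_with_metadata = (df
    .withColumn("Country", lit("India"))         # constant string metadata
    .withColumn("Status", lit("Active"))         # constant status flag
    .withColumn("CreatedDate", current_date()))  # load date, added alongside the literals
df_with_metadata.show()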
6. Performance Considerations
- lit() simply adds a literal expression to the query plan; it does not trigger a shuffle or any data movement, so its overhead is negligible.
- Combine lit() with other functions (e.g., withColumn(), select()) for advanced transformations, as sketched below.
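As a brief illustration of combining lit() with select() (a sketch, reusing the df from the examples above):
PySpark:
from pyspark.sql.functions import col, lit
df.select(col("Name"), lit("India").alias("Country")).show()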
7. Key Takeaways
- The lit() function is used to create a new column with a constant or literal value.
- It can be used to add columns with string, numeric, boolean, or null values.
- In Spark SQL, the same result is achieved by placing literal values directly in SELECT statements.
- Using lit() is lightweight and adds negligible overhead.
- It works efficiently on large datasets.