The lit() function in Spark is used to create a new column with a constant or literal value. It is part of the pyspark.sql.functions module and is particularly useful when you need to add a column with a fixed value to a DataFrame. This function is often used in combination with other transformations, such as withColumn().


1. Syntax

PySpark:

from pyspark.sql.functions import lit

lit(value)

2. Parameters

  • value: The constant to embed in the new column. This can be a string, number, boolean, None, or any other literal value.

3. Return Type

  • Returns a Column object representing the constant value.

4. Examples

Example 1: Adding a Constant Column to a DataFrame

PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("LitExample").getOrCreate()

data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

df_with_country = df.withColumn("Country", lit("India"))
df_with_country.show()

Spark SQL:

SELECT *, 'India' AS Country 
FROM people;

Output:

+-------+---+-------+
|   Name|Age|Country|
+-------+---+-------+
|  Anand| 25|  India|
|   Bala| 30|  India|
|Kavitha| 28|  India|
|    Raj| 35|  India|
+-------+---+-------+

Example 2: Adding a Numeric Constant Column

PySpark:

# Add a new column 'Bonus' with a constant value 1000
df_with_bonus = df.withColumn("Bonus", lit(1000))
df_with_bonus.show()

Spark SQL:

SELECT *, 1000 AS Bonus 
FROM people;

Output:

+-------+---+-----+
|   Name|Age|Bonus|
+-------+---+-----+
|  Anand| 25| 1000|
|   Bala| 30| 1000|
|Kavitha| 28| 1000|
|    Raj| 35| 1000|
+-------+---+-----+

Example 3: Using lit() in an Expression

PySpark:

from pyspark.sql.functions import col

# Add a new column 'TotalSalary' as the product of 'Age' and a constant value 1000
df_with_total_salary = df.withColumn("TotalSalary", col("Age") * lit(1000))
df_with_total_salary.show()

Spark SQL:

SELECT *, Age * 1000 AS TotalSalary 
FROM people;

Output:

+-------+---+-----------+
|   Name|Age|TotalSalary|
+-------+---+-----------+
|  Anand| 25|      25000|
|   Bala| 30|      30000|
|Kavitha| 28|      28000|
|    Raj| 35|      35000|
+-------+---+-----------+

Example 4: Adding a Boolean Constant Column

PySpark:

# Add a new column 'IsActive' with a constant value True
df_with_active = df.withColumn("IsActive", lit(True))
df_with_active.show()

Spark SQL:

SELECT *, TRUE AS IsActive 
FROM people;

Output:

+-------+---+--------+
|   Name|Age|IsActive|
+-------+---+--------+
|  Anand| 25|    true|
|   Bala| 30|    true|
|Kavitha| 28|    true|
|    Raj| 35|    true|
+-------+---+--------+

Example 5: Using lit() with Null Values

PySpark:

# Add a new column 'Manager' with a constant value None (null)
df_with_manager = df.withColumn("Manager", lit(None).cast("string"))
df_with_manager.show()

Spark SQL:

SELECT *, NULL AS Manager 
FROM people;

Output:

+-------+---+-------+
|   Name|Age|Manager|
+-------+---+-------+
|  Anand| 25|   null|
|   Bala| 30|   null|
|Kavitha| 28|   null|
|    Raj| 35|   null|
+-------+---+-------+

Example 6: Using lit() with Conditional Logic

PySpark:

from pyspark.sql.functions import when

# Add a new column 'Status' with a constant value 'Active' for employees older than 30
df_with_status = df.withColumn("Status", 
                               when(col("Age") > 30, lit("Active"))
                               .otherwise(lit("Inactive")))
df_with_status.show()

Spark SQL:

SELECT *, 
       CASE 
           WHEN Age > 30 THEN 'Active' 
           ELSE 'Inactive' 
       END AS Status 
FROM people;

Output:

+-------+---+--------+
|   Name|Age|  Status|
+-------+---+--------+
|  Anand| 25|Inactive|
|   Bala| 30|Inactive|
|Kavitha| 28|Inactive|
|    Raj| 35|  Active|
+-------+---+--------+

Example 7: Using lit() with String Concatenation

PySpark:

from pyspark.sql.functions import concat

# Add a new column 'FullName' by concatenating 'Name' with a constant value ' (Employee)'
df_with_full_name = df.withColumn("FullName", concat(col("Name"), lit(" (Employee)")))
df_with_full_name.show()

Spark SQL:

SELECT *, CONCAT(Name, ' (Employee)') AS FullName 
FROM people;

Output:

+-------+---+------------------+
|   Name|Age|          FullName|
+-------+---+------------------+
|  Anand| 25|  Anand (Employee)|
|   Bala| 30|   Bala (Employee)|
|Kavitha| 28|Kavitha (Employee)|
|    Raj| 35|    Raj (Employee)|
+-------+---+------------------+

Example 8: Using lit() with Date and Timestamp Values

PySpark:

# Add a new column 'HireDate' with a constant date value
df_with_hire_date = df.withColumn("HireDate", lit("2023-01-01").cast("date"))
df_with_hire_date.show()

Spark SQL:

SELECT *, CAST('2023-01-01' AS DATE) AS HireDate 
FROM people;

Output:

+-------+---+----------+
|   Name|Age|  HireDate|
+-------+---+----------+
|  Anand| 25|2023-01-01|
|   Bala| 30|2023-01-01|
|Kavitha| 28|2023-01-01|
|    Raj| 35|2023-01-01|
+-------+---+----------+

5. Common Use Cases

  • Adding metadata columns (e.g., country, status, created_date).
  • Creating derived columns with fixed values (e.g., bonuses, default values).
  • Using constant values in complex expressions or transformations.

6. Performance Considerations

  • lit() adds a constant expression that is resolved at query-planning time; it triggers no shuffle or extra data movement, so its overhead is negligible.
  • Combine lit() with other functions (e.g., withColumn(), select()) for advanced transformations.

7. Key Takeaways

  1. The lit() function is used to create a new column with a constant or literal value.
  2. It can be used to add columns with string, numeric, boolean, or null values.
  3. In Spark SQL, similar functionality can be achieved using literal values directly in SELECT statements.
  4. Using lit() is lightweight and adds negligible overhead, even on large datasets.