The lit() function in Spark is used to create a new column with a constant or literal value. It is part of the pyspark.sql.functions module and is particularly useful when you need to add a column with a fixed value to a DataFrame. This function is often used in combination with other transformations, such as withColumn().


1. Syntax

PySpark:

from pyspark.sql.functions import lit

lit(value)

2. Parameters

  • value: The constant to embed in the new column. This can be a string, number, boolean, None, or any other literal value.

3. Return Type

  • Returns a Column object representing the constant value.

4. Examples

Example 1: Adding a Constant Column to a DataFrame

PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("LitExample").getOrCreate()

data = [("Anand", 25), ("Bala", 30), ("Kavitha", 28), ("Raj", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

df_with_country = df.withColumn("Country", lit("India"))
df_with_country.show()

Spark SQL:

SELECT *, 'India' AS Country 
FROM people;

Output:

+-------+---+-------+
|   Name|Age|Country|
+-------+---+-------+
|  Anand| 25|  India|
|   Bala| 30|  India|
|Kavitha| 28|  India|
|    Raj| 35|  India|
+-------+---+-------+

Example 2: Adding a Numeric Constant Column

PySpark:

# Add a new column 'Bonus' with a constant value 1000
df_with_bonus = df.withColumn("Bonus", lit(1000))
df_with_bonus.show()

Spark SQL:

SELECT *, 1000 AS Bonus 
FROM people;

Output:

+-------+---+-----+
|   Name|Age|Bonus|
+-------+---+-----+
|  Anand| 25| 1000|
|   Bala| 30| 1000|
|Kavitha| 28| 1000|
|    Raj| 35| 1000|
+-------+---+-----+

Example 3: Using lit() in an Expression

PySpark:

from pyspark.sql.functions import col

# Add a new column 'TotalSalary' as the product of 'Age' and a constant value 1000
df_with_total_salary = df.withColumn("TotalSalary", col("Age") * lit(1000))
df_with_total_salary.show()

Spark SQL:

SELECT *, Age * 1000 AS TotalSalary 
FROM people;

Output:

+-------+---+-----------+
|   Name|Age|TotalSalary|
+-------+---+-----------+
|  Anand| 25|      25000|
|   Bala| 30|      30000|
|Kavitha| 28|      28000|
|    Raj| 35|      35000|
+-------+---+-----------+

Example 4: Adding a Boolean Constant Column

PySpark:

# Add a new column 'IsActive' with a constant value True
df_with_active = df.withColumn("IsActive", lit(True))
df_with_active.show()

Spark SQL:

SELECT *, TRUE AS IsActive 
FROM people;

Output:

+-------+---+--------+
|   Name|Age|IsActive|
+-------+---+--------+
|  Anand| 25|    true|
|   Bala| 30|    true|
|Kavitha| 28|    true|
|    Raj| 35|    true|
+-------+---+--------+

Example 5: Using lit() with Null Values

PySpark:

# Add a new column 'Manager' with a constant value None (null)
df_with_manager = df.withColumn("Manager", lit(None).cast("string"))
df_with_manager.show()

Spark SQL:

SELECT *, NULL AS Manager 
FROM people;

Output:

+-------+---+-------+
|   Name|Age|Manager|
+-------+---+-------+
|  Anand| 25|   null|
|   Bala| 30|   null|
|Kavitha| 28|   null|
|    Raj| 35|   null|
+-------+---+-------+

Example 6: Using lit() with Conditional Logic

PySpark:

from pyspark.sql.functions import when

# Add a new column 'Status' with a constant value 'Active' for employees older than 30
df_with_status = df.withColumn("Status", 
                               when(col("Age") > 30, lit("Active"))
                               .otherwise(lit("Inactive")))
df_with_status.show()

Spark SQL:

SELECT *, 
       CASE 
           WHEN Age > 30 THEN 'Active' 
           ELSE 'Inactive' 
       END AS Status 
FROM people;

Output:

+-------+---+--------+
|   Name|Age|  Status|
+-------+---+--------+
|  Anand| 25|Inactive|
|   Bala| 30|Inactive|
|Kavitha| 28|Inactive|
|    Raj| 35|  Active|
+-------+---+--------+

Example 7: Using lit() with String Concatenation

PySpark:

from pyspark.sql.functions import concat

# Add a new column 'FullName' by concatenating 'Name' with a constant value ' (Employee)'
df_with_full_name = df.withColumn("FullName", concat(col("Name"), lit(" (Employee)")))
df_with_full_name.show()

Spark SQL:

SELECT *, CONCAT(Name, ' (Employee)') AS FullName 
FROM people;

Output:

+-------+---+------------------+
|   Name|Age|          FullName|
+-------+---+------------------+
|  Anand| 25|  Anand (Employee)|
|   Bala| 30|   Bala (Employee)|
|Kavitha| 28|Kavitha (Employee)|
|    Raj| 35|    Raj (Employee)|
+-------+---+------------------+

Example 8: Using lit() with Date and Timestamp Values

PySpark:

# Add a new column 'HireDate' with a constant date value
df_with_hire_date = df.withColumn("HireDate", lit("2023-01-01").cast("date"))
df_with_hire_date.show()

Spark SQL:

SELECT *, CAST('2023-01-01' AS DATE) AS HireDate 
FROM people;

Output:

+-------+---+----------+
|   Name|Age|  HireDate|
+-------+---+----------+
|  Anand| 25|2023-01-01|
|   Bala| 30|2023-01-01|
|Kavitha| 28|2023-01-01|
|    Raj| 35|2023-01-01|
+-------+---+----------+

5. Common Use Cases

  • Adding metadata columns (e.g., country, status, created_date).
  • Creating derived columns with fixed values (e.g., bonuses, default values).
  • Using constant values in complex expressions or transformations.

6. Performance Considerations

  • lit() adds a constant expression that is resolved at query-planning time; it triggers no shuffle or extra data movement, so its overhead is negligible.
  • Combine lit() with other functions (e.g., withColumn(), select()) for advanced transformations.

7. Key Takeaways

  1. The lit() function is used to create a new column with a constant or literal value.
  2. It can be used to add columns with string, numeric, boolean, or null values.
  3. In Spark SQL, similar functionality can be achieved using literal values directly in SELECT statements.
  4. Using lit() is lightweight and adds negligible overhead, even on large datasets.