Spark: Data Types
Data Type | Example | Use Case | When to Use | Range of Values |
---|---|---|---|---|
ByteType | 1, 127, -128 | Small integers, memory efficiency is key | Small-valued fields such as age, small integer IDs, counters | -128 to 127 |
ShortType | 100, 32767, -100 | Slightly larger integers | Small integer IDs, counters where ByteType’s range is insufficient | -32768 to 32767 |
IntegerType | 1000, 2147483647 | Most common integers | Default for integers unless larger range or memory efficiency is paramount | -2,147,483,648 to 2,147,483,647 |
LongType | 1000000000, 9223372036854775807 | Very large integers | Timestamps (milliseconds), large IDs, counters | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
FloatType | 3.14, -2.718 | Single-precision floating-point numbers | Floating-point numbers where memory efficiency is a concern, scientific data | Approximately ±1.4 × 10^-45 to ±3.4 × 10^38 |
DoubleType | 3.14159, -2.71828 | Double-precision floating-point numbers | Default for floating-point numbers, higher precision needed | Approximately ±4.9 × 10^-324 to ±1.8 × 10^308 |
DecimalType | 123.45, 0.0001 | Fixed-precision decimal numbers | Exact arithmetic for financial/scientific calculations | Defined by precision (up to 38 digits) and scale |
StringType | 'hello', 'world' | Text data | Any textual information | Variable length, no declared maximum; bounded in practice by available memory and JVM array size limits (roughly 2 GB per value) |
BooleanType | true, false | True/false values | Flags, indicators, binary choices | true, false |
DateType | 2024-01-27 | Dates | Storing and manipulating dates | YYYY-MM-DD (date only) |
TimestampType | 2024-01-27 10:00:00 | Dates and times | Storing and manipulating timestamps | YYYY-MM-DD HH:mm:ss[.ffffff] (date and time, microsecond precision) |
BinaryType | b'data' | Raw binary data | Images, audio, other non-textual binary data | Variable length, sequence of bytes |
ArrayType | [1,2,3] | Lists/arrays of values | Single column storing multiple values | Variable length, all elements of the same type |
MapType | {'a':1, 'b':2} | Key-value pairs | Storing structured data where each key maps to a value | Variable length; all keys share one type and all values share one type |
StructType | {'name':'Bob'} | Complex data structures | Storing nested/hierarchical data | Fixed set of named, typed fields defined by the schema |
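YearMonthIntervalType | INTERVAL '1-6' YEAR TO MONTH | Intervals of years and months | Calendar-based durations such as contract or subscription lengths | Years and months; start and end fields chosen from YEAR, MONTH |
DayTimeIntervalType | INTERVAL '2 03:04:05' DAY TO SECOND | Intervals of days, hours, minutes, and seconds | Elapsed time between two timestamps | Days through seconds with microsecond precision; start and end fields chosen from DAY, HOUR, MINUTE, SECOND |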
Reference: Spark SQL Data Types
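The integer rows in the table above amount to a simple rule: pick the narrowest type whose range covers your data. The following is a minimal plain-Python sketch of that rule, not a Spark API; the helper name `smallest_integer_type` is hypothetical, and the ranges are copied from the table.

```python
# Inclusive ranges copied from the table above, ordered narrowest first.
INTEGER_TYPES = [
    ("ByteType", -2**7, 2**7 - 1),
    ("ShortType", -2**15, 2**15 - 1),
    ("IntegerType", -2**31, 2**31 - 1),
    ("LongType", -2**63, 2**63 - 1),
]


def smallest_integer_type(lo: int, hi: int) -> str:
    """Return the narrowest integer type name whose range covers [lo, hi].

    Hypothetical helper for illustration only.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("range exceeds LongType; consider DecimalType")


print(smallest_integer_type(0, 120))                # ages fit in ByteType
print(smallest_integer_type(-1, 40_000))            # too wide for ShortType
print(smallest_integer_type(0, 1_706_349_600_000))  # epoch milliseconds
```

In practice IntegerType is still a sensible default, as the table notes; narrowing only pays off when memory or storage is a real constraint.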
Compare Data Types in Spark with SQL Server
Spark Data Type | SQL Server Data Type | Notes |
---|---|---|
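ByteType | tinyint | Closest match by size, but tinyint is unsigned (0 to 255) while ByteType is signed (-128 to 127); use smallint if negative values can occur. |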
ShortType | smallint | Direct equivalent. |
IntegerType | int | Direct equivalent. |
LongType | bigint | Direct equivalent. |
FloatType | real | Direct equivalent. |
DoubleType | float | Direct equivalent. |
DecimalType | decimal, numeric | Requires matching precision and scale. |
StringType | varchar, nvarchar, char, nchar | Choose based on length and character encoding (Unicode or not). |
BooleanType | bit | Direct equivalent. |
DateType | date | Direct equivalent. |
TimestampType | datetime2, datetimeoffset | datetime2 is generally preferred for its precision. datetimeoffset also stores a time zone offset. |
BinaryType | varbinary, binary | Choose based on length. |
ArrayType | table (with appropriate schema) | Requires creating a separate table to represent the array. No direct equivalent. |
MapType | table (with appropriate schema) | Requires creating a separate table to represent the map. No direct equivalent. |
StructType | table (with appropriate schema) | Requires creating a separate table to represent the struct. No direct equivalent. |
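The mapping above can be captured as a small lookup, which is handy when generating SQL Server DDL from a Spark schema. This is a plain-Python sketch under assumptions: the function `sql_server_type` and its parameters are hypothetical, `nvarchar` and `datetime2` are picked as defaults from the alternatives listed in the table, and a missing length falls back to `max`.

```python
# One-to-one rows from the table above.
# Note: SQL Server tinyint holds 0 to 255, so negative ByteType values need smallint.
SIMPLE_TYPES = {
    "ByteType": "tinyint",
    "ShortType": "smallint",
    "IntegerType": "int",
    "LongType": "bigint",
    "FloatType": "real",
    "DoubleType": "float",
    "BooleanType": "bit",
    "DateType": "date",
    "TimestampType": "datetime2",
}
NO_DIRECT_EQUIVALENT = {"ArrayType", "MapType", "StructType"}


def sql_server_type(spark_type, precision=None, scale=None, length=None):
    """Return a SQL Server type string for a Spark type name (illustrative only)."""
    if spark_type in NO_DIRECT_EQUIVALENT:
        raise ValueError(f"{spark_type} has no direct equivalent; use a separate table")
    if spark_type == "DecimalType":
        if precision is None or scale is None:
            raise ValueError("DecimalType needs both precision and scale")
        return f"decimal({precision},{scale})"
    if spark_type in ("StringType", "BinaryType"):
        base = "nvarchar" if spark_type == "StringType" else "varbinary"
        return f"{base}({length if length is not None else 'max'})"
    return SIMPLE_TYPES[spark_type]


print(sql_server_type("LongType"))
print(sql_server_type("DecimalType", precision=10, scale=2))
print(sql_server_type("StringType", length=50))
```

Raising an error for the nested types, rather than guessing a layout, keeps the decision about how to model arrays, maps, and structs with the person designing the target tables.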
QnA
Q: Which numeric data type should be used if you want to avoid rounding error?
Use DecimalType. Its precision and scale are fixed when you define it, so values within that precision are stored exactly. This avoids the rounding errors that can occur with the binary floating-point types (FloatType, DoubleType).
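The difference is easy to see outside Spark as well. The sketch below uses plain Python as a stand-in: Python's `float` is a double-precision value like DoubleType, a 4-byte `struct` round trip approximates FloatType, and `decimal.Decimal` plays the role of DecimalType. It illustrates the rounding behavior only and is not Spark code.

```python
import struct
from decimal import Decimal

# Double precision: 0.1 and 0.2 have no exact binary representation,
# so their sum is not exactly 0.3.
double_sum = 0.1 + 0.2
print(double_sum == 0.3)  # False

# Single precision: packing into 4 bytes and back loses even more digits.
as_float32 = struct.unpack("f", struct.pack("f", 0.1))[0]
print(as_float32 == 0.1)  # False

# Decimal arithmetic works in base 10, so the same sum is exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True
```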
Q: What is the maximum number of characters that can be stored in StringType?
StringType is variable-length and has no declared maximum. In practice a single value is bounded by available memory and by JVM array size limits (roughly 2 GB), so you can store very large strings in StringType columns.
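Because the practical limit is about bytes rather than characters, it helps to remember that the two can differ. The sketch below assumes string values are held as UTF-8 bytes (as Spark does internally) and uses plain Python to compare character count with encoded size; the helper name `utf8_size` is hypothetical.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# ASCII, a Latin letter with an accent (e-acute), and three CJK characters.
samples = ["hello", "h\u00e9llo", "\u65e5\u672c\u8a9e"]
for s in samples:
    print(f"{s!r}: {len(s)} characters, {utf8_size(s)} bytes")
```

ASCII text takes one byte per character, while accented Latin letters take two and most CJK characters take three, so the number of characters that fit in a given amount of memory depends on the content.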