GH-463: Add more types - time, nano timestamps, UUID to Variant spec #464

Merged 5 commits on Dec 10, 2024
55 changes: 32 additions & 23 deletions VariantEncoding.md
@@ -365,6 +365,7 @@ It is semantically identical to the "string" primitive type.
The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(val)) + 1`.
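
The implied-precision formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the spec: the function name `implied_precision` is hypothetical, `val` is assumed to be the magnitude of the unscaled integer, and treating zero as one digit is an assumption, since `log_10(0)` is undefined.

```python
def implied_precision(unscaled: int) -> int:
    """Count the decimal digits of the unscaled value.

    For a non-zero integer this equals floor(log10(|val|)) + 1, but it is
    computed from the digit string to avoid floating-point rounding.
    """
    if unscaled == 0:
        return 1  # assumption: zero is treated as a single digit
    return len(str(abs(unscaled)))
```

For example, an unscaled value of 100 has an implied precision of 3, regardless of its scale.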

# Encoding types
*Variant basic types*

| Basic Type | ID | Description |
|--------------|-----|---------------------------------------------------|
@@ -373,25 +374,37 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| Object | `2` | A collection of (string-key, variant-value) pairs |
| Array | `3` | An ordered sequence of variant values |

| Logical Type | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
| NullType | null | `0` | any | none |
| Boolean | boolean (True) | `1` | BOOLEAN | none |
| Boolean | boolean (False) | `2` | BOOLEAN | none |
| Exact Numeric | int8 | `3` | INT(8, signed) | 1 byte |
| Exact Numeric | int16 | `4` | INT(16, signed) | 2 byte little-endian |
| Exact Numeric | int32 | `5` | INT(32, signed) | 4 byte little-endian |
| Exact Numeric | int64 | `6` | INT(64, signed) | 8 byte little-endian |
| Double | double | `7` | DOUBLE | IEEE little-endian |
| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Date | date | `11` | DATE | 4 byte little-endian |
| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
| Float | float | `14` | FLOAT | IEEE little-endian |
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
*Variant primitive types*

| Type Equivalence Class | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------|
| NullType | null | `0` | any | none |
| Boolean | boolean (True) | `1` | BOOLEAN | none |
| Boolean | boolean (False) | `2` | BOOLEAN | none |
| Exact Numeric | int8 | `3` | INT(8, signed) | 1 byte |
| Exact Numeric | int16 | `4` | INT(16, signed) | 2 byte little-endian |
| Exact Numeric | int32 | `5` | INT(32, signed) | 4 byte little-endian |
| Exact Numeric | int64 | `6` | INT(64, signed) | 8 byte little-endian |
| Double | double | `7` | DOUBLE | IEEE little-endian |
| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Date | date | `11` | DATE | 4 byte little-endian |
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
| Float | float | `14` | FLOAT | IEEE little-endian |
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| TimeNTZ | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
| UUID | uuid | `24` | UUID | 16-byte big-endian |
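
As a rough illustration of the binary formats for the types added in this change, the sketch below encodes only the value payload (no header byte). The function names are hypothetical. It assumes, following the equivalent Parquet types, that a TIME value counts microseconds since midnight and that a NANOS timestamp counts nanoseconds since the Unix epoch.

```python
import struct
import uuid
from datetime import time


def encode_time_ntz_micros(t: time) -> bytes:
    """Type ID 21: 8-byte little-endian count of microseconds since midnight (assumed)."""
    micros = ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond
    return struct.pack("<q", micros)


def encode_timestamp_nanos(nanos_since_epoch: int) -> bytes:
    """Type IDs 22 and 23: 8-byte little-endian signed integer."""
    return struct.pack("<q", nanos_since_epoch)


def encode_uuid(u: uuid.UUID) -> bytes:
    """Type ID 24: 16 bytes, most significant byte first (big-endian)."""
    return u.bytes
```

Note that the UUID is the only entry in the table stored big-endian; every other multi-byte value is little-endian.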

The *Type Equivalence Class* column indicates logical equivalence of physically encoded types.
For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
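
The int8/decimal16 example can be checked with a short Python sketch. The helper `decimal_value` is hypothetical and only models the logical value of a decoded decimal; it is one possible way an implementation might compare values within the Exact Numeric class, not a prescribed method.

```python
from decimal import Decimal


def decimal_value(unscaled: int, scale: int) -> Decimal:
    """Build the exact logical value unscaled * 10**(-scale)."""
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # The tuple constructor is exact and does not depend on the context precision.
    return Decimal((sign, digits, -scale))


int8_value = 1                           # decoded from an int8
decimal16_value = decimal_value(100, 2)  # decoded from a decimal16: 1.00

# Both encodings denote the same logical number.
assert Decimal(int8_value) == decimal16_value
```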

*Decimal table*

| Decimal Precision | Decimal value type |
|-----------------------|--------------------|
@@ -400,10 +413,6 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| 18 <= precision <= 38 | int128 |
| > 38 | Not supported |
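
A decimal payload, as described in the primitive types table (1 byte of scale followed by the little-endian unscaled value), might be produced as sketched below. The function name is hypothetical. Because some rows of the decimal table are collapsed in this view, the sketch does not apply the table's precision thresholds; it simply picks the narrowest of the 4-, 8-, and 16-byte widths that holds the unscaled value. That is an assumption and may choose a narrower width than the table prescribes for values near a boundary.

```python
def encode_decimal_payload(unscaled: int, scale: int) -> bytes:
    """Return scale byte + little-endian two's-complement unscaled value (no header byte)."""
    if not 0 <= scale <= 38:
        raise ValueError("scale must be in the range [0, 38]")
    if len(str(abs(unscaled))) > 38:
        raise ValueError("precision greater than 38 is not supported")
    # Try the decimal4, decimal8, and decimal16 widths in turn (assumed selection rule).
    for width in (4, 8, 16):
        try:
            return bytes([scale]) + unscaled.to_bytes(width, "little", signed=True)
        except OverflowError:
            continue
    raise ValueError("unscaled value does not fit in 16 bytes")
```

For instance, the value 1.00 (unscaled 100, scale 2) yields the five bytes `02 64 00 00 00` under this sketch.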

The *Logical Type* column indicates logical equivalence of physically encoded types.
For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.

# String values must be UTF-8 encoded

All strings within the Variant binary format must be UTF-8 encoded.