GH-463: Add more types - time, nano timestamps, UUID to Variant spec (#464)

* Add more types - time, nano timestamps, UUID to Variant.

* Update type names to align with Parquet logical type

* Update logical type

* Update VariantEncoding.md

Co-authored-by: emkornfield <[email protected]>

* Update VariantEncoding.md

Co-authored-by: emkornfield <[email protected]>

---------

Co-authored-by: emkornfield <[email protected]>
aihuaxu and emkornfield authored Dec 10, 2024
1 parent 5740bf1 commit a3dda6a
Showing 1 changed file with 32 additions and 23 deletions.
55 changes: 32 additions & 23 deletions VariantEncoding.md
@@ -365,6 +365,7 @@ It is semantically identical to the "string" primitive type.
The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(val)) + 1`.
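The implied-precision formula can be illustrated with a minimal Python sketch. It assumes `val` refers to the unscaled integer, takes its absolute value, and treats zero as precision 1 (the formula itself is undefined at zero). The function name `implied_precision` is hypothetical and not part of the spec.

```python
def implied_precision(unscaled: int) -> int:
    """Return the digit count of the unscaled value: floor(log10(|v|)) + 1."""
    # Counting the digits of the integer gives the same result as the
    # logarithm formula while avoiding floating-point error on large values.
    # Assumption: zero is reported as precision 1, since log10(0) is undefined.
    return len(str(abs(unscaled)))


if __name__ == "__main__":
    print(implied_precision(100))      # unscaled value 100 has three digits
    print(implied_precision(-12345))   # the sign does not count as a digit
```

Digit counting is used instead of `math.log10` because a float logarithm can round incorrectly near powers of ten for values close to the 38-digit limit.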

# Encoding types
*Variant basic types*

| Basic Type | ID | Description |
|--------------|-----|---------------------------------------------------|
@@ -373,25 +374,37 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| Object | `2` | A collection of (string-key, variant-value) pairs |
| Array | `3` | An ordered sequence of variant values |

| Logical Type | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
| NullType | null | `0` | any | none |
| Boolean | boolean (True) | `1` | BOOLEAN | none |
| Boolean | boolean (False) | `2` | BOOLEAN | none |
| Exact Numeric | int8 | `3` | INT(8, signed) | 1 byte |
| Exact Numeric | int16 | `4` | INT(16, signed) | 2 byte little-endian |
| Exact Numeric | int32 | `5` | INT(32, signed) | 4 byte little-endian |
| Exact Numeric | int64 | `6` | INT(64, signed) | 8 byte little-endian |
| Double | double | `7` | DOUBLE | IEEE little-endian |
| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Date | date | `11` | DATE | 4 byte little-endian |
| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
| Float | float | `14` | FLOAT | IEEE little-endian |
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
*Variant primitive types*

| Type Equivalence Class | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------|
| NullType | null | `0` | any | none |
| Boolean | boolean (True) | `1` | BOOLEAN | none |
| Boolean | boolean (False) | `2` | BOOLEAN | none |
| Exact Numeric | int8 | `3` | INT(8, signed) | 1 byte |
| Exact Numeric | int16 | `4` | INT(16, signed) | 2 byte little-endian |
| Exact Numeric | int32 | `5` | INT(32, signed) | 4 byte little-endian |
| Exact Numeric | int64 | `6` | INT(64, signed) | 8 byte little-endian |
| Double | double | `7` | DOUBLE | IEEE little-endian |
| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Date | date | `11` | DATE | 4 byte little-endian |
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
| Float | float | `14` | FLOAT | IEEE little-endian |
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| TimeNTZ | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
| UUID | uuid | `24` | UUID | 16-byte big-endian |
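As a rough illustration of the binary formats for the newly added types (IDs 21 to 24), the following Python sketch decodes the value bytes only. It assumes the payload has already been separated from the value header, which this section does not describe, and that the 8-byte integers are signed counts since midnight or since the Unix epoch, as in the equivalent Parquet types. All function names are hypothetical.

```python
import struct
import uuid
from datetime import datetime, time, timedelta, timezone
from typing import Tuple


def decode_int64_le(payload: bytes) -> int:
    """Decode an 8-byte little-endian signed integer."""
    (value,) = struct.unpack("<q", payload)
    return value


def decode_time_ntz_micros(payload: bytes) -> time:
    """Type 21: microseconds since midnight, no time zone."""
    micros = decode_int64_le(payload)
    return (datetime.min + timedelta(microseconds=micros)).time()


def decode_timestamp_nanos(payload: bytes, adjusted_to_utc: bool) -> Tuple[datetime, int]:
    """Types 22 and 23: nanoseconds since the epoch.

    Python datetimes only hold microseconds, so the leftover nanoseconds
    (0 to 999) are returned separately.
    """
    nanos = decode_int64_le(payload)
    micros, leftover_nanos = divmod(nanos, 1_000)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc if adjusted_to_utc else None)
    return epoch + timedelta(microseconds=micros), leftover_nanos


def decode_uuid(payload: bytes) -> uuid.UUID:
    """Type 24: 16 bytes in big-endian order."""
    return uuid.UUID(bytes=payload)
```

`uuid.UUID(bytes=...)` interprets its argument in big-endian order, which matches the table; a little-endian UUID layout would need `bytes_le` instead.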

The *Type Equivalence Class* column indicates logical equivalence of physically encoded types.
For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
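The int8 and decimal16 example can be checked with a short Python sketch. It models a decimal as an (unscaled value, scale) pair and compares numeric values exactly; the helper name `decimal_value` is an illustrative assumption, not something defined by the spec.

```python
from decimal import Decimal


def decimal_value(unscaled: int, scale: int) -> Decimal:
    """Build the exact value unscaled * 10**(-scale)."""
    # Constructing from a (sign, digits, exponent) tuple is exact and is not
    # rounded to the decimal context precision, so 38-digit values are safe.
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))


if __name__ == "__main__":
    int8_value = 1                          # an int8 holding 1
    decimal16_value = decimal_value(100, 2)  # scale 2, unscaled value 100
    # Both encodings are in the "Exact Numeric" class and represent the same number.
    print(decimal16_value == int8_value)
```

Comparing by numeric value rather than by encoded bytes is what lets an expression behave the same across physical types in one equivalence class.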

*Decimal table*

| Decimal Precision | Decimal value type |
|-----------------------|--------------------|
@@ -400,10 +413,6 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| 18 <= precision <= 38 | int128 |
| > 38 | Not supported |
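The decimal binary format from the primitive types table (a 1-byte scale followed by the little-endian unscaled value) might be written out as in the Python sketch below. It is built on assumptions: the width in bytes (4, 8, or 16, inferred from the names decimal4, decimal8, decimal16) is supplied by the caller because the rows mapping precision to the narrower types are not visible in this hunk, and the unscaled value is assumed to be a signed two's-complement integer. The function name is hypothetical.

```python
def encode_decimal_payload(unscaled: int, scale: int, width: int) -> bytes:
    """Encode a decimal as: 1-byte scale, then little-endian unscaled value."""
    if not 0 <= scale <= 38:
        raise ValueError("scale must be in the range [0, 38]")
    if width not in (4, 8, 16):
        raise ValueError("width must be 4, 8, or 16 bytes")
    # Precision above 38 digits is listed as not supported.
    if len(str(abs(unscaled))) > 38:
        raise ValueError("precision above 38 is not supported")
    # Assumption: the unscaled value is stored as signed two's complement.
    # to_bytes raises OverflowError if the value does not fit in the width.
    return bytes([scale]) + unscaled.to_bytes(width, "little", signed=True)


if __name__ == "__main__":
    # The value 1.00 stored with scale 2 and unscaled value 100.
    print(encode_decimal_payload(100, 2, 4).hex())
```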

The *Logical Type* column indicates logical equivalence of physically encoded types.
For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.

# String values must be UTF-8 encoded

All strings within the Variant binary format must be UTF-8 encoded.