From b2ab47978883b845b9079ee0cdbaa3ded804b845 Mon Sep 17 00:00:00 2001
From: Aihua Xu
Date: Tue, 29 Oct 2024 14:12:36 -0700
Subject: [PATCH 1/5] Add more types - time, nano timestamps, UUID to Variant.

---
 VariantEncoding.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/VariantEncoding.md b/VariantEncoding.md
index c6d2d113..79b243ae 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -392,6 +392,10 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Float | float | `14` | FLOAT | IEEE little-endian |
 | Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
 | String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
+| TimeNTZ | time without time zone | `21` | TIME(false, MICROS) | 8-byte little-endian |
+| Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian |
+| TimestampNTZ_ns | timestamp without time zone | `23` | TIMESTAMP(false, NANOS) | 8-byte little-endian |
+| UUID | uuid | `24` | UUID | 16 bytes |

 | Decimal Precision | Decimal value type |
 |-----------------------|--------------------|

From c6fc1eb2eac80fc5de1b94ac9c6880b943300d92 Mon Sep 17 00:00:00 2001
From: Aihua Xu
Date: Fri, 1 Nov 2024 09:40:30 -0700
Subject: [PATCH 2/5] Update type names to align with Parquet logical type

---
 VariantEncoding.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/VariantEncoding.md b/VariantEncoding.md
index 79b243ae..5b065789 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -387,14 +387,14 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
 | Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
 | Date | date | `11` | DATE | 4 byte little-endian |
-| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
-| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
+| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
+| Timestamp | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
 | Float | float | `14` | FLOAT | IEEE little-endian |
 | Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
 | String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
-| TimeNTZ | time without time zone | `21` | TIME(false, MICROS) | 8-byte little-endian |
-| Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian |
-| TimestampNTZ_ns | timestamp without time zone | `23` | TIMESTAMP(false, NANOS) | 8-byte little-endian |
+| Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
+| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
+| Timestamp | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
 | UUID | uuid | `24` | UUID | 16 bytes |

 | Decimal Precision | Decimal value type |

From de25a2866262b128d3cbe15efa9444a84a2f5e30 Mon Sep 17 00:00:00 2001
From: Aihua Xu
Date: Sun, 24 Nov 2024 11:05:48 -0800
Subject: [PATCH 3/5] Update logical type

---
 VariantEncoding.md | 59 +++++++++++++++++++++++++---------------------
 1 file changed, 32 insertions(+), 27 deletions(-)

diff --git a/VariantEncoding.md b/VariantEncoding.md
index 5b065789..c65fdc8c 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -365,6 +365,7 @@ It is semantically identical to the "string" primitive type.
 The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(val)) + 1`.

 # Encoding types
+*Variant basic types*

 | Basic Type | ID | Description |
 |--------------|-----|---------------------------------------------------|
@@ -373,29 +374,37 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Object | `2` | A collection of (string-key, variant-value) pairs |
 | Array | `3` | An ordered sequence of variant values |

-| Logical Type | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
-|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
-| NullType | null | `0` | any | none |
-| Boolean | boolean (True) | `1` | BOOLEAN | none |
-| Boolean | boolean (False) | `2` | BOOLEAN | none |
-| Exact Numeric | int8 | `3` | INT(8, signed) | 1 byte |
-| Exact Numeric | int16 | `4` | INT(16, signed) | 2 byte little-endian |
-| Exact Numeric | int32 | `5` | INT(32, signed) | 4 byte little-endian |
-| Exact Numeric | int64 | `6` | INT(64, signed) | 8 byte little-endian |
-| Double | double | `7` | DOUBLE | IEEE little-endian |
-| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Date | date | `11` | DATE | 4 byte little-endian |
-| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
-| Timestamp | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
-| Float | float | `14` | FLOAT | IEEE little-endian |
-| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
-| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
-| Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
-| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
-| Timestamp | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
-| UUID | uuid | `24` | UUID | 16 bytes |
+*Variant primitive types*
+
+| Logical Type | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
+|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------|
+| NullType | null | `0` | any | none |
+| Boolean | boolean (True) | `1` | BOOLEAN | none |
+| Boolean | boolean (False) | `2` | BOOLEAN | none |
+| Exact Numeric | int8 | `3` | INT(8, signed) | 1 byte |
+| Exact Numeric | int16 | `4` | INT(16, signed) | 2 byte little-endian |
+| Exact Numeric | int32 | `5` | INT(32, signed) | 4 byte little-endian |
+| Exact Numeric | int64 | `6` | INT(64, signed) | 8 byte little-endian |
+| Double | double | `7` | DOUBLE | IEEE little-endian |
+| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Date | date | `11` | DATE | 4 byte little-endian |
+| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
+| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
+| Float | float | `14` | FLOAT | IEEE little-endian |
+| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
+| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
+| TimeNTZ | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
+| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
+| TimestampNTZ | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
+| UUID | uuid | `24` | UUID | 16-byte big-endian |
+
+The *Logical Type* column indicates logical equivalence of physically encoded types.
+For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
+Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
+
+*Decimal table*

 | Decimal Precision | Decimal value type |
 |-----------------------|--------------------|
@@ -404,10 +413,6 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | 18 <= precision <= 38 | int128 |
 | > 38 | Not supported |

-The *Logical Type* column indicates logical equivalence of physically encoded types.
-For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
-Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
-
 # String values must be UTF-8 encoded

 All strings within the Variant binary format must be UTF-8 encoded.

From e40e3f44b779240c8f170ad488241fe3fea447fb Mon Sep 17 00:00:00 2001
From: Aihua Xu
Date: Sun, 8 Dec 2024 16:20:57 -0800
Subject: [PATCH 4/5] Update VariantEncoding.md

Co-authored-by: emkornfield
---
 VariantEncoding.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/VariantEncoding.md b/VariantEncoding.md
index c65fdc8c..2359ffe6 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -376,7 +376,7 @@ The Decimal type contains a scale, but no precision. The implied precision of a

 *Variant primitive types*

-| Logical Type | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
+| Type Equivalence Class | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
 |----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------|
 | NullType | null | `0` | any | none |
 | Boolean | boolean (True) | `1` | BOOLEAN | none |

From c0c78fa070bc58c84e9923000ba29f7a2af897c3 Mon Sep 17 00:00:00 2001
From: Aihua Xu
Date: Sun, 8 Dec 2024 16:21:05 -0800
Subject: [PATCH 5/5] Update VariantEncoding.md

Co-authored-by: emkornfield
---
 VariantEncoding.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/VariantEncoding.md b/VariantEncoding.md
index 2359ffe6..02bcd0c0 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -400,7 +400,7 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | TimestampNTZ | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
 | UUID | uuid | `24` | UUID | 16-byte big-endian |

-The *Logical Type* column indicates logical equivalence of physically encoded types.
+The *Type Equivalence Class* column indicates logical equivalence of physically encoded types.
 For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
 Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
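
Reviewer note (illustrative only, not part of the patch): the binary formats added in this series, and the int8/decimal16 equivalence described above, can be sketched in Python roughly as follows. The helper name `decode_primitive` is hypothetical, and only the value bytes are decoded; parsing the value header is out of scope. The sketch assumes the 8-byte fields are signed integers, that TIME counts microseconds since midnight (as in the Parquet logical type), and that the decimal unscaled value is two's-complement signed. The tables above do not state any of these explicitly.

```python
import struct
import uuid
from datetime import datetime, time, timedelta
from decimal import Decimal

# Type IDs copied from the primitive type table in this series.
INT8, DECIMAL16 = 3, 10
TIME_NTZ, TIMESTAMP_NS, TIMESTAMP_NTZ_NS, UUID_ID = 21, 22, 23, 24


def decode_primitive(type_id: int, payload: bytes):
    """Hypothetical helper: decode the value bytes for a few primitive types."""
    if type_id == INT8:
        return Decimal(int.from_bytes(payload, "little", signed=True))
    if type_id == DECIMAL16:
        # 1 byte scale, then a 16-byte little-endian unscaled value
        # (assumed to be signed two's complement).
        scale = payload[0]
        unscaled = int.from_bytes(payload[1:17], "little", signed=True)
        sign = 1 if unscaled < 0 else 0
        digits = tuple(int(d) for d in str(abs(unscaled)))
        # The tuple constructor is exact; it is not rounded to the context precision.
        return Decimal((sign, digits, -scale))
    if type_id == TIME_NTZ:
        # Assumed: signed 64-bit count of microseconds since midnight.
        (micros,) = struct.unpack("<q", payload)
        return (datetime.min + timedelta(microseconds=micros)).time()
    if type_id in (TIMESTAMP_NS, TIMESTAMP_NTZ_NS):
        # datetime has no nanosecond precision, so return the raw count.
        (nanos,) = struct.unpack("<q", payload)
        return nanos
    if type_id == UUID_ID:
        # uuid.UUID(bytes=...) expects the 16 bytes in big-endian order.
        return uuid.UUID(bytes=payload)
    raise ValueError(f"unhandled type id {type_id}")
```

Under these assumptions, `decode_primitive(INT8, b"\x01")` and `decode_primitive(DECIMAL16, bytes([2]) + (100).to_bytes(16, "little"))` compare equal, which is the behavior the *Type Equivalence Class* column asks readers to provide for values in the same class.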