diff --git a/VariantEncoding.md b/VariantEncoding.md
index c6d2d113..02bcd0c0 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -365,6 +365,7 @@ It is semantically identical to the "string" primitive type.
 The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(val)) + 1`.
 
 # Encoding types
+*Variant basic types*
 
 | Basic Type | ID | Description |
 |--------------|-----|---------------------------------------------------|
@@ -373,25 +374,37 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Object | `2` | A collection of (string-key, variant-value) pairs |
 | Array | `3` | An ordered sequence of variant values |
 
-| Logical Type | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
-|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
-| NullType | null | `0` | any | none |
-| Boolean | boolean (True) | `1` | BOOLEAN | none |
-| Boolean | boolean (False) | `2` | BOOLEAN | none |
-| Exact Numeric | int8 | `3` | INT(8, signed) | 1 byte |
-| Exact Numeric | int16 | `4` | INT(16, signed) | 2 byte little-endian |
-| Exact Numeric | int32 | `5` | INT(32, signed) | 4 byte little-endian |
-| Exact Numeric | int64 | `6` | INT(64, signed) | 8 byte little-endian |
-| Double | double | `7` | DOUBLE | IEEE little-endian |
-| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
-| Date | date | `11` | DATE | 4 byte little-endian |
-| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
-| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
-| Float | float | `14` | FLOAT | IEEE little-endian |
-| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
-| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
+*Variant primitive types*
+
+| Type Equivalence Class | Physical Type | Type ID | Equivalent Parquet Type | Binary format |
+|----------------------|-----------------------------|---------|-----------------------------|---------------------------------------------------------------------------------------------|
+| NullType | null | `0` | any | none |
+| Boolean | boolean (True) | `1` | BOOLEAN | none |
+| Boolean | boolean (False) | `2` | BOOLEAN | none |
+| Exact Numeric | int8 | `3` | INT(8, signed) | 1 byte |
+| Exact Numeric | int16 | `4` | INT(16, signed) | 2 byte little-endian |
+| Exact Numeric | int32 | `5` | INT(32, signed) | 4 byte little-endian |
+| Exact Numeric | int64 | `6` | INT(64, signed) | 8 byte little-endian |
+| Double | double | `7` | DOUBLE | IEEE little-endian |
+| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
+| Date | date | `11` | DATE | 4 byte little-endian |
+| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
+| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
+| Float | float | `14` | FLOAT | IEEE little-endian |
+| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
+| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
+| TimeNTZ | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
+| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
+| TimestampNTZ | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
+| UUID | uuid | `24` | UUID | 16-byte big-endian |
+
+The *Type Equivalence Class* column indicates logical equivalence of physically encoded types.
+For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
+Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
+
+*Decimal table*
 
 | Decimal Precision | Decimal value type |
 |-----------------------|--------------------|
@@ -400,10 +413,6 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | 18 <= precision <= 38 | int128 |
 | > 38 | Not supported |
 
-The *Logical Type* column indicates logical equivalence of physically encoded types.
-For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
-Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
-
 # String values must be UTF-8 encoded
 
 All strings within the Variant binary format must be UTF-8 encoded.
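
Review note: the *Type Equivalence Class* paragraph in this patch says an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100. The sketch below illustrates that claim for the "Exact Numeric" class only. It is not part of the patch and not a reference decoder. The function name `decode_exact_numeric` and the two lookup tables are hypothetical. It assumes `payload` holds only the bytes that follow the value header (the header layout is not shown in this diff), that the unscaled decimal value is a signed two's-complement integer, and that decimal4, decimal8, and decimal16 use 4, 8, and 16 bytes for it, as the int32/int64/int128 rows of the decimal table suggest.

```python
from decimal import Decimal

# Type IDs are taken from the primitive type table in the patch.
# Byte widths for the decimal types are an assumption inferred from their names
# and from the int32 / int64 / int128 rows of the decimal table.
INT_WIDTHS = {3: 1, 4: 2, 5: 4, 6: 8}       # int8, int16, int32, int64
DECIMAL_WIDTHS = {8: 4, 9: 8, 10: 16}       # decimal4, decimal8, decimal16


def decode_exact_numeric(type_id: int, payload: bytes) -> Decimal:
    """Decode an Exact Numeric payload (bytes after the value header) to a Decimal."""
    if type_id in INT_WIDTHS:
        width = INT_WIDTHS[type_id]
        if len(payload) < width:
            raise ValueError("payload too short for integer type")
        return Decimal(int.from_bytes(payload[:width], "little", signed=True))

    if type_id in DECIMAL_WIDTHS:
        width = DECIMAL_WIDTHS[type_id]
        if len(payload) < 1 + width:
            raise ValueError("payload too short for decimal type")
        scale = payload[0]                   # 1 byte scale, must be in [0, 38]
        if scale > 38:
            raise ValueError("decimal scale out of range [0, 38]")
        # Assumption: the unscaled value is signed, little-endian.
        unscaled = int.from_bytes(payload[1:1 + width], "little", signed=True)
        # Build the Decimal from (sign, digits, exponent) so no context rounding occurs.
        sign = 1 if unscaled < 0 else 0
        digits = tuple(int(ch) for ch in str(abs(unscaled)))
        return Decimal((sign, digits, -scale))

    raise ValueError(f"type id {type_id} is not in the Exact Numeric class")


# The example from the patch: int8 value 1 vs. decimal16 with scale 2, unscaled 100.
int8_one = decode_exact_numeric(3, bytes([1]))
dec16_one = decode_exact_numeric(10, bytes([2]) + (100).to_bytes(16, "little", signed=True))
print(int8_one == dec16_one)
```

Mapping both encodings to one exact value type before comparison is one way to get the "should behave the same" property. An implementation could equally normalize on write instead.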