
DRAFT: Alternative Strawman proposal for a new V3 footer format in Parquet #250

Open
wants to merge 6 commits into base: master

Conversation

@emkornfield (Contributor)

As a point of discussion, a slightly different version from #242 showing how column metadata could be decoupled from FileMetadata.

In particular this takes a slightly different approach:

  1. It introduces a new random-access encoding for Parquet to store serialized data, instead of relying on a one-off index scheme built into the Thrift structure. This approach gives implementations flexibility to further balance size vs. compute trade-offs and can potentially make use of any future encoding improvements. Two downsides of this approach are that it requires a little more work up-front and has slightly more overhead than doing this directly as Thrift structures.
  2. It places the serialized data pages completely outside the Thrift metadata and instead provides an offset within the footer. This is mostly a micro-optimization (likely not critical) to allow Parquet implementors to avoid unnecessary copies of string data if their Thrift library would otherwise force them. There is no reason the pages could not be inlined as a "binary" field in FileMetadata, as is done in #242 (DRAFT: Parquet 3 metadata with decoupled column metadata).
  3. Moves a few other fields out of FileMetadata into a metadata page and raises discussion points on others.
  4. Re-uses existing Thrift objects in attempt to make the transition easier for implementors.
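For illustration only (this layout and these names are hypothetical, not part of the proposal), the random-access encoding alluded to in point 1 can be as simple as concatenated values followed by an offset table, so any element is readable in O(1) without decoding the rest:

```python
import struct

def encode_random_access(values):
    # Hypothetical layout: concatenated values, then n+1 little-endian
    # uint32 offsets into that data, then a uint32 element count.
    data = b"".join(values)
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    tail = struct.pack("<%dI" % len(offsets), *offsets)
    return data + tail + struct.pack("<I", len(values))

def get_element(buf, i):
    # O(1) access: read the count, then the two offsets bounding element i.
    (n,) = struct.unpack_from("<I", buf, len(buf) - 4)
    table = len(buf) - 4 - 4 * (n + 1)
    start, end = struct.unpack_from("<2I", buf, table + 4 * i)
    return buf[start:end]
```

A real encoding would also need to consider offset width and alignment; the point is only that implementations can trade table size against decode cost.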

Things it does not do:

  1. Enumerate all fields that should be deprecated. #242 (DRAFT: Parquet 3 metadata with decoupled column metadata) is a good start, and a final list can be consolidated once a general approach is chosen.
  2. Incorporate the changes in #248 (DRAFT: Incremental improvements to parquet metadata); these also likely make sense but can be incorporated into any final proposal.
  3. Micro-optimizations to separate scan use-cases from filter-evaluation use-cases (the ColumnChunk structure could potentially be split further apart to give finer-grained access to elements that are only needed in one case or the other).

Salient points:

1. Introduce a new encoding that allows random access for byte arrays
2. Use page info as a structure for storing lists.
3. Store pages out of line from the Thrift structure.
@steveloughran left a comment

I like this, provided we are confident that the shipping readers are all happy with it. Easy way to test this: modify parquet-java to add a fake par3 footer and verify that the reading tests all work.

I would prefer absolute file location rather than an offset to the start of the metadata; but it's probably harder to calculate, correct?

Finally, is there anything needed to support an extended par3 trailing footer in future?

README.md Outdated
Therefore, for PAR1 files written before PAR3 was introduced, we always expect the files to have the following
trailing 9 bytes: [0x00, x, x, x, x, P, A, R, 1] (where x can be any value). We also expect all compliant
Thrift parsers to only parse the first available FileMetadata message and stop consuming the stream once it is read.
Today, we don't believe that any Parquet readers validate that the entire "length in bytes of file metadata"
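The expectation quoted above is mechanical enough to check directly; a small sketch (the helper name is made up for illustration):

```python
def looks_like_legacy_par1_tail(tail: bytes) -> bool:
    # Thrift ends a serialized struct with a STOP byte (0x00), which is
    # why the 9th-from-last byte is expected to be zero: it is followed
    # by the 4-byte little-endian metadata length and the PAR1 magic.
    return len(tail) >= 9 and tail[-4:] == b"PAR1" and tail[-9] == 0
```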


there's going to be a need for some regression tests here, maybe a module to validate output from parquet-java against parquet readers from: older parquet-java release(s), parquet-cpp, the impala parser, parquet-dotnet. Because these are the older/frozen releases, a docker image with everything installed should work.

Member

I think it would be sufficient to generate dummy PAR1 file with extraneous padding at the end of the FileMetadata serialization, and see if all implementations can still read it.


There is a way to avoid this. I am preparing a proposal for it.


I was thinking of doing this just by modifying parquet-java to add some trailing bytes and see what broke in the rest of the code, that is: make no changes to the reader

Member

@steveloughran That was my suggestion as well :-)

@emkornfield (Contributor Author) commented May 29, 2024

@alkis if this is the unknown-thrift-field approach, I originally started drafting it up but figured I'd hold off due to @pitrou's strong feelings. But I think it is likely worth seeing it on paper, which might help weigh the pros/cons.

Contributor Author

I'll revise this and move the design considerations elsewhere.

@alkis if this is the approach you sketched in https://github.com/apache/parquet-format/pull/242/files#r1607838732, since I'm revising this anyway, I can incorporate it into the PR so we can consolidate discussion.

Contributor Author

Left a TODO in the readme to figure out desired approach.

Contributor Author

Removed most of this phrasing in favor of a layout.


I am talking about an enhanced version of my original proposal in #242.

https://github.com/apache/parquet-format/pull/254/files?short_path=b335630#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5

I feel this is very powerful. The only complexity is on writers, which need to incorporate the short snippet of code that appends binary to an existing thrift struct. Readers can look at the tail of any thrift-serialized struct and find the extension they expect without parsing thrift.

This means that we can use the approach as a generalized extension mechanism for any thrift struct in parquet.thrift. Once a vendor has a solid extension (or part of the extension) they want to make official it can enter the official struct proper and be removed from the extension.

README.md Outdated (resolved)
README.md Outdated
#### Dual Mode PAR1 and PAR3 footers

There is a desire to gradually rollout PAR3 footers to allow newer readers to take advantage of them, while
older readers can still properly parse the file. This section outlines a strategy to do this.
Member

Just my 2 cents, but I'd rather see a sketch of the actual layout, rather than a description of a strategy :-)
(especially as the strategy is best left to each implementation, depending on their specific constraints and tradeoffs)

Contributor Author

I can add this if we decide to move forward with this PR (as compared to your proposal).

Contributor Author

Changed this to just describe the layout.

README.md Outdated (resolved)
/** Path in schema **/
3: required list<string> path_in_schema
/** Path in schema
* Example of deprecating a field for PAR3
Member

Obviously I'm a bit biased, but I find the approach of defining separate structs (FileMetadataV3, etc.) much cleaner than tediously documenting what is required in V1 and forbidden in V3, and vice-versa.

Contributor Author

I think there are trade-offs; I replied on the mailing list with more details (hopefully we can centralize discussion there).

@emkornfield (Contributor Author)

I would prefer absolute file location rather than an offset to the start of the metadata; but it's probably harder to calculate, correct?

Probably just as easy. However, it means the footer is not self-contained for systems that might want to cache it (they would need to store an additional size value for the file).

@emkornfield left a comment

Update bit mask to

@emkornfield (Contributor Author)

Finally, is there anything needed to support an extended par3 trailing footer in future?

Sorry I missed this. I think the bitmap gives flexibility for extending par3. Or was there something else you were concerned with?

@alkis left a comment

At a high level, the direction of splitting metadata feels flawed:

  1. for wide footers, we know today we spend 10s of milliseconds parsing them. Ideally we want to spend below a millisecond to parse them.
  2. assuming >99% of all data stored in parquet are in object stores and doing a single fetch from object store costs 10s of milliseconds, we want to do only one fetch from the object store

By splitting metadata we fix (1), but we require a secondary fetch from the object store that depends on the first fetch, making (2) twice as long. At this point we are back to square one...

What we want instead is to make parsing of metadata 10-100x faster and to keep the metadata small so that a small (512kb) speculative fetch of the end of a parquet file will contain all the metadata.

Comment on lines +128 to +133
- 4-byte little-endian flag field to indicate features that require special parsing of the footer.
Readers MUST raise an error if there is an unrecognized flag. Current flags:

* 0x01 - Footer encryption enabled (when set the encryption information is written before
FileMeta structure as in the PAR1 footer).
* 0x02 - CRC32 of FileMetadata Footer.
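The MUST-fail behaviour described above can be sketched as follows (flag constants taken from the draft; the function name is hypothetical):

```python
import struct

FLAG_ENCRYPTED_FOOTER = 0x01  # per the draft: encryption info precedes FileMetadata
FLAG_FOOTER_CRC32 = 0x02      # per the draft: CRC32 of FileMetadata present
KNOWN_FLAGS = FLAG_ENCRYPTED_FOOTER | FLAG_FOOTER_CRC32

def parse_footer_flags(raw4: bytes) -> int:
    (flags,) = struct.unpack("<I", raw4)
    unknown = flags & ~KNOWN_FLAGS
    if unknown:
        # Unrecognized flags signal a backwards-incompatible feature,
        # so readers must refuse to guess.
        raise ValueError("unrecognized PAR3 footer flags: 0x%x" % unknown)
    return flags
```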

I don't find this flag particularly elegant. There is already a magic number. We can employ the same mechanism as PAR1/PARE had to distinguish an encrypted footer.

Member

I'm in favor of the feature flags, and it in turn provides a unified PAR3 magic. However, I'm a little bit concerned with the 4-byte limit. Or maybe we need to clarify what kind of features qualify to be added to it.

@emkornfield (Contributor Author) commented May 30, 2024

So for context, I went from 8 bits to 64 bits and I think 32 is a reasonable default. The intent of this particular bitmap is to indicate to consumers that there is a backwards-incompatible change. I think if we had had this originally, we wouldn't have used a different magic number. Looking forward, I imagine things that might go here if we chose to pursue them:

  1. Allowing non-contiguous pages (i.e. making the offset index mandatory).
  2. Compressing the footer as a whole block (might be beneficial if we get to flatbuffers).
  3. Changing to flatbuffers in the future.

I would hope these would be relatively rare, and the last flag of this bitmap can always be reserved to indicate yet another bitmap.

@wgtmac what use-cases were you thinking of?

Member

I was thinking of leveraging the features flag to continue the effort in #164, which defines core features that the parquet file implements.


What are the benefits of feature flags in this place in the file?

We have FileMetaData.version, which can serve the purpose of feature flags for everything except an encrypted footer (since we can't otherwise read it). Put differently, the only feature we cannot encode inside the footer itself is whether the footer is encrypted. For that, it seems we can keep using a secondary magic number forever?

@emkornfield (Contributor Author) commented May 31, 2024

@wgtmac yeah, I was imagining this for far fewer use-cases. For features that readers can detect they don't understand as they read, I think it is fine for that to happen lazily.

Put it differently the only feature we cannot encode inside the footer itself is if the footer is encrypted. For this it seems we can keep using a secondary magic number forever?

@alkis at the very least, compression. If we switch to flatbuffers, I believe they compress quite well (a lot of extra padding in integers)? Would we then have a few more magic footers for the cross product of compression and encryption?

Again, I think there are only a handful of imagined use-cases for this, which is why I originally had it as a single byte. IMO it is a small cost to pay for some potential flexibility, and it is useful at least for encryption.

The PAR3 footer format is designed to better support wider schemas and more control
over the various footer size vs. compute trade-offs. Its format is as follows:
- Serialized Thrift FileMetadata Structure
- (Optional) 4 byte CRC32 of the serialized Thrift FileMetadata.

This should not be optional.

All bytes are not equal in a file. In particular, footer bytes are very important, because if those are corrupt we can't read any bytes of the file. If anything, footers not having a required checksum for their content is a design flaw of the original Parquet specification.

Contributor Author

I tend to agree; others had concerns that it was gratuitous. I can update to make it required.

over the various footer size vs compute trade-offs. Its format is as follows:
- Serialized Thrift FileMetadata Structure
- (Optional) 4 byte CRC32 of the serialized Thrift FileMetadata.
- 4-byte length in bytes (little endian) of all preceding elements in the footer.
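Putting the quoted layout together, a reader-side sketch (simplified and hypothetical: it assumes the trailing length covers only the FileMetadata bytes plus the optional CRC, and omits magic/flags handling):

```python
import struct
import zlib

def read_par3_metadata(footer: bytes, has_crc: bool) -> bytes:
    # Last 4 bytes: little-endian length of the preceding footer bytes.
    (length,) = struct.unpack_from("<I", footer, len(footer) - 4)
    body = footer[len(footer) - 4 - length : len(footer) - 4]
    if has_crc:
        # CRC32 of the serialized FileMetadata sits just before the length.
        meta, (crc,) = body[:-4], struct.unpack("<I", body[-4:])
        if zlib.crc32(meta) != crc:
            raise ValueError("FileMetadata CRC32 mismatch")
        return meta
    return body
```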

I recommend a crc32 of the length itself. This is to detect early that a footer is corrupt and avoid reading some epic amount of garbage from the end of the file. For example, think of a bit flip in one of the top bits of the length: it will cause a reader to read 100s of MBs from the end of the file only to find that the crc doesn't match.

Contributor Author

I'm OK adding this. As a counterpoint, 100s of MB would ideally be rejected by reasonable memory limits on the footer.


there's generally an expectation that filesystems do validation, but CRCs are relatively low cost and help find problems in networking and NICs

/** Optional key/value metadata **/
/** Optional key/value metadata
* PAR1: Optional.
* PAR3: Don't write use key_value_metadata instead.

typo?

Contributor Author

yes, will fix.

// Number of elements stored. This is duplicated here to help in
// use-cases where knowing the total number of elements up front for
// computation would be useful.
3: num_values

missing type?

Member

Is it possible to add a list<i32> with num_values + 1 entries to indicate the offsets of each element before compression, to support random access? Or is it supposed to use fixed-length elements at all times?

Contributor Author

For fixed width types PLAIN encoding gives us this data.

For variable width types, I wanted to solve this in a more general way with the RANDOM_ACCESS_BYTE_ARRAY so that users could choose this to allow for random access. In future iterations when we tackle improving for "point lookup" use cases I imagine that would also help or be another option.

@emkornfield (Contributor Author)

By splitting metadata we are fixing (1) but we are requiring a secondary fetch from the object store that depends on the first fetch and we make (2) twice as long. At this point we are back to square 1...

@alkis I think I might have misphrased or misdesigned something here. Where do you see the second fetch as a requirement? I imagine performant readers will still read a significant portion of the tail of the file to avoid two fetches.

@alkis commented May 30, 2024

@alkis I think I might have misphrased something here or misdesigned something here. Where do you see the require second fetch as a requirement. I imagine performant readers will still read a significant portion of the tail of the file to avoid the two fetches.

Apologies that was my misunderstanding. I conflated #242 which has offsets for metadata elsewhere.

Now that I understand this better it seems that the main trick we are employing is converting list<ExpensiveToDecodePerColumnStruct> to something like list<binary> (with optional compression) to make deserialization lazy.

  1. How is this going to affect workloads that read all the columns (and thus require all the metadata)? Are they getting any benefit?
  2. The effort to fix readers is not going to be trivial. Perhaps it will be similar if we revamp the schema substantially. Why should we stick to Thrift if we are going through the effort of refactoring readers?
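The "main trick" described above amounts to deferring the per-struct parse; as a sketch (the decode callback stands in for a real Thrift deserializer):

```python
class LazyColumnChunks:
    """Holds column-chunk metadata as opaque bytes, decoding on demand."""

    def __init__(self, raw_chunks, decode):
        self._raw = raw_chunks   # list<binary>: cheap to materialize
        self._decode = decode    # expensive per-struct parser
        self._cache = {}

    def __getitem__(self, i):
        # A projected scan touching k of n columns pays k parses, not n.
        if i not in self._cache:
            self._cache[i] = self._decode(self._raw[i])
        return self._cache[i]
```

For a full scan this only reorders the same work, which is the concern raised in point 1: such workloads see little benefit.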

* Each element is a serialized SchemaElement. The order and content should
* have a one to one correspondence with schema.
*/
10: optional MetadataPage schema_page;

Why isn't this list<binary> where each binary is a serialized SchemaElement?

Contributor Author

My intent was to introduce an encoding that allows zero-copy random access, which I think would be better than list<binary>. Plain encoding is effectively equivalent to list<binary> on the wire, I believe, and this way we avoid the up-front cost of decoding the list.

Contributor

@emkornfield have you given more thought to a schema index? One use for random access is column projection, but how does one figure out column indexes for the projection without fully decoding the schema?

Contributor Author

A little bit. I had a few ideas that I haven't had time to write down. Starting from the easiest:

  1. For formats like Iceberg that use field_id three columns (exact physical layout TBD):
  2. For name-based indexes, I was thinking of a flattened list of schema elements, flattened in breadth-first order. At each level we sort the column names (I think there are likely two collations that might be useful: normal lexicographic, and case-insensitive, where all column names are normalized to lower case). The column names are associated with an offset and width to their children. There are potentially more complex data structures that could make this more efficient, but it seems like a reasonable start.

In both cases I think we need to decide whether this is 1-to-1 or 1-to-many (e.g. I've seen parquet files with duplicate column names that differ only by upper/lower case).


My intent was to introduce an encoding that allows zero-copy random access which I think would be better then list which I would guess might be slightly better.

Benchmarks can guide this. From my experiments decoding list<binary> is so much faster than list<MyStruct> that it doesn't show in profiles anymore. Plus this field is O(columns). The ones we really have to optimize are the column chunks that are O(columns * row groups).

// A serialized page including metadata thrift header and data.
1: required binary page
// Optional compression applied to the page.
2: optional CompressionCodec compression

My feeling is that most of the pages encoded in this form are going to be smallish, less than 1kb. For such small sizes, none of the general-purpose compressors will do a good job at compressing.

Are there any benchmarks where we can see the effectiveness of compressing the above pages?

Contributor Author

I think column data is at least 20 bytes per column? This means we get to ~1kb at ~50 columns and ~4kb at ~200 columns, which I don't think are unreasonable. The relative compressibility of metadata is an open question; I will try to measure this for wider schemas. For smaller structures that use this, I think the default would be not to compress.


I suggest not doing this at all unless benchmarks show it is a large win. From experience with general-purpose compressors (Lempel-Ziv style), they fail to compress small entities like ints or ULEBs. Since that's the majority of the stuff we will put here, I do not expect compression to be profitable.

Contributor Author

@alkis I'm OK either way; since compression is already a supported concept, I thought this was a low-cost effort that might have value in some use cases.

As a litmus test, I'm curious what happens if you run LZ4 on the footers you have been experimenting with?

* PAR1: Start populating after encoding_stats is deprecated.
* PAR3: Populate instead of encoding_stats.
*/
17: optional binary serialized_encoding_stats
Member

I think it makes sense to adopt @alkis's idea of using a boolean field here to indicate whether dictionary encoding is applied to all data pages.

Member

Sorry I just saw line 943. I meant field 17 is not required any more.

@etseidl (Contributor) commented May 30, 2024

But I just got this implemented 😭 😉. Joking aside, is there any other use for this field beyond optimizing dictionary based queries?

Edit: looking above, encodings is now deprecated, so this field replaces that and serves the purpose of @alkis's boolean field. So if we adopt the latter, encodings will need to remain required I think.

Contributor Author

Yes I think it serves three purposes:

  1. consolidate encodings.
  2. Determine optimizations available in the engine (e.g. fully dictionary encoded, fully RLE, etc)
  3. Instrumentation for engines to understand the distribution of data they see, to prioritize optimizations for specific encodings.

I'm OK removing it completely, but my concern around point 2 is that spec changes take a while to propagate; having the information for engines to determine what optimizations they want to do, without requiring a new flag, I think is useful.


Can we drop encodings? The purpose is to detect if a reader can read a column or not. What's going to happen if we don't have that information? Reader will try to read and maybe will fail the query mid way. Isn't the latter an acceptable behavior as well?


If ColumnMetaData::encodings could distinguish between PLAIN encoded dictionary and PLAIN data encodings, the detection of full dictionary encoding could be based on that without page specific statistics. If Encoding::PLAIN_DICTIONARY could be un-deprecated and indicate a PLAIN encoded dictionary page then encoding_stats might not be needed for most usecases. A list of [PLAIN_DICTIONARY, RLE_DICTIONARY] could indicate full dictionary encoding for the column.

Contributor Author

Yes, I thought I had marked encodings as dropped (the information here duplicates it at the very least); if not, I'll update it. I'm also OK with readers discovering unreadable pages lazily.

Contributor Author

@jhorstmann this seems plausible; however, I still think there is use in this data from a debuggability standpoint. To elaborate on point 3, I shouldn't have limited it to engines. Being able to quickly dump encodings and how many there are lets users understand a little better how their data is being compressed and potentially fine-tune these aspects. It isn't a frequent occurrence, but there are certainly cases when having a little more data helps with debuggability.

*
* Populated in both PAR1 and PAR3
*/
10: optional ColumnOrder column_order
Member

I'm afraid this might complicate the page index because we have to check consistency of ColumnOrder across row groups. At the moment we have only one ColumnOrder, which is not a big issue, but it may become a problem if we introduce more orders in the future. This order is important to guide us on how to interpret serialized min_value/max_value in the statistics. Perhaps we can put this in the SchemaElement instead?

Contributor Author

SchemaElement sounds OK to me; we can also leave it as is. I think I copied this from #242.

Contributor Author

I'll move.

Contributor Author

I think maybe just keeping this as is makes sense actually, so I'll remove this field and we can keep the list or move it to a DATA_PAGE.


Do we have examples of other orders that might be useful?
Can we drop this concept altogether and make only one ordering available?

Contributor Author

Different collations are a not-uncommon idea in databases (e.g. case-insensitive comparison). One could argue whether this should be modeled as a different logical type, as a requirement on clients on how they store the data, or using this concept, but it is not unreasonable to model it this way.

Contributor

Here is #221 (follow-on to #196).

Comment on lines +1195 to +1197
* A metadata page is a data page used to store metadata about
* the data stored in the file. This is a key feature of PAR3
* footers which allow for deferred decoding of metadata.
Contributor

So the page will have a PageHeader/DataPageHeader at the top? Will nulls or repetition be allowed, i.e. do we need definition and repetition level data? If not, then should we define a new page type instead so we don't have to encode unused level encoding types? Then we could also drop the language below about not writing statistics.

Contributor Author

Defining a new page type is also a reasonable approach. I was leaving this open in case we want nulls in the future. Whether nulls are allowed, and the exact structure, is dictated by its use on the field, and this also minimizes spec changes in this draft. The nice thing about this approach is that it can work transparently if/when a new page type is added.

@emkornfield
Contributor Author

Now that I understand this better it seems that the main trick we are employing is converting list<ExpensiveToDecodePerColumnStruct> to something like list<binary> (with optional compression) to make deserialization lazy.

  1. How is this going to affect workloads that read all the columns (and thus require all the metadata). Are they getting any benefit?

No, it does not. How much regression is present is something we need to benchmark.

  2. The effort to fix readers is not going to be trivial. Perhaps it will be similar if we revamp the schema substantially. Why should we stick to thrift if we are going through the effort to refactor readers?

I think we might have different priors on the effort involved and the scope of changes for readers. I was hoping that with these changes existing readers could update to use it, without the benefit of laziness, in O(hours) to O(days) (e.g. if they happen to expose thrift structures as a public API in their system, like DuckDB). If systems already have abstractions in place for metadata, making use of the laziness should also be pretty quick. I see changing to a different encoding format as more effort, and when multiplied across the number of custom parquet readers it makes ecosystem penetration less likely, or at least slower. I don't want to rule out moving to flatbuffers in the future (this is yet another reason for the feature bitmap).

CC @JFinis, I think your perspective might be valuable here.
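To make the "list<ExpensiveToDecodePerColumnStruct> becomes list<binary>" trick concrete, here is a minimal sketch (all names hypothetical, and the decoder is stubbed) of how a reader could defer deserialization so that only projected columns pay the decoding cost:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class LazyColumnMeta:
    """Holds a serialized ColumnMetaData blob; decodes it on first access."""
    raw: bytes
    decode: Callable[[bytes], dict]  # e.g. a Thrift deserializer in practice
    _cached: Optional[dict] = None

    def get(self) -> dict:
        # Deserialization cost is paid only for columns actually touched.
        if self._cached is None:
            self._cached = self.decode(self.raw)
        return self._cached

def project(blobs: List[bytes], wanted: List[int],
            decode: Callable[[bytes], dict]) -> Dict[int, dict]:
    """Decode metadata only for the projected column indices."""
    return {i: LazyColumnMeta(blobs[i], decode).get() for i in wanted}
```

As the comment above notes, a full scan gets no benefit from this: it ends up decoding every blob anyway.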

@steveloughran

(e.g. if they happen to be exposing thrift structures as a public API)

what do we know here about this? which apps do/don't?

@wgtmac
Member

wgtmac commented May 31, 2024

(e.g. if they happen to be exposing thrift structures as a public API)

what do we know here about this? which apps do/don't?

@steveloughran There is a preliminary investigation: https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit#heading=h.wkbvu2wh1ahj

Comment on lines +128 to +133
- 4-byte little-endian flag field to indicate features that require special parsing of the footer.
Readers MUST raise an error if there is an unrecognized flag. Current flags:

* 0x01 - Footer encryption enabled (when set the encryption information is written before
FileMeta structure as in the PAR1 footer).
* 0x02 - CRC32 of FileMetadata Footer.

What are the benefits of feature flags in this place in the file?

We have FileMetaData.version, which can serve the purpose of feature flags for everything except an encrypted footer (since we can't otherwise read it). Put differently, the only feature we cannot encode inside the footer itself is whether the footer is encrypted. For that, it seems we can keep using a secondary magic number forever?
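The flag-field behavior quoted from the draft above (4-byte little-endian bitmask, error on any unrecognized bit) can be sketched as follows; the constant and function names are hypothetical:

```python
# Flags defined by the current draft.
FOOTER_ENCRYPTED = 0x01  # encryption info precedes FileMeta, as in PAR1
FOOTER_CRC32 = 0x02      # CRC32 of the FileMetadata footer
KNOWN_FLAGS = FOOTER_ENCRYPTED | FOOTER_CRC32

def parse_footer_flags(raw: bytes) -> int:
    """Parse the 4-byte little-endian flag field, rejecting unknown bits."""
    flags = int.from_bytes(raw[:4], "little")
    unknown = flags & ~KNOWN_FLAGS
    if unknown:
        # Per the draft, readers MUST raise on unrecognized flags.
        raise ValueError(f"unrecognized footer flags: {unknown:#x}")
    return flags
```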

@@ -792,12 +806,23 @@ struct ColumnMetaData {
6: required i64 total_uncompressed_size

Can we make num_values, total_uncompressed_size, total_compressed_size i32s?

This doesn't matter much for Thrift, but if we are happy with such a change, it makes a difference for other encodings like flatbuffers.

In addition num_values can be optional and if left unset it can inherit RowGroup.num_rows. Most column chunks are dense and we can save repeating the same value over and over for every column.

Contributor Author

Can we make num_values, total_uncompressed_size, total_compressed_size i32s?

No, we've had bugs in the past due to i32 overflow in various implementations, and incompatibilities with Arrow's i32 offsets because the data stored is larger. I don't recall exactly which of these fields had issues, but it indicates that there are in fact some users who overflow at least signed representations, so even unsigned int32 seems like a potential risk.

In addition num_values can be optional and if left unset it can inherit RowGroup.num_rows. Most column chunks are dense and we can save repeating the same value over and over for every column.

I agree, this is a reasonable optimization.
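The proposed inheritance rule, that an unset num_values falls back to the enclosing row group's num_rows, could be resolved by readers with a helper like this (a sketch; the function name is hypothetical):

```python
from typing import Optional

def effective_num_values(num_values: Optional[int],
                         row_group_num_rows: int) -> int:
    """Return the chunk's value count, inheriting the row-group row count
    when the optional field is left unset (the dense, common case)."""
    return row_group_num_rows if num_values is None else num_values
```

A writer would then only populate num_values for chunks with nulls or repetition, saving one varint per dense column chunk.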

Contributor

I would think uncompressed_size at a minimum would have to remain i64. I've seen single row groups with string columns that exceed 2GB in size. I'd argue the same for total_compressed_size on uncompressed data.

8: optional list<KeyValue> key_value_metadata

/** See description on FileMetata.key_value_metadata **/
19: optional MetadataPage key_value_metadata_page
This is unlikely to make a huge difference. In case it does, should we consider this also:

struct KeyVals {
  1: optional list<string> keys;
  2: optional list<binary> vals;
}

Contributor

Maybe in the weeds, but how would one indicate a null value for a key and distinguish it from an empty string? (value is optional in KeyValue).

* PAR1: Start populating after encoding_stats is deprecated.
* PAR3: Populate instead of encoding_stats.
*/
17: optional binary serialized_encoding_stats
Can we drop encodings? Their purpose is to detect whether a reader can read a column or not. What will happen if we don't have that information? The reader will try to read and may fail the query midway. Isn't the latter acceptable behavior as well?

*
* Populated in both PAR1 and PAR3
*/
10: optional ColumnOrder column_order
Do we have examples of other orders that might be useful?
Can we drop this concept altogether and make only one ordering available?

8: optional list<KeyValue> key_value_metadata

/** See description on FileMetata.key_value_metadata **/
19: optional MetadataPage key_value_metadata_page

/** Byte offset from beginning of file to first data page **/
9: required i64 data_page_offset
This is almost always the same as the start of the columnchunk. We should make this optional and imply it is the same as ColumnChunk.file_offset if missing.

Contributor

If we indeed put the ColumnMetaData in front of the chunk, then this would no longer be true.


This is almost always the same as the start of the columnchunk. We should make this optional and imply it is the same as ColumnChunk.file_offset if missing.

Makes sense

// A serialized page including metadata thrift header and data.
1: required binary page
// Optional compression applied to the page.
2: optional CompressionCodec compression
I suggest not doing this at all unless benchmarks show it is a large win. From experience with general-purpose compressors (Lempel-Ziv style), they fail to compress small entities like ints or ulebs. Since that's the majority of what we will put here, I do not expect compression to be profitable.

* Each element is a serialized SchemaElement. The order and content should
* have a one to one correspondence with schema.
*/
10: optional MetadataPage schema_page;
My intent was to introduce an encoding that allows zero-copy random access, which I would guess might be slightly better than a list.

Benchmarks can guide this. From my experiments decoding list<binary> is so much faster than list<MyStruct> that it doesn't show in profiles anymore. Plus this field is O(columns). The ones we really have to optimize are the column chunks that are O(columns * row groups).

@alkis alkis left a comment

@emkornfield I have a prototype with flatbuffers that generates footers of similar size to the current thrift protocol. I left some notes in this PR with some of the optimizations I did to achieve this.

That said I feel the end state with flatbuffers is much better. The parse time (with verification) is already at 2.5ms/0.8ms for footer1/footer2 respectively, without substantially deviating from the current thrift object model. There is a lot of room for improvements if we deviate.

Even without extra work, at this point we have:

  • 2.5ms parse time for 3MB footer (1 shot fetch time 40ms) means our parse/fetch time ratio is <7%, 1ms parse time for 2MB footer (1 shot fetch time 30ms) for <3% ratio
  • we have no cliffs: full scans, partial scans, all observe good performance (whereas with lazy parsing of metadata full scans will not be improved, or worse will suffer)

Should we join forces and iterate to make flatbuffers even better, while we collect footers from the fleet to compile a benchmark database to further validate the results? I feel there is a lot of room for shrinking Statistics further at which point we will have both smaller and >10x faster to parse footers.

@emkornfield
Contributor Author

@emkornfield I have a prototype with flatbuffers that generates footers of similar size to the current thrift protocol. I left some notes in this PR with some of the optimizations I did to achieve this.

I tried to respond to all the comments.

That said I feel the end state with flatbuffers is much better. The parse time (with verification) is already at 2.5ms/0.8ms for footer1/footer2 respectively, without substantially deviating from the current thrift object model. There is a lot of room for improvements if we deviate.

I tend to agree. I think we should pose this question on the ML thread to see what people's thoughts are on going directly to flatbuffers vs something more iterative on thrift. Deviation might be fine, but in terms of development effort I think we should aim for something that can round-trip with the original footer without loss of data. That would likely make it much easier for implementations to adopt flatbuffers and simply translate them back to Thrift structures as a bare minimum of support.

Should we join forces and iterate to make flatbuffers even better, while we collect footers from the fleet to compile a benchmark database to further validate the results? I feel there is a lot of room for shrinking Statistics further at which point we will have both smaller and >10x faster to parse footers.

I'm happy to collaborate on the flatbuffers approach. Do you want to open up a draft PR with your current state (maybe updated to include the responses here)? I think the last major sticking point might be EncodingStatistics, and I wonder how much that affects flatbuffers, or if we can use the trick here to embed those within their own string (either as flatbuffers or kept as thrift).

@emkornfield
Contributor Author

@alkis posted a question to the ML to get more thoughts on thrift vs flatbuffers. I think before we commit fully to flatbuffers I might try to prototype a more shredded version of this that effectively uses mini parquet files for each of the core structs, to see what that might look like.

Comment on lines +944 to +948
/**
* The index to the SchemaElement in FileMetadata for this
* column.
*/
12: optional i32 schema_index
Contributor

what use case is there for having column chunks be out of order? Wouldn't it be easier to just specify that the column chunks have to be in the same order as the schema?

Contributor

Also note that without the order being the same as the schema, the whole "random access" idea of this PR goes out of the window.

It doesn't help me to have a MetadataPage with random access, if I don't know which column I need to access in the first place.

Instead, what I want is that if I want to access the third column in the schema, then I need to access the third column chunk.

Contributor

As discussed on the mailing list, how about this alternative design:

How about just turning things around: Instead of having a schema_index in the ColumnMetadata, we could have a column_metadata_index in the schema. If that index is missing/-1, then this signifies that the column is empty, so no metadata will be present for it. With this, we would get the best of both worlds: We would always have O(1) random I/O even in case of such empty columns (as we would use the column_metadata_index for the lookup) and we would not need to store any ColumnMetadata for empty columns.

After giving this a second thought, this also makes more sense in general. As the navigation direction is always from schema to metadata (not vice versa!), the schema should point us to the correct metadata instead of the metadata pointing us to the correct schema entry.
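The alternative design sketched in the comment above, a column_metadata_index on the schema element where a negative or missing index marks an empty column, would give readers an O(1) lookup like this (names hypothetical):

```python
from typing import List, Optional

def lookup_column_metadata(column_metadata_index: int,
                           metadata_blobs: List[bytes]) -> Optional[bytes]:
    """O(1) schema-to-metadata lookup. A negative index signifies an empty
    column for which no ColumnMetadata entry was written."""
    if column_metadata_index < 0:
        return None  # empty column: no metadata stored
    return metadata_blobs[column_metadata_index]
```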

Member

@JFinis this would only be useful for columns which are empty throughout all row groups, right? As opposed to skipping some column chunks but not all.

@emkornfield
Contributor Author

@alkis posted a question to the ML to get more thoughts on thrift vs flatbuffers. I think before we commit fully to flatbuffers I might try to prototype a more shredded version of this that effectively uses mini parquet files for each of the core structs, to see what that might look like.

It seems like we might be headed towards flatbuffers as a consensus, so I will put this on the back-burner.

@@ -118,6 +118,51 @@ chunks they are interested in. The columns chunks should then be read sequentia

![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)

### PAR3 File Footers

PAR3 file footer footer format designed to better support wider-schemas and more control


Minor:

Suggested change
PAR3 file footer footer format designed to better support wider-schemas and more control
PAR3 file footer format designed to better support wider-schemas and more control
