DRAFT: Parquet 3 metadata with decoupled column metadata #242

Draft · wants to merge 1 commit into master from v3-metadata

Conversation

@pitrou (Member) commented May 16, 2024

Parquet 3 metadata proposal

This is a very rough attempt at solving the problem of FileMetadata footprint and decoding cost, especially for Parquet files with many columns (think tens of thousands of columns).

Context

This is in the context of the broader "Parquet v3" discussion on the mailing list. A number of possible far-reaching changes are being collected in a document.

It is highly recommended that you read at least that document before commenting on this PR.

Specifically, some users would like to use Parquet files for data with tens of thousands of columns, and potentially hundreds or thousands of row groups. Reading the file-level metadata for such a file is prohibitively expensive given the current file structure where all column-level metadata is eagerly decoded as part of file-level metadata.

Contents

It includes a bunch of changes:

  1. a new "Parquet 3" file structure with backwards compatibility with legacy readers
  2. new Thrift structures allowing for decoupled decoding of file-level metadata and column metadata: file metadata becomes O(n_columns + n_row_groups) instead of O(n_columns * n_row_groups); see the sketch after this list
  3. removal of outdated, redundant or undesirable fields from the new structures
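To make the footprint concrete, the current coupling can be observed with existing tooling; here is a minimal PyArrow sketch (the file name is hypothetical), not part of this proposal:

import pyarrow.parquet as pq

# Today every ColumnChunk of every RowGroup is decoded as part of the footer,
# so the serialized footer grows roughly with n_row_groups * n_columns.
md = pq.ParquetFile("wide.parquet").metadata  # hypothetical wide file
print(md.num_columns, md.num_row_groups, md.serialized_size)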

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@pitrou force-pushed the v3-metadata branch 3 times, most recently from f6ad6e1 to 4adccbc on May 16, 2024 at 11:24
@tustvold (Contributor) commented May 16, 2024

FWIW I'd be very interested to see how far we can push the current data structures with approaches like apache/arrow-rs#5775, before reaching for format changes.

I'd also observe that the column statistics can already be stored separately from FileMetadata, and if you do so you're really only left with a couple of integers... The schema strikes me as a bigger potential bottleneck, but also one that I can't help feeling is unavoidable...

@gszadovszky (Contributor) left a comment

Great initiative, @pitrou!

@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}

struct ColumnChunkV3 {
/** File where column data is stored. **/
1: optional string file_path
Contributor

Do we want to keep this concept of having the metadata in a separate file? I did not see it working anywhere.

Member Author

I think we do support it in PyArrow (see tests here and here), and I think Dask makes use of it.

@jorisvandenbossche @mrocklin Am I right?

Member

This should be the summary metadata file, IIUC. cc @rdblue for more context.


Deferring this question from Dask's perspective to @fjetter and @phofl


(thanks for pinging us)

@rok (Member) Jun 6, 2024

PyArrow provides write_metadata and parquet_dataset for such use cases. The original PR goes into more depth.
cc @jorisvandenbossche
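For reference, a minimal sketch of that workflow with current PyArrow, following the recipe from the PyArrow documentation (paths and column names are illustrative; keyword support may vary between PyArrow versions):

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Collect the per-file metadata (including file paths) while writing the dataset.
metadata_collector = []
pq.write_to_dataset(table, "dataset_root", metadata_collector=metadata_collector)

# Write a schema-only _common_metadata file and a _metadata summary file that
# embeds the row group metadata of every data file.
pq.write_metadata(table.schema, "dataset_root/_common_metadata")
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=metadata_collector)

# Readers can then build the dataset from the summary file alone.
dataset = ds.parquet_dataset("dataset_root/_metadata")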


And the benefits of using this _metadata index file should translate to cloud object stores too by reducing the number of objects/files to be read.

not really

  1. It forces a full read of all generated files during job commit, which is really slow even when done in parallel. If it were to be done at all, it would be better done on demand in the first query. (Note: faster reads would improve this.)
  2. It doesn't work with the cloud committer design, which #1361 formalises, without some bridging classes.

The reason for (2) is that the Hadoop cloud-native committer design kept clear of making any changes to the superclass of ParquetOutputCommitter, as it is a critical piece of code in so many existing workflows, and really hard to understand: not just a co-recursive algorithm, but two intermingled algorithms, one of which lacks the correctness guarantee that failures during task commit can be recovered from.

With a move to table-based formats rather than directory trees, that whole commit process becomes much easier, as well as supporting atomic job commits on a table (including deletes!). And as you note, these formats can include schema info too.

@jorisvandenbossche (Member) Jun 6, 2024

For context, AFAIK the _metadata summary file was a practice originally used in Spark (and supported by parquet-mr), and, inspired by that, also adopted for example in Dask. We then implemented support for this in Arrow C++ / PyArrow mostly based on the Dask usage (as a downstream user of PyArrow). In the meantime, though, Spark disabled writing those files by default a long time ago (https://issues.apache.org/jira/browse/SPARK-15719), and Dask also stopped doing this 2 years ago (dask/dask#8901).

Another Parquet dev mailing list thread with some discussion about this: https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7


@adamreeve I see so the parquet file is one with all the metadata and all the data is in files pointed to by this singleton.


And the benefits of using this _metadata index file should translate to cloud object stores too by reducing the number of objects/files to be read.

not really

Sorry, I couldn't really follow this argument; it sounds like a Hadoop-specific problem. To me a cloud object store means something like S3. For our use case we're mostly concerned with reducing the number of objects that need to be read to satisfy read queries that filter data and don't need to read all files in a dataset, as we have many concurrent jobs running and adding load on storage.

Another Parquet dev mailing list thread with some discussion about this: https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7

Much of the discussion there seems to be related to issues users can run into if doing things like overwriting Parquet files or having heterogeneous schemas, which this feature was not designed for. But it sounds like others have also found this feature useful. I think this quote from Patrick Woody matches our experience: "outstandingly useful when you have well laid out data with a sort-order"

@adamreeve I see so the parquet file is one with all the metadata and all the data is in files pointed to by this singleton.

Yes, exactly.

The _metadata file format could have been designed so that the file_path field wasn't needed in the column chunk metadata. But it's there now and provides value to users while adding minimal overhead to those not using it (missing fields require zero space in serialized messages if I've understood the Thrift Compact Protocol correctly).

@pitrou (Member Author) commented May 16, 2024

FWIW I'd be very interested to see how far we can push the current data structures with approaches like apache/arrow-rs#5775, before reaching for format changes.

At first sight this would be a Rust-specific optimization. Also, while such improvements are good in themselves, they don't address the fundamental issue that file metadata size is currently O(n_row_groups * n_columns).

I'd also observe that the column statistics can already be stored separately from FileMetadata, and if you do so you're really only left with a couple of integers...

The main change in this PR is that a RowGroupV3 structure is O(1), instead of O(n_columns) for a RowGroup. The rest are assorted improvements.

@tustvold (Contributor) commented May 16, 2024

they don't address the fundamental issue that file metadata size is currently O(n_row_groups * n_columns).

Is it not still - https://github.com/apache/parquet-format/pull/242/files#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR1337

Edit: Oh I see you lose the per row-groupness. Although there is nothing to prevent having one big row group...

@pitrou (Member Author) commented May 16, 2024

Hmm. If SchemaElementV3 is an issue, we might further decouple things I suppose. Though I'm not sure how one would look up columns by names without decoding all the schema elements.

Writing one big row group is of course possible, but it probably comes with its own problems (such as RAM consumption in the writer?).

@tustvold (Contributor)

At first sight this would be a Rust-specific optimization

The same optimisation could be done in C++, borrows are just pointers with compiler enforced lifetimes, but I accept it might be harder to achieve something similar in managed languages like Java without at least some cost.

@pitrou (Member Author) commented May 16, 2024

The same optimisation could be done in C++, borrows are just pointers with compiler enforced lifetimes

This assumes the Thrift C++ APIs allow this.

@pitrou force-pushed the v3-metadata branch 2 times, most recently from 9d4b0bd to e6a9088 on May 16, 2024 at 15:01
3: optional list<SortingColumn> sorting_columns

/** REMOVED from v1: file_offset.
* Use the OffsetIndex for each column instead.
Member

I believe this is useful when we want to estimate the whole read range but do not want to read offset index.

@pitrou (Member Author) May 16, 2024

So my question would be: how do you estimate the read range if you only have the file offset, but not the on-disk length? Using total_compressed_size perhaps?

Contributor

total_compressed_size should give the range I think.

Member

IIUC, the current spec does not prohibit placing column chunks in random order. So we must use file_offset together with total_compressed_size to determine the read range. This is a trick used to place small column chunks together which may be fetched in a single I/O.
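For illustration, this is the kind of computation being described, expressed against today's metadata with PyArrow accessors (using the page offsets rather than the deprecated file_offset; the file name is hypothetical):

import pyarrow.parquet as pq

def row_group_read_range(md: pq.FileMetaData, rg: int):
    """Return the (start, end) byte range covering every column chunk of a
    row group, regardless of the order the chunks were laid out in the file."""
    start, end = None, 0
    for i in range(md.num_columns):
        col = md.row_group(rg).column(i)
        # The dictionary page (if any) precedes the data pages of the chunk.
        first = col.dictionary_page_offset or col.data_page_offset
        last = first + col.total_compressed_size
        start = first if start is None else min(start, first)
        end = max(end, last)
    return start, end

md = pq.ParquetFile("example.parquet").metadata
print(row_group_read_range(md, 0))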

@pitrou force-pushed the v3-metadata branch 2 times, most recently from 9c71ce0 to 3d651b8 on May 16, 2024 at 15:26
@pitrou (Member Author) commented May 16, 2024

I added a "PAR3 without legacy metadata" variation for the distant future.

will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.

### Parquet 3 without legacy metadata
Member Author

@wgtmac @gszadovszky Let me know if it is a good idea to even mention this :-)

Contributor

My 2 cents: I think at some point we want to avoid the duplication.

Member

I agree. There will be a time when people will agree to remove the old metadata. I added the "PAR1" magic number specifically for this :)

@kiszk (Member) commented May 16, 2024

This is not directly related to a new structure. However, it would be a good opportunity to explicitly declare the endianness of data and meta-data.

1: required i32 version

/** Parquet schema for this file **/
2: required list<SchemaElementV3> schema;
@emkornfield (Contributor) May 16, 2024

Lists cannot really be decoded in a random access manner. I suggest in V3 we should consider moving any list elements to a Page that has type byte_array where each element is a serialized struct (thrift or something else if we choose to move away from it).

Contributor

For heavily nested lists we might want to separate type-specific fields.

Member Author

Does it matter here?

Contributor

No, not here. I think I need to do a more formal auditing to see what makes sense.


Skip lists do improve random-access performance for lists, or some variant where, as the element list is built up, an offset to a later list element is added somehow. But I don't know of any easy way to do that here.

Contributor

My concrete suggestion is to introduce a page encoding that supports random access: #250 which IIUC is similar to the approach described here for Columns but allows for the solution to be more easily generalized.

Member Author

If you build random access at the encoding level, then how about the compression step?

<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
Contributor

I think we might want something slightly extended here. In particular, I think it is possible we want a digest (e.g. SHA-256) of the v3 header. This can serve two purposes:

  1. Be able to truly disambiguate the unlikely case that "PAR3" ends up here by accident (e.g. an unencoded data page where the last value ends in "PAR3").
  2. Additional ability to ensure contents are reliable.

Second, I think we might want to record an offset to the "full footer" containing additional index information, etc.

Contributor

Lastly, we might want to consider whether compression should be a setting; if we move enough stuff out of Thrift, this probably isn't a concern anymore.

Member Author

  1. How many users have asked to sha256-protect the header, or how likely is it to get a corrupt header in the real world? We shouldn't start making gratuitous additions that are not backed by concrete use cases.

  2. I don't know what you mean with "full footer", are you talking about the FileMetadata struct (which is still here, just below)? Something else?

  3. As for a false positive "PAR3", there is indeed a small risk, though the additional "PAR3" at the beginning of the file should help disambiguate.

  4. What does compression have to do here? I'm not following you.

@emkornfield (Contributor) May 16, 2024

  1. A very small fraction, likely. A lighter-weight digest is also fine; we have digests in other parts of the spec, and I think the main reason for not having one on the footer was to avoid compatibility issues.
  2. By "full footer" I mean FileMetadata plus serialized metadata like indexes/bloom filters and anything we move to data pages, after all the column chunks.
  3. It isn't clear to me that everyone will check the header. This adds an additional IO for not too much benefit, unless the entire file is being retrieved from disk.
  4. Compressing the Thrift-serialized data to minimize size, if consumers want the ultimate smallest file.

Member Author

I guess we can add a CRC32 here if other people want it.


@emkornfield IIUC the digest is not to protect for corruption but to make sure we do not mistakenly read a V3 footer in a file without one if we happen to see "PAR3" bytes before the V1 footer, correct?

Contributor

I think it can serve both purposes.


I posit the latter is more important. Reading a valid encoded PAR1 file as PAR3 by accident is unlikely. But in zettabytes of data stored in parquet globally it will happen. When it happens someone is going to come here with a pitchfork.

Wherever we store a new metadata footer, its content hash needs to be stored somewhere for verification. A CRC32 of the footer should be good enough.

Contributor

Wherever we store a new metadata footer, its content hash needs to be stored somewhere for verification. A CRC32 of the footer should be good enough.

This is exactly what I'm suggesting; I think it solves both use cases (SHA-1 is probably overkill). Today there is no digest on the footer as far as I know.
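For what it's worth, such a check is cheap to express; here is a small self-contained sketch in Python, where the byte layout (CRC32 placed just before the length and the "PAR3" magic) is purely illustrative and not something this PR defines:

import struct
import zlib

def frame_footer(payload: bytes) -> bytes:
    """Append a CRC32 of the serialized v3 footer, then its length and magic."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack("<II", crc, len(payload)) + b"PAR3"

def check_footer(tail: bytes) -> bytes:
    """Validate the trailing bytes and return the footer payload, rejecting
    both corruption and an accidental trailing "PAR3" byte sequence."""
    if tail[-4:] != b"PAR3":
        raise ValueError("no v3 footer")
    crc, length = struct.unpack("<II", tail[-12:-4])
    payload = tail[-12 - length:-12]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("footer checksum mismatch (or false-positive PAR3 marker)")
    return payload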

2: required i64 num_rows

/** If set, specifies a sort ordering of the rows in this row group. */
3: optional list<SortingColumn> sorting_columns
Contributor

I wonder if this should be per file. I guess the downside is the expense of concatenating files with different sort orders.

Member Author

I have no idea. Not knowing the precise use cases for this, I would err on the side of safety and not change this.

Contributor

I was going to make the same comment. I would imagine the point of sorting_columns is to tip off a query engine as to which columns can be used for filtering (and perhaps also which columns are used as a compound key?). I can't see how it would make sense for this to vary per row group. Since we're blowing things up, I'd vote to move this to FileMetaData.

Contributor

I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)

Member

I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)

Yes, but it can be worked around by simply dropping sort order metadata if row groups have inconsistent values. This is different to codec.

Contributor

That is fair, but why lose the per-row-group information? It seems some engines could make use of it, although it is probably superseded by page indexes.

*/

/** Total byte size of all the uncompressed column data in this row group **/
1: required i64 total_byte_size
Contributor

I'm not sure if this field is particularly useful, as it doesn't account for encodings. We might want a different field name with a clearer meaning.

@pitrou (Member Author) May 16, 2024

This field was introduced in 2013, I'm not sure anyone still has a memory of the reasons. The author is named Nong Li, without a corresponding GitHub account. @nongli Is it you?
Also ping @julienledem, who was there too.

@pitrou (Member Author) May 16, 2024

It seems to be used by parquet-mr's ParquetInputFormat.getSplits, which hasn't changed much since 2014. It doesn't seem to be used elsewhere in parquet-mr, and isn't used by Parquet C++ at all.

Contributor

I guess it might be used as a proxy for actual data size, but it isn't a good one. Maybe we can keep it and add another, better field.

Member Author

I'm ok with removing this, FTR, I would just like to make sure this isn't actually useful for some unexpected use case.

Member

I believe it was added before that; this commit is just referencing when we created the separate parquet-format repo. It might well be part of the original metadata. The redfile.thrift name refers to the old RedElm name, before we renamed it Parquet. It is the size of the decompressed page (after decompression, before decoding). Agreed, it's not super useful.

6: optional string created_by

/** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
7: required list<i64> file_column_metadata_offset;
Contributor

is this supposed to be at the beginning of the column chunk? Or is it part of the footer?

Contributor

I think having the column information in the footer is probably still useful.

Member Author

By "in the footer", you mean after all the data?
In the README above, it is laid out thusly, so I suppose it is in the footer :-)

[...]
<File-level Column 1 Metadata v3>
...
<File-level Column N Metadata v3>

File Metadata v3
[...]

Though, of course, this is purely advisory and writers are free to place them elsewhere.


/** NEW from v1: Optional key/value metadata for this column at the file level
**/
3: optional list<KeyValue> key_value_metadata
Contributor

We might want to model this as two data pages (key and value), or follow the suggestion above of simply using a page to individually store the encodings.

@pitrou (Member Author) May 16, 2024

This is file-level metadata. Do we have reasons to believe there will be enough key-values to warrant the complexity of using dedicated data pages for this?

The more we deviate from the Parquet 1 file organization, the more work it will create for implementors and the more potential for incompatibilities and bugs.

We should perhaps ask on the ML for opinions...

Edit: started a discussion on https://lists.apache.org/thread/9z4o0zbz34lt5jtllzfrr4gmxczddqyb

Contributor

It is mostly due to the fact that I think we want to follow a policy of "lazy" decoding as much as possible.

Member Author

I'm not convinced that "lazy as much as possible" is that desirable. "Lazy where necessary" sounds more reasonable to me. Hence my question about typical metadata sizes.

Member

Putting all user-defined metadata in a list is subject to limitations from Thrift. That's why we have to care about its size. Switching it to a separate "page" or something may unblock other potential uses, like putting an extra user-defined secondary index there. For now, we can only choose a "black hole" somewhere in the file and put its offset/length pair into the key_value_metadata if we want to add a custom index.

Member Author

Putting all user-defined metadata in a list is subject to limitations from thrift. That's why we have to care about its size.

Ok, but is it a practical concern? In Parquet C++ we have:

constexpr int32_t kDefaultThriftStringSizeLimit = 100 * 1000 * 1000;
// Structs in the thrift definition are relatively large (at least 300 bytes).
// This limits total memory to the same order of magnitude as
// kDefaultStringSizeLimit.
constexpr int32_t kDefaultThriftContainerSizeLimit = 1000 * 1000;

For now, we can only choose a "black hole" from somewhere in the file and put its offset/length pair into the key_value_metadata if we want to add custom index.

Well, you could also have a special-named column with 1 defined BYTE_ARRAY value for the piece of metadata you care about (or you could also model it more finely using Parquet types).

3: required i64 num_rows

/** Row groups in this file **/
4: required list<RowGroupV3> row_groups


As per discussion above, we would really like to move away from using a list for the row_groups so that individual row_groups can be read in a random access way. That is, we don't have to read data about all the row_groups if we just want a single row group from the parquet file.


Perhaps that's true but is this something an engine would do other than when dealing with a sampling read or limit query?

Member Author

Note that RowGroupV3 is heavily reduced compared to RowGroup. That said, I understand the concern.

Contributor

Perhaps that's true but is this something an engine would do other than when dealing with a sampling read or limit query?

I think yes? An engine might dispatch individual row groups to different workers if there is external metadata that lets it know the number of row groups in a file beforehand. Again, given the reduced size of row groups, this might be too much of a micro-optimization.

@corwinjoy

@pitrou In conjunction with this change, if we want improved random access for row groups and columns I think this would also be a good time to upgrade the OffsetIndex / ColumnIndex in two key ways:

  1. Have OffsetIndex be stored in a random access way rather than using a list so that an individual page chunk can be loaded without needing to read the entire OffsetIndex array.
  2. Have OffsetIndex explicitly include the dictionary page in addition to any data pages so that column data can be directly loaded from the OffsetIndex without needing to get all offsets from the metadata.

I think this would make the ColumnIndex a lot more powerful as it could then be used for projection pushdown in a much faster way without the large overhead it has now.

@emkornfield (Contributor) commented May 16, 2024

@pitrou In conjunction with this change, if we want improved random access for row groups and columns I think this would also be a good time to upgrade the OffsetIndex / ColumnIndex in two key ways:

  1. Have OffsetIndex be stored in a random access way rather than using a list so that an individual page chunk can be loaded without needing to read the entire OffsetIndex array.
  2. Have OffsetIndex explicitly include the dictionary page in addition to any data pages so that column data can be directly loaded from the OffsetIndex without needing to get all offsets from the metadata.

I think this would make the ColumnIndex a lot more powerful as it could then be used for projection pushdown in a much faster way without the large overhead it has now.

@corwinjoy IMO, I think these are reasonable suggestions, but I think they can be handled as a follow-up once we align on design principles here. In general for dictionaries (and other "auxiliary") metadata we should maybe consider this more holistically, on how pages can be linked effectively.

@@ -467,6 +467,35 @@ struct SchemaElement {
10: optional LogicalType logicalType
}

struct SchemaElementV3 {
@emkornfield (Contributor) May 16, 2024

Antoine, on the question of keeping implementation simpler, would it pay to not revise this and just reuse the existing one?


+1 these changes are not necessary, we should mark the relevant bits in SchemaElement deprecated instead.

Member Author

We can, of course, but deprecated fields will probably never be removed from the format (except perhaps in 20 years).


Are deprecated fields a problem?

Assuming a writer writes them, the writer wastes CPU and bytes.
A reader can choose to parse them or ignore them (by removing/commenting them out in the .thrift file). The latter means the deprecated fields will be ignored by the Thrift parser.

Member Author

Mandating that readers remove fields from the official Parquet Thrift file sounds like a complicated and error-prone arrangement.

@alkis May 21, 2024

There is no mandate. The official parquet thrift will comment them out.

  1. Writers compiled with old version of the official thrift file may write the fields.
  2. Writers compiled with new version of the official thrift file won't write the fields.
  3. Readers compiled with old version of the official thrift file may read the fields.
  4. Readers compiled with new version of the official thrift file will ignore the fields.

Member Author

The problem is that, for any given implementation, we'll have either 1+3 (compiled with old version: backwards compatibility but no perf improvement when reading), or 2+4 (compiled with new version: better performance on read, but backwards compatibility is lost).

This doesn't satisfy the requirements we're trying to satisfy here.


Oh I see. The problem is old readers and new writers.

Contributor

I'm OK with no cleanup of deprecated fields. Marking them semantically as logically required for PAR1 footers seems reasonable to me. I hope at some point standalone PAR3 would become the default and they would not need to be written if they present meaningful overhead once that happens.

From a wire compatibility perspective, required -> optional should be forward/backward compatible.

/** Nested fields. */
5: optional i32 num_children;

/** CHANGED from v1: from i32 to i64
Contributor

Have you encountered a use case for this? A field ID over 2 billion seems quite high.


Same question!


Same. I don't think this change is needed.

Member Author

Ok. It seemed a bit arbitrary to limit the width of this, but I can undo the change.

Contributor

I think it is backwards compatible and, in the long run, it probably isn't a big deal either way. I was just curious if there was a reason.

Member Author

No concrete reason actually :-)

@tustvold (Contributor) commented May 16, 2024

Perhaps we could articulate the concrete use-cases we want to support with this? I understand that there is a desire to support extremely wide schemas of say 10,000 columns, but the precise nature of these columns eludes me?

The reason I ask this is that if we stick with a standard page size of 1MB, then a 10,000-column table with even distribution across the columns is unlikely to ever need multiple row groups - it will be 10GB with just a single row group. This seems at odds with the stated motivation of this PR to avoid scaling per row group, which makes me think I am missing something.

Perhaps the use-case involves much smaller column chunks than normal, which would imply small pages, which might require changes beyond metadata if we want to support effectively? But at the same time I struggle to see why you would want to do this?

As an aside I did some toy benchmarking of parquet-rs, and confirmed that using thrift is perfectly fine, and can perform on par with flatbuffers - apache/arrow-rs#5770 (comment). It's a toy benchmark and should therefore be taken with a big grain of salt, but it at least would suggest 1us per column chunk is feasible


File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
4-byte magic number "PAR3"
Contributor

One more thought: potentially have a 16-bit or 32-bit flag map, initially set to zero, if we do want to allow for future iterations on metadata interpretation.

Member Author

Data point: we added this to the Arrow IPC format and nobody to my knowledge is making use of it.

Member

It seems to be superior to ParquetVersion of parquet-cpp in terms of representing a collection of enabled features.

Member Author

But what would be a concrete use case for those feature flags?

Member

Test if the reader is compatible and time to upgrade?

Member Author

Readers can already scan the metadata and see if they recognize all encodings, etc.
Also, what would be the point of erroring out if a feature is optional?

Contributor

There is a lot of current discussion on evolution; some of these ideas might come to fruition and some won't.

As a concrete example, if we have a bitmap, we no longer need different magic footer/header values for compression.

Other potential items that we might not want to close off:

  1. Move to flatbuffers in the future.
  2. Allow for discontinuous column chunks (i.e. require the offset index to read data)

Member Author

Ok, we can perhaps add 4 reserved bytes just before the magic number. But I would recommend against giving them any specific meaning until necessary.

Contributor

Sorry, for the concrete use case I meant to say "encryption" which I think we could use immediately?

Member Author

You're right that encryption would be an immediate use case (as you did in #250, btw).

@emkornfield (Contributor)

Perhaps we could articulate the concrete use-cases we want to support with this? I understand that there is a desire to support extremely wide schemas of say 10,000 columns, but the precise nature of these columns eludes me?

At least in the datasets I've seen, there are a small number of rows after filtering (i.e. more columns than rows).

@julienledem (Member) commented May 17, 2024

Thank you Antoine
On the mailing list Micah is collecting feedback in a document.
https://lists.apache.org/thread/61z98xgq2f76jxfjgn5xfq1jhxwm3jwf

Would you mind putting your feedback there?
We should collect the goals before jumping to solutions.
It is a bit difficult to discuss goals directly in the thrift metadata.

4-byte magic number "PAR1"
4-byte magic number "PAR3"

<Column 1 Chunk 1 + Column Metadata>
@mhaseeb123 May 21, 2024

Wouldn't <Column 1 Chunk 1 + Column 1 Chunk 1 Metadata> and so on be better here according to the V3 metadata format?

Member Author

I've literally copied this from the original snippet above. This proposal doesn't change anything in the layout here, so it would be confusing if I expressed it differently, IMHO.

Comment on lines +162 to +165
This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.

Efficient I/O can be challenging with this proposal. A reader needs to read the last 8 bytes of the file, then read 8 bytes before the legacy footer, figure out a v3 footer exists and then read that.

It would be better if the v3 metadata are at the end of the file, right before the 4-byte len + PAR1.
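For clarity, here is the read sequence being described, sketched in Python against the layout proposed in the README (three dependent reads before the v3 footer is in hand; this is the extra round trip under discussion):

import struct

def locate_v3_footer(f, file_size: int):
    # (1) Trailing 8 bytes: legacy footer length + "PAR1".
    f.seek(file_size - 8)
    par1_len, magic = struct.unpack("<I4s", f.read(8))
    assert magic == b"PAR1"
    # (2) The 8 bytes just before the legacy footer: v3 length + "PAR3", if present.
    probe = file_size - 8 - par1_len - 8
    f.seek(probe)
    par3_len, magic3 = struct.unpack("<I4s", f.read(8))
    if magic3 != b"PAR3":
        return None  # plain Parquet 1 file
    # (3) The v3 footer itself.
    f.seek(probe - par3_len)
    return f.read(par3_len)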

Member Author

Can you suggest a layout that preserves compatibility with PAR1 readers?


There are a few options:

  1. Use FileMetadata.version to introduce a new version of the metadata. Starting from the minimal change that can be done in place (DRAFT: Incremental improvements to parquet metadata #248) we can bump the version and remove columns from RowGroup and decouple the column metadata completely.
  2. Add a binary field to FileMetadata named v3_metadata with tag number 10003. This field will encode the flatbuffer/thrift representation of the new footer. This field is going to be encoded last by thrift. Readers can manually look at the tail of the file and, if they find this field, they can ignore the rest of the footer and parse these bytes only, ignoring the old-style footer altogether.

@pitrou (Member Author) May 21, 2024

The goal here is twofold:

  1. achieve better metadata parsing performance for PAR3 readers
  2. keep compatibility with PAR1 readers

This is why this proposal creates a separate array of structure types: so that PAR3 readers don't have to eagerly decode those pesky columns, while letting PAR1 readers correctly access column information.

I don't think either of your two proposals is able to achieve those two goals simultaneously, are they?
(admittedly, I'm not sure I understand proposal number 2, though it seems to require hand-coded Thrift parsing which doesn't sound like a tremendous idea)

@emkornfield (Contributor) May 21, 2024

I think for 2, thrift could avoid parsing it assuming that we still follow the pattern of nested footer.

e.g. <field_marker 10003 and byte size><[serialized v3 metadata] + <v3 trailing bits (length, digest, feature bitmask)>"PAR3">0x0000<footer size>PAR1; as long as the byte size in the thrift header accounts for everything through PAR3 (as @alkis mentions below), it should work.

So the encoding/serialization would be manual, but on decoding old readers should automatically drop the unknown field (it is possible some thrift implementations retain unknown fields, I know proto does); i.e. the field ID 10003 should never actually be modeled in the schema.

(Note: 0x0000 is the stop field for structs, if I am reading the thrift spec correctly.)

So the trade-offs of doing this approach are:

  1. A bit of extra data to be copied for readers accessing the original version.
  2. A guaranteed lower bound on amount of IO operations for V3 since it is incorporated into v2
  3. Potentially more memory utilization if accessing the original version if unknown fields are maintained by thrift implementation.

Effectively, for the operation as currently proposed in V3, the trade-offs are reversed.
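To make the wire-level trick above concrete, here is a rough sketch of emitting such an unknown binary field by hand, assuming the Thrift compact protocol (BINARY type id 0x08, long-form field header with a zigzag varint field id); this is illustrative only, not something the PR specifies:

def _varint(n: int) -> bytes:
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | 0x80 if n else b)
        if not n:
            return bytes(out)

def _zigzag(n: int) -> int:
    return (n << 1) ^ (n >> 63)

def wrap_v3_footer(v3_payload: bytes, field_id: int = 10003) -> bytes:
    """Emit a compact-protocol BINARY field carrying the v3 footer bytes.
    A field id this far from the previous one needs the long form: type byte
    0x08, then the zigzag-encoded field id, then the length-prefixed payload."""
    header = bytes([0x08]) + _varint(_zigzag(field_id))
    return header + _varint(len(v3_payload)) + v3_payload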

@tustvold (Contributor) May 22, 2024

With fetch latencies of that order, does the decode latency of the existing thrift payload even matter? I would be interested to see any empirical data that would suggest the current metadata structures are an appreciable bottleneck for use-cases involving a catalog. I created a mailing list thread related to this here

Member Author

Thanks for the concrete numbers. This gives us more precise insight for us to work on (though not all Parquet files are read from S3; local reads and writes are a very common use case as well). However, I'm a bit surprised by your numbers because typical guidance for S3 involves larger reads (i.e. multi-megabyte).

@alkis May 22, 2024

On S3 reads:

Guidance from S3 involves larger reads because of cost: reads from S3 typically cost an API call, and bandwidth/transfer volume is free (same zone). 8MB reads will cost half the price of 4MB reads. Each connection can go up to ~100MB/sec, so just transferring 4MB of data is at least 25ms. If one is reading large files, doing 100s of 8MB reads in parallel will saturate any network card - which is typically what engines do.

I posit that the vast majority of parquet encoded data is in some cloud somewhere (like >90% of all data). Hence working well with object stores (high latency, immutable files) is a requirement for any change. This is also lesson 4 from An Empirical Evaluation of Columnar Storage Formats.

With fetch latencies of that order, does the decode latency of the existing thrift payload even matter?

Yes it does. With the numbers above in mind:

  • cold reads: baseline 110ms (as above). Optimized metadata are down to 2MB and parsed in 5ms, which translates to 20ms + 30ms + 5ms = 55ms --> 2x speedup
  • warm reads: footer bytes are cached on disk/s3express/something-with-low-latency. It takes 5ms fetch + 40ms parse. Optimized, it takes 2ms to fetch + 5ms to parse = 7ms --> 6x speedup


I posit that many PiB of parquet data lives in HDFS, which also has read latencies, though not as bad as S3.
S3 standard has pretty bad latency for small byte reads, plus you get billed; best to grab large amounts speculatively.

Contributor

I'm late to this thread and haven't fully read it. I just want to give my +1 that the current proposal would require too many I/O requests, which makes it basically a deal breaker for high-latency storage like S3. We would not use this because of that.

Any change that increases the number of data-dependent requests necessary to decode is basically a deal breaker for us, and I guess for a lot of other data lake companies as well.


File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
4-byte magic number "PAR3"

How do encrypted footers work in this case?

Member Author

Hopefully, the same way they work for PAR1, but I agree this will need to be spelled out more explicitly :-)

Contributor

Yep, might need some analysis. If we continue using "PARE", can the readers know if the file is v1 or v3?
Maybe we can use something like "PRE3" instead (PaRquet Encrypted, or "PQE3" - ParQuet Encrypted, or "PEF3" - parquet encrypted footer)

Member Author

Ah, you're right. I thought the "PARE" footer was in addition to the "PAR1" footer, but apparently it replaces it. Darn.

Contributor

Would using PAR3 + bit in a bitmap to represent encryption be a reasonable approach here to reduce the proliferation of magic footer values?

Contributor

Sure, as long as we can read the bitmap before decrypting the footer

@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}

struct ColumnChunkV3 {
/** File where column data is stored. **/
1: optional string file_path

+1 on removing this if we are doing breaking changes.

3: required i64 num_rows

/** Row groups in this file **/
4: required list<RowGroupV3> row_groups

Perhaps that's true but is this something an engine would do other than when dealing with a sampling read or limit query?

/** Nested fields. */
5: optional i32 num_children;

/** CHANGED from v1: from i32 to i64

Same. I don't think this change is needed.

@@ -467,6 +467,35 @@ struct SchemaElement {
10: optional LogicalType logicalType
}

struct SchemaElementV3 {

+1 these changes are not necessary, we should mark the relevant bits in SchemaElement deprecated instead.

@@ -835,6 +864,65 @@ struct ColumnMetaData {
16: optional SizeStatistics size_statistics;
}

struct ColumnChunkMetaDataV3 {

Similarly to SchemaElement, we can reuse the existing struct and deprecate the useless fields.


Also, don't we need an index into the list of SchemaElement to reference it?

Member Author

Unless I'm misunderstanding something, there should be one FileColumnMetadataV3 element per leaf SchemaElement.


Having an index referencing a SchemaElement means that:

  1. a writer can skip encoding columns that do not have values in a rowgroup range
  2. a writer can encode/write columns in different order than metadata

(1) is important when schemata are very wide but data is sparse.

Member Author

Well, ignoring the fact that Parquet is currently not a sparse format, your proposal implies that readers have to do a O(n) search to find a given column?


readers have to do a O(n) search to find a given column?

Why would they need an O(n) search? The index indexes an SchemaElement[] in java or std::vector<SchemaElement> in C++ which is O(1).

Member Author

Ok, let's step back because I think I am not following you.

Here is the file organization as proposed here:

  1. FileMetaDataV3 points to one FileColumnMetadataV3 per column
  2. FileColumnMetadataV3 points to one ColumnChunkV3 per row group

I'm not sure what you're proposing exactly when you mean "can skip encoding columns that do not have values"? This is currently not possible using Parquet, as at least the corresponding definition levels would be expected (even if they are all 0).

Contributor

Well, ignoring the fact that Parquet is currently not a sparse format, your proposal implies that readers have to do a O(n) search to find a given column?

IIUC, finding a column via schema elements today is also O(N), assuming no nesting. I think the difference is that today the first thing implementations do is create an efficient dictionary structure to amortize lookups of further columns.

I think if we want fast lookups without building any additional dictionaries in memory we should be considering a new stored index structure (or reconsider how we organize schema elements instead of a straight BFS).
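A minimal sketch of the amortization described above, with PyArrow's existing accessors (file name hypothetical): one O(n) pass over the schema builds a dictionary, after which each column lookup is O(1).

import pyarrow.parquet as pq

md = pq.ParquetFile("example.parquet").metadata
# One linear pass over the leaf columns, then constant-time lookups by path.
column_index = {md.schema.column(i).path: i for i in range(md.num_columns)}

def chunk_metadata(row_group: int, column_path: str):
    return md.row_group(row_group).column(column_index[column_path])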

/** REMOVED from v1: index_page_offset (unused in practice?) */

/** Compression codec **/
1: required CompressionCodec codec

This could go in the SchemaElement since it should not in principle vary between row groups.

Member Author

Well, right now it's possible to make it vary. I don't know if any writers make use of this possibility, but I'm not sure there's any harm in keeping it.

Member

Enforcing the same codec for all row groups would prohibit fast merging of row groups from different Parquet files without rewriting chunk data, so I vote for keeping it as is.


The biggest culprit in parsing metadata is Statistics, because every one of its values is a variable-length binary. We could improve Statistics trivially and in place by adding a few fixed-size fields:

struct Statistics {
   /**
    * DEPRECATED
    */
   1: optional binary max;
   2: optional binary min;
   /** count of null value in the column */
   3: optional i64 null_count;
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
   /**
    * Only one pair of min/max will be populated. For fixed sized types, one of the minN/maxN variants will be used. Otherwise min_value/max_value is used.
    */
   5: optional binary max_value;
   6: optional binary min_value;

   7: optional byte max1;
   8: optional byte min1;
   9: optional i32 max4;
   10: optional i32 min4;
   11: optional i64 max8;
   12: optional i64 min8;
}
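For illustration only, a hedged reader-side sketch of how the fixed-width variants in this strawman would be consumed; the struct below is a hypothetical C++ mirror trimmed to the INT64 case, not an existing API, and a little-endian host is assumed:

#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

// Hypothetical mirror of the strawman Statistics above, trimmed to INT64 fields.
struct StatisticsStrawman {
  std::optional<std::string> min_value;   // variable-length, PLAIN-encoded
  std::optional<int64_t> min8;            // fixed-width fast path
};

// Prefer the fixed-width field; fall back to decoding the 8-byte PLAIN value.
std::optional<int64_t> DecodeMinInt64(const StatisticsStrawman& stats) {
  if (stats.min8.has_value()) {
    return stats.min8;                    // no variable-length decoding needed
  }
  if (stats.min_value.has_value() && stats.min_value->size() == sizeof(int64_t)) {
    int64_t value = 0;
    std::memcpy(&value, stats.min_value->data(), sizeof(int64_t));  // little-endian PLAIN bytes
    return value;
  }
  return std::nullopt;
}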

Member Author

Since that structure is deprecated, do we really want to "improve" it, or should we focus our efforts on non-deprecated structures such as ColumnIndex?

More generally, this should probably be discussed in a separate ML thread, for better separation of concerns.


I wasn't aware the whole structure was deprecated. I thought only the max and min fields are deprecated.

Member

I agree this is a separate topic. I still want to say that Apache ORC has defined separate statistics for different types: https://github.com/apache/orc-format/blob/6e30e63f5071b616ec5cedaac7b2e0a98ae377c5/src/main/proto/orc/proto/orc_proto.proto#L87-L99. But the dirty part is that we might still need some sort of encoding for data types which do not have a direct representation in protobuf or thrift, e.g. the decimal type.

Member Author

Having an encoding shouldn't be a problem as long as the encoding is fast enough to decode (e.g. PLAIN), should it?

Contributor

Agreed, separate topic. I think you still need variable-length types that rely on byte_array.

Member

I agree with you. Defining specific statistics for different data types adds complexity as well.

Contributor

Would a union work here?

Member Author

I wasn't aware the whole structure was deprecated. I thought only the max and min fields are deprecated.

Ahah, sorry. I think you're right actually.

Member

I thought only the max and min fields are deprecated.

Sigh, an unrelated issue is that currently min/max might still be written even if min_value and max_value are provided.

@alkis alkis left a comment

Replied to comments. I also put a strawman in the comments for minimal changes we can do to improve FileMetadata decoding speed: #248.

Most of the bottleneck in decoding FileMetadata is variable-length fields: list<> and binary. With those removed/elided we can start reducing the pain today and get to a much better state once we have a solid design for the larger change.


@@ -835,6 +864,65 @@ struct ColumnMetaData {
16: optional SizeStatistics size_statistics;
}

struct ColumnChunkMetaDataV3 {

Having an index referencing a SchemaElement means that:

  1. a writer can skip encoding columns that do not have values in a rowgroup range
  2. a writer can encode/write columns in different order than metadata

(1) is important when schemata are very wide but data is sparse.

<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)

@emkornfield IIUC the digest is not to protect against corruption but to make sure we do not mistakenly read a V3 footer in a file without one, if we happen to see "PAR3" bytes before the V1 footer, correct?

Comment on lines +162 to +165
This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.

There are a few options:

  1. Use FileMetadata.version to introduce a new version of the metadata. Starting from the minimal change that can be done in place (DRAFT: Incremental improvements to parquet metadata #248) we can bump the version, remove columns from RowGroup, and decouple the column metadata completely.
  2. Add a binary field to FileMetadata named v3_metadata with tag number 10003. This field will encode the flatbuffer/thrift representation of the new footer and will be encoded last by thrift. Readers can manually look at the tail of the file and, if they find this field, parse these bytes only, ignoring the old-style footer altogether.

Comment on lines +985 to +987
/** NEW from v1: Byte length in file_path of ColumnChunkMetaDataV3, optionally encrypted
**/
3: required i32 metadata_file_length
Contributor

Thank you :)

The naming is a little awkward...I read these as "metadata-file offset"/"metadata-file length". Perhaps instead just "metadata_offset" and "metadata_length"?

Aside: just curious, but in the current format how do you make use of the file_offset? Is there a way to deduce the length of the metadata, or do you have to use a file-based reader and seek to the offset?

Member

BTW, do we still want to put a copy of ColumnMetaData at the end of the column chunk and another copy here at 4: optional binary encoded_metadata? I know it is good to keep backward compatibility, but does any implementation actually read it from the end of the column chunk?

/** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
7: required list<i64> file_column_metadata_offset;
/** NEW from v1: byte length of FileColumnMetadataV3, for each column **/
8: required list<i32> file_column_metadata_length;
Member

Is it too late to add something like the below for all offset/length pairs?

struct Reference {
  1: i64 offset
  2: i32 length
}

Contributor

If we're looking to speed up thrift deserialization, I'd bet two int lists are going to be faster to parse than a list of structs. If the metadata objects are contiguous, maybe instead extend the offsets list by one and use deltas to calculate the lengths.

Member Author
@pitrou pitrou May 22, 2024

I was going with the same intuition as @etseidl 's, though I don't have any precise insight into typical Thrift deserializer performance characteristics.

Mandating contiguous column metadata objects and using N+1 offsets is an intriguing idea. It could perhaps allow preloading many column metadata at once more easily.


+1 to N+1 offsets. They are going to parse a lot faster (>2x) than structs.
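A tiny sketch of the N+1-offsets variant discussed above (names are illustrative): with contiguous column-metadata blobs, a single offsets list of length n_columns + 1 replaces the two lists, and each length falls out as a delta.

#include <cstdint>
#include <vector>

struct MetadataRef {
  int64_t offset;
  int64_t length;
};

// offsets has n_columns + 1 entries; entry i is where column i's metadata
// starts, and the final entry marks the end of the last blob.
std::vector<MetadataRef> RefsFromOffsets(const std::vector<int64_t>& offsets) {
  std::vector<MetadataRef> refs;
  if (offsets.size() < 2) return refs;
  refs.reserve(offsets.size() - 1);
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    refs.push_back({offsets[i], offsets[i + 1] - offsets[i]});
  }
  return refs;
}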

8: required list<i32> file_column_metadata_length;

/** REMOVED from v1: column_orders.
** Use `FileColumnMetadataV3.column_order` instead.
Member

+1

Contributor
@emkornfield emkornfield May 27, 2024

Updated: I misread where this was being moved to; this seems fine.


/** NEW from v1: Optional key/value metadata for this column at the file level
**/
3: optional list<KeyValue> key_value_metadata
Member

Putting all user-defined metadata in a list is subject to Thrift's limitations, which is why we have to care about its size. Switching it to a separate "page" or something similar may unblock other possibilities, such as storing an extra user-defined secondary index. For now, if we want to add a custom index, we can only choose a "black hole" somewhere in the file and put its offset/length pair into the key_value_metadata.

3: optional list<SortingColumn> sorting_columns

/** REMOVED from v1: file_offset.
* Use the OffsetIndex for each column instead.
Member

IIUC, the current spec does not prohibit placing column chunks in arbitrary order, so we must use file_offset together with total_compressed_size to determine the read range. This is a trick used to place small column chunks together so that they may be fetched in a single I/O.
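A hedged sketch of that trick (illustrative only): each chunk's read range is [file_offset, file_offset + total_compressed_size), and nearby ranges can be coalesced into a single request.

#include <algorithm>
#include <cstdint>
#include <vector>

struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges whose gap is at most max_gap bytes, so small neighbouring
// column chunks are fetched with one I/O instead of many.
std::vector<ByteRange> CoalesceReads(std::vector<ByteRange> ranges, int64_t max_gap) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) { return a.offset < b.offset; });
  std::vector<ByteRange> merged;
  for (const ByteRange& r : ranges) {
    if (!merged.empty() &&
        r.offset - (merged.back().offset + merged.back().length) <= max_gap) {
      merged.back().length = std::max(merged.back().length,
                                      r.offset + r.length - merged.back().offset);
    } else {
      merged.push_back(r);
    }
  }
  return merged;
}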

2: required i64 num_rows

/** If set, specifies a sort ordering of the rows in this row group. */
3: optional list<SortingColumn> sorting_columns
Member

I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)

Yes, but it can be worked around by simply dropping the sort order metadata if row groups have inconsistent values. This is different from the codec case.


...
<Column N Chunk M + Column Metadata>

<File-level Column 1 Metadata v3>
Contributor

If we have file-level column metadata, we can optimize the storage of key_metadata for encrypted columns by keeping the column key_metadata here, instead of in the ColumnChunk structures.
Note: this is a storage-only optimization [O(N) instead of O(NxM)]; the reader processes only one key_metadata object per column [O(N)] already today.

Member Author

Assuming that each column uses a single key_metadata for all chunks, this sounds like a good idea.

## Encryption

Encryption with footer encryption enabled changes the above file structure slightly.
In particular, the "PAR1" magic number is replaced with "PARE".
Contributor
@ggershinsky ggershinsky May 22, 2024

8: optional i32 column_index_length

/** Crypto metadata of encrypted columns **/
9: optional ColumnCryptoMetaData crypto_metadata
Contributor

Not needed if we keep this in FileColumnMetadataV3.


Agreed, especially if it can't differ between individual chunks.

@@ -1165,6 +1317,62 @@ struct FileMetaData {
9: optional binary footer_signing_key_metadata
}

/** Metadata for a column in this file. */
struct FileColumnMetadataV3 {
/** All column chunks in this file (one per row group) **/
Contributor

Add "in this column" to the comment? Also, rename the field to chunks?

/** Metadata for a column in this file. */
struct FileColumnMetadataV3 {
  /** All column chunks in this column (in this file - one per row group) **/
  1: required list<ColumnChunkV3> chunks

/** NEW from v1: Optional key/value metadata for this column at the file level
**/
3: optional list<KeyValue> key_value_metadata
}
Contributor

Can add the column encryption key_metadata here:

/** Crypto metadata of encrypted column **/
  4: optional ColumnCryptoMetaData crypto_metadata

@steveloughran steveloughran left a comment

This design adds another GET/read to the footer read phase.

The offset to the v3 metadata needs to be stored at a fixed offset off the tail of the file, so a single read of the last N bytes will pull it in, just as is done for v1. Even that is inefficient as a single high-latency GET only gets back 8 bytes right now: apps should download 1+MB or more for the same cost.

Will v1 readers suffer if there is extra data at the tail of their normal metadata? If they all work today (first bit of regression testing...) then I'd actually propose a block at the end with more info, for this and later releases.

so you'd have something like

byte[8] v3_position (not offset, actual position)
byte[8] v3_length
byte[4] "PAR3"
byte[4] v1_offset
byte[4] magic ["PAR1" | "PARE"]

For this to work with existing readers, which are given bytes [len-(v1-offset), len-8] to decode, the existing readers must be OK with some trailing bytes they don't recognise. Does this hold, including for encrypted metadata?

If it does, a v3 reader could open a file, read at least the last 32 bytes, check for it being PAR3 via the two magic numbers, then, if true, go to the v3 metadata.
If it doesn't, things get worse when opening a file. It'd actually make sense to grab the last, say, 2-4 MB of the file and go from there.
This may be better anyway, which is why the abfs/gcs connectors cache footers, and the ongoing s3a prefetch work will probably go for the final 8 MB.
But do it in Parquet and it'd be consistent everywhere rather than hoping the layers underneath do the right thing.
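To make the probe concrete, here is a hedged decoding sketch of the hypothetical 28-byte tail proposed above; the layout and field names come from this comment, not from the format, and a little-endian host is assumed as elsewhere in Parquet.

#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

struct V3TailInfo {
  int64_t v3_position;   // absolute position of the v3 metadata in the file
  int64_t v3_length;
};

// tail must hold at least the last 28 bytes of the file:
//   [0..8)   v3_position     [8..16)  v3_length
//   [16..20) "PAR3"          [20..24) v1 footer length
//   [24..28) "PAR1" or "PARE"
std::optional<V3TailInfo> ProbeV3Tail(const std::string& tail) {
  if (tail.size() < 28) return std::nullopt;
  const char* p = tail.data() + tail.size() - 28;
  if (std::memcmp(p + 16, "PAR3", 4) != 0) return std::nullopt;
  if (std::memcmp(p + 24, "PAR1", 4) != 0 &&
      std::memcmp(p + 24, "PARE", 4) != 0) {
    return std::nullopt;
  }
  V3TailInfo info;
  std::memcpy(&info.v3_position, p, 8);      // little-endian assumed
  std::memcpy(&info.v3_length, p + 8, 8);
  return info;
}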

(Side issue: table indices such as Iceberg should store this footer too, or at least the key values.)

This makes me realise there's another aspect to regression testing here: samples of v3 format files must be given to existing v2 readers to verify they don't break. Are there any repos which do this yet?

@@ -81,7 +81,13 @@ more pages.
- Encoding/Compression - Page

## File format
This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.

This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read


"MUST", RFC-2119 style, maybe

@@ -107,12 +113,97 @@ start locations. More details on what is contained in the metadata can be found
in the Thrift definition.

Metadata is written after the data to allow for single pass writing.
This is especially useful when writing to backends such as S3.


It is a requirement for HDFS too; S3 is simply a beneficiary of a design requirement set at the outset, when an append-only filesystem was the target from day 1.

Comment on lines +162 to +165
This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.


I posit that many PiB of Parquet data live in HDFS, which also has read latencies, but not as bad as S3.
S3 Standard has pretty bad latency for small byte reads, plus you get billed; best to grab large amounts speculatively.

<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)


offset + length


Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The columns chunks should then be read sequentially.

![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)

### Parquet 3


People who are not involved in the Parquet 3 discussion might wonder why it is suddenly Parquet '3'.

@pitrou
Member Author

pitrou commented Jun 4, 2024

@alkis @JFinis and others, just a quick note that you've convinced me that this proposal is suboptimal for footer access latency, and something better is required. I hope we can design something that's reasonably clear and easy to implement.

@JFinis
Contributor

JFinis commented Jun 6, 2024 via email
