GH-68: Match language from parquet-format after merge of PARQUET-2139 (
etseidl authored Jul 8, 2024
1 parent 19eb00f commit a407d81
Showing 2 changed files with 22 additions and 13 deletions.
22 changes: 11 additions & 11 deletions content/en/docs/File Format/_index.md
@@ -11,29 +11,29 @@ This file and the thrift definition should be read together to understand the fo

```
4-byte magic number "PAR1"
-<Column 1 Chunk 1 + Column Metadata>
-<Column 2 Chunk 1 + Column Metadata>
+<Column 1 Chunk 1>
+<Column 2 Chunk 1>
...
-<Column N Chunk 1 + Column Metadata>
-<Column 1 Chunk 2 + Column Metadata>
-<Column 2 Chunk 2 + Column Metadata>
+<Column N Chunk 1>
+<Column 1 Chunk 2>
+<Column 2 Chunk 2>
...
-<Column N Chunk 2 + Column Metadata>
+<Column N Chunk 2>
...
-<Column 1 Chunk M + Column Metadata>
-<Column 2 Chunk M + Column Metadata>
+<Column 1 Chunk M>
+<Column 2 Chunk M>
...
-<Column N Chunk M + Column Metadata>
+<Column N Chunk M>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"
```
In the above example, there are N columns in this table, split into M row
-groups. The file metadata contains the locations of all the column metadata
+groups. The file metadata contains the locations of all the column chunk
start locations. More details on what is contained in the metadata can be found
in the Thrift definition.

-Metadata is written after the data to allow for single pass writing.
+File metadata is written after the data to allow for single pass writing.

Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The column chunks should then be read sequentially.
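
The layout above implies that a reader can find the file metadata by working backwards from the end of the file. The following is a minimal sketch of that first step, not a reference implementation: the function name `locate_file_metadata` is hypothetical, and it assumes a seekable binary stream that uses the `PAR1` magic shown in the layout. It only locates the serialized metadata; decoding the Thrift structure is out of scope here.

```python
import io
import struct

MAGIC = b"PAR1"


def locate_file_metadata(f):
    """Return (offset, length) of the serialized file metadata.

    Hypothetical helper for illustration; `f` is a seekable binary stream.
    """
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    # Smallest possible file: leading magic + 4-byte length + trailing magic.
    if file_size < 12:
        raise ValueError("too small to be a Parquet file")
    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("missing leading magic number")
    # The last 8 bytes are the metadata length followed by the magic number.
    f.seek(file_size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing magic number")
    # The length is a 4-byte little-endian unsigned integer.
    (metadata_len,) = struct.unpack("<I", tail[:4])
    metadata_start = file_size - 8 - metadata_len
    if metadata_start < 4:
        raise ValueError("metadata length exceeds file size")
    return metadata_start, metadata_len


# Synthetic input: 12 bytes standing in for column chunks,
# 10 bytes standing in for the serialized file metadata.
blob = MAGIC + b"x" * 12 + b"m" * 10 + struct.pack("<I", 10) + MAGIC
offset, length = locate_file_metadata(io.BytesIO(blob))
print(offset, length)  # 16 10
```

Because the length field sits at a fixed distance from the end of the file, a reader needs one small read of the tail before it can fetch the metadata itself, which is what makes single pass writing compatible with efficient reads.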
13 changes: 11 additions & 2 deletions content/en/docs/File Format/metadata.md
@@ -3,8 +3,17 @@ title: "Metadata"
linkTitle: "Metadata"
weight: 5
---
-There are three types of metadata: file metadata, column (chunk) metadata and page
-header metadata. All thrift structures are serialized using the TCompactProtocol.
+There are two types of metadata: file metadata, and page header metadata.
+In the diagram below, file metadata is described by the `FileMetaData`
+structure. This file metadata provides offset and size information useful
+when navigating the Parquet file. Page header metadata (`PageHeader` and
+children in the diagram) is stored in-line with the page data, and is
+used in the reading and decoding of said data.
+
+
+All thrift structures are serialized using the TCompactProtocol. The full
+definition of these structures is given in the Parquet
+[Thrift definition](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
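
As background on the serialization mentioned above: the Thrift compact protocol writes integers as variable-length values, with zigzag encoding for signed ones, so offsets and sizes in the metadata occupy a varying number of bytes. The sketch below illustrates only those two primitives. The helper names `read_uvarint` and `zigzag_decode` are hypothetical, and this is not a full protocol decoder.

```python
def read_uvarint(buf, pos=0):
    """Decode one unsigned varint from `buf` starting at `pos`.

    Returns (value, next_pos). Raises IndexError if the input is truncated.
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data, least significant group first.
        result |= (byte & 0x7F) << shift
        # A clear high bit marks the final byte of the value.
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def zigzag_decode(n):
    """Map an unsigned zigzag value back to a signed integer."""
    return (n >> 1) ^ -(n & 1)


value, next_pos = read_uvarint(bytes([0xAC, 0x02]))
print(value, next_pos)  # 300 2
print(zigzag_decode(3))  # -2
```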


![File Layout](/images/FileFormat.gif)
