From a407d81a41a90b58ae90a6567a84dd084b5d2947 Mon Sep 17 00:00:00 2001
From: Ed Seidl
Date: Sun, 7 Jul 2024 19:25:32 -0700
Subject: [PATCH] GH-68: Match language from parquet-format after merge of PARQUET-2139 (#69)

---
 content/en/docs/File Format/_index.md   | 22 +++++++++++-----------
 content/en/docs/File Format/metadata.md | 13 +++++++++++--
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/content/en/docs/File Format/_index.md b/content/en/docs/File Format/_index.md
index 7d49ccbe..3ca8fcec 100644
--- a/content/en/docs/File Format/_index.md
+++ b/content/en/docs/File Format/_index.md
@@ -11,29 +11,29 @@ This file and the thrift definition should be read together to understand the fo
 
 ```
     4-byte magic number "PAR1"
-
-
+
+
     ...
-
-
-
+
+
+
     ...
-
+
     ...
-
-
+
+
     ...
-
+
     File Metadata
     4-byte length in bytes of file metadata (little endian)
     4-byte magic number "PAR1"
 ```
 
 In the above example, there are N columns in this table, split into M row
-groups. The file metadata contains the locations of all the column metadata
+groups. The file metadata contains the locations of all the column chunk
 start locations. More details on what is contained in the metadata can be found in the Thrift definition.
 
-Metadata is written after the data to allow for single pass writing.
+File metadata is written after the data to allow for single pass writing.
 
 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in. The columns chunks should then be read sequentially.
diff --git a/content/en/docs/File Format/metadata.md b/content/en/docs/File Format/metadata.md
index a2eae253..f86b1608 100644
--- a/content/en/docs/File Format/metadata.md
+++ b/content/en/docs/File Format/metadata.md
@@ -3,8 +3,17 @@ title: "Metadata"
 linkTitle: "Metadata"
 weight: 5
 ---
-There are three types of metadata: file metadata, column (chunk) metadata and page
-header metadata. All thrift structures are serialized using the TCompactProtocol.
+There are two types of metadata: file metadata, and page header metadata.
+In the diagram below, file metadata is described by the `FileMetaData`
+structure. This file metadata provides offset and size information useful
+when navigating the Parquet file. Page header metadata (`PageHeader` and
+children in the diagram) is stored in-line with the page data, and is
+used in the reading and decoding of said data.
+
+
+All thrift structures are serialized using the TCompactProtocol. The full
+definition of these structures is given in the Parquet
+[Thrift definition](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
 
 ![File Layout](/images/FileFormat.gif)
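
Reviewer note: the layout described in the `_index.md` hunk (leading magic, column chunks, file metadata, a 4-byte little-endian metadata length, trailing magic) implies that a reader can locate the file metadata from the last 8 bytes of the file. The sketch below illustrates only that step. The function name `read_footer_bounds` is hypothetical and not part of any Parquet library, and the sketch does not decode the Thrift `FileMetaData` itself.

```python
import io
import struct

MAGIC = b"PAR1"  # 4-byte magic number at both ends of the file


def read_footer_bounds(f):
    """Return (metadata_offset, metadata_length) for a seekable binary file.

    Hypothetical helper: it only locates the serialized file metadata,
    it does not parse it.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    # Smallest possible file: magic + 4-byte length + magic.
    if size < 12:
        raise ValueError("file too small to hold magic numbers and footer length")

    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("missing leading magic number")

    # The last 8 bytes are the metadata length followed by the magic number.
    f.seek(size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing magic number")

    # "<I" = little-endian unsigned 32-bit integer.
    (meta_len,) = struct.unpack("<I", tail[:4])
    meta_start = size - 8 - meta_len
    if meta_start < 4:
        raise ValueError("metadata length exceeds file size")
    return meta_start, meta_len


if __name__ == "__main__":
    # Synthetic bytes for illustration only; the "metadata" here is
    # placeholder padding, not a real Thrift-encoded FileMetaData.
    fake_meta = b"\x00" * 10
    blob = MAGIC + b"chunkdata" + fake_meta + struct.pack("<I", len(fake_meta)) + MAGIC
    print(read_footer_bounds(io.BytesIO(blob)))
```

Because the length sits at a fixed distance from the end of the file, a writer can stream column chunks in a single pass and append the file metadata last, which matches the "single pass writing" rationale in the patched text.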