Commit

Merge remote-tracking branch 'origin/main'
Selfeer committed Dec 18, 2024
2 parents c8216a8 + 5d3a174 commit 3d5a45d
Showing 4 changed files with 50 additions and 11 deletions.
20 changes: 10 additions & 10 deletions parquetify/README.md
@@ -13,16 +13,16 @@

# 🌟 Features

| Feature | Description |
|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Physical Data Types:** | All physical data types: `INT32`, `INT64`, `BOOLEAN`, `FLOAT`, `DOUBLE`, `BINARY`, `FIXED_LEN_BYTE_ARRAY`. |
| **Logical Data Types:** | Most logical types (except for `FLOAT16`): `UTF8`, `DECIMAL`, `DATE`, `TIME_MILLIS`, `TIME_MICROS`, `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `ENUM`, `NONE`, `MAP`, `LIST`, `STRING`, `MAP_KEY_VALUE`, `TIME`, `INTEGER`, `JSON`, `BSON`, `UUID`, `INTERVAL`, `UINT_8`, `UINT_16`, `UINT_32`, `UINT_64`, `INT_8`, `INT_16`, `INT_32`, `INT_64` |
| **Precision & Scale:** | Precision and scale for `DECIMAL` types. |
| **Compression:** | `NONE`, `SNAPPY`, `GZIP`, `LZO`, `BROTLI`, `LZ4`, `ZSTD`. |
| **Encodings:** | Automatically set by the writer for a given column. |
| **Bloom Filter:** | Apply a bloom filter to specific columns or all columns (including those within groups). |
| **Writer Version:** | Specify writer version (`1.0`, `2.0`). |
| **Customizable Sizes:** | Row group and page sizes. |
| Feature | Description |
|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Physical Data Types:** | All physical data types: `INT32`, `INT64`, `BOOLEAN`, `FLOAT`, `DOUBLE`, `BINARY`, `FIXED_LEN_BYTE_ARRAY`. |
| **Logical Data Types:** | Most logical types: `UTF8`, `DECIMAL`, `DATE`, `TIME_MILLIS`, `TIME_MICROS`, `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `ENUM`, `NONE`, `MAP`, `LIST`, `STRING`, `MAP_KEY_VALUE`, `TIME`, `INTEGER`, `JSON`, `BSON`, `UUID`, `INTERVAL`, `UINT_8`, `UINT_16`, `UINT_32`, `UINT_64`, `INT_8`, `INT_16`, `INT_32`, `INT_64`, `FLOAT16` |
| **Precision & Scale:** | Precision and scale for `DECIMAL` types. |
| **Compression:** | `NONE`, `SNAPPY`, `GZIP`, `LZO`, `BROTLI`, `LZ4`, `ZSTD`. |
| **Encodings:** | Automatically set by the writer for a given column. |
| **Bloom Filter:** | Apply a bloom filter to specific columns or all columns (including those within groups). |
| **Writer Version:** | Specify writer version (`1.0`, `2.0`). |
| **Customizable Sizes:** | Row group and page sizes. |

---

2 changes: 1 addition & 1 deletion parquetify/pom.xml
@@ -6,7 +6,7 @@

<groupId>org.altinity.parquet.regression</groupId>
<artifactId>parquet-regression</artifactId>
-<version>1.0.9</version>
+<version>1.1.1</version>

<properties>
<maven.compiler.source>11</maven.compiler.source>
9 changes: 9 additions & 0 deletions parquetify/src/main/java/GenerateParquet.java
@@ -132,6 +132,15 @@ private static ParquetWriter<Group> createParquetWriter(String filePath, Configu
         configureEncodings(builder, encodings);
         configureBloomFilters(builder, bloomFilterOption, options);
 
+        if (options.has("extraMetaData")) {
+            JSONObject extraMetaDataJson = options.getJSONObject("extraMetaData");
+            Map<String, String> extraMetaData = new HashMap<>();
+            for (String key : extraMetaDataJson.keySet()) {
+                extraMetaData.put(key, extraMetaDataJson.getString(key));
+            }
+            builder.withExtraMetaData(extraMetaData);
+        }
+
         if (options.has("encryption")) {
             JSONObject encryptionOptions = options.getJSONObject("encryption");
             byte[] footerKey = encryptionOptions.getString("footerKey").getBytes(StandardCharsets.UTF_8);
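
Note on the hunk above: the added loop calls `getString` on every value under `extraMetaData`. Assuming `JSONObject` here is the org.json class, that call throws a `JSONException` when a value is not a JSON string (for example a number or a boolean), so the option effectively accepts string values only. A minimal, standalone sketch of a more tolerant conversion follows. `ExtraMetaDataSketch` and `toStringMap` are hypothetical names that do not exist in this repository, and the sketch takes a plain `Map` instead of a `JSONObject` so that it runs with the JDK alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of GenerateParquet.java.
final class ExtraMetaDataSketch {

    private ExtraMetaDataSketch() {
    }

    /**
     * Copies option values into a Map<String, String>, the shape passed to
     * builder.withExtraMetaData(...) in the hunk above. Non-string values
     * are stringified instead of rejected; null values are refused because
     * they have no obvious string form.
     */
    static Map<String, String> toStringMap(Map<String, ?> raw) {
        // LinkedHashMap keeps the insertion order of the input entries.
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : raw.entrySet()) {
            Object value = entry.getValue();
            if (value == null) {
                throw new IllegalArgumentException(
                        "extraMetaData value for key '" + entry.getKey() + "' is null");
            }
            out.put(entry.getKey(), String.valueOf(value));
        }
        return out;
    }
}
```

For string-only input, such as the `author`, `description`, and `createdDate` entries in `extra_metadata_entries.json`, the result is the same as with the committed loop; a numeric value such as `3` would be stored as the string `"3"`.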
30 changes: 30 additions & 0 deletions parquetify/src/schema-example/json/extra_metadata_entries.json
@@ -0,0 +1,30 @@
+{
+  "fileName": "example_with_extra_metadata.parquet",
+  "options": {
+    "writerVersion": "1.0",
+    "compression": "SNAPPY",
+    "rowGroupSize": 128,
+    "pageSize": 1024,
+    "extraMetaData": {
+      "author": "John Doe",
+      "description": "Sample Parquet file with extra metadata",
+      "createdDate": "2021-01-01"
+    }
+  },
+  "schema": [
+    {
+      "name": "id",
+      "schemaType": "required",
+      "physicalType": "INT32",
+      "logicalType": "INT8",
+      "data": [1, 2, 3, 4, 5]
+    },
+    {
+      "name": "name",
+      "schemaType": "optional",
+      "physicalType": "BINARY",
+      "logicalType": "STRING",
+      "data": ["Alice", "Bob", "Charlie", "David", "Eve"]
+    }
+  ]
+}
