
Conversation


@kalavt kalavt commented Dec 24, 2025


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request, I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • Parquet output support for S3 (with configurable compression)
    • Snappy and ZSTD compression options added
  • Improvements

    • AVRO encoder enabled by default
    • Kafka SASL/OAuth capabilities improved for broader availability
    • Reduced S3 multipart upload log noise
    • Better libyaml detection on macOS
  • Documentation

    • Added extensive S3 → Parquet/JSON configuration examples and best practices
  • Tests

    • New unit tests validating MsgPack→Parquet conversions and compression variants



coderabbitai bot commented Dec 24, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough


Replaces Arrow GLib-based S3 compression with a new C++ Parquet encoder (FLB_PARQUET_ENCODER), adds cmake/parquet.cmake for Arrow/Parquet detection, refactors S3 plugin to use SDS buffers and Parquet format, removes old Arrow compression code, and adds tests and examples.

Changes

**Build & CMake**: `CMakeLists.txt`, `cmake/parquet.cmake`, `cmake/kafka.cmake`, `src/CMakeLists.txt`, `src/aws/CMakeLists.txt`, `lib/avro/CMakeLists.txt`, `lib/jansson-e23f558/CMakeLists.txt`, `.github/workflows/unit-tests.yaml`, `tests/internal/CMakeLists.txt`
Add FLB_PARQUET_ENCODER option (remove FLB_ARROW), add cmake/parquet.cmake for cross-platform Arrow/Parquet discovery, wire ARROW/PARQUET libraries into builds/tests, enable FLB_AVRO_ENCODER in FLB_ALL, update Kafka SASL defaults, bump CMake minimum versions, and update the CI workflow.

**Parquet Encoder Implementation**: `include/fluent-bit/flb_parquet.h`, `src/flb_parquet_impl.cpp`
New public API flb_msgpack_raw_to_parquet() and C++ implementation: JSON schema parsing, Arrow schema/builders, MsgPack→Arrow conversion, Parquet write with GZIP/SNAPPY/ZSTD, memory handling and error reporting.

**S3 Plugin & Upload Flow**: `plugins/out_s3/s3.c`, `plugins/out_s3/s3.h`, `plugins/out_s3/s3_multipart.c`
Migrate to flb_sds_t buffers, add format and schema_str to flb_s3, support JSON/Parquet flows, change many upload-related signatures (e.g., s3_put_object), unify the upload path, and adjust multipart logging and upload defaults/macros.

**Compression / AWS integration**: `include/fluent-bit/aws/flb_aws_compress.h`, `src/aws/flb_aws_compress.c`, `plugins/out_kinesis_firehose/firehose.c`, `src/aws/CMakeLists.txt`, `src/aws/compression/*`
Remove Arrow/Parquet as compression algorithms (deprecated aliases), add the true algorithms SNAPPY and ZSTD, add a snappy wrapper, remove flb-aws-arrow targets and linking, and update the Firehose help text.

**Removed Arrow GLib implementation**: `src/aws/compression/arrow/CMakeLists.txt`, `src/aws/compression/arrow/compress.c`, `src/aws/compression/arrow/compress.h`
Delete the GLib-based Arrow/Parquet compress implementation and associated CMake targets and headers.

**Tests & Examples**: `tests/internal/parquet.c`, `tests/runtime/out_s3.c`, `examples/s3_output/*`
Add comprehensive MsgPack→Parquet unit tests, adjust runtime out_s3 tests for upload modes/size, and add S3 Parquet example configurations and README documentation.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Core as Fluent Bit Core
    participant S3 as S3 Plugin
    participant FMT as Format/Buffer Module
    participant PE as Parquet Encoder (Arrow/Parquet)
    participant S3API as AWS S3

    Core->>S3: Flush (MsgPack payload)
    S3->>S3: Determine format (json | parquet)
    alt parquet
        S3->>FMT: construct_request_buffer(new_data) -> SDS
        FMT->>PE: flb_msgpack_raw_to_parquet(MsgPack, schema, compression)
        Note right of PE: Parse schema → build Arrow arrays → write Parquet
        PE-->>FMT: Parquet buffer (ptr, size)
        FMT-->>S3: SDS buffer containing Parquet data
    else json
        S3->>FMT: construct_request_buffer(new_data) -> SDS(JSON)
        FMT-->>S3: SDS buffer (JSON)
    end
    S3->>S3: Choose upload path (PutObject | Multipart)
    S3->>S3API: Upload request (body = SDS buffer)
    S3API-->>S3: HTTP response
    S3->>Core: Flush complete
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • edsiper
  • cosmo0920
  • patrick-stephens
  • fujimotos
  • niedbalski

Poem

🐰 A tiny rabbit hops with glee,

MsgPack turns Parquet under a tree,
SDS nests keep memory light,
Arrow builders hum through the night,
Off with old GLib — hop to S3!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.75%, which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 's3_output: add parquet format support' directly reflects the main change: adding Parquet format support to the S3 output plugin. |


@kalavt kalavt force-pushed the feature/aws-s3-parquet branch from b4c0a26 to a821fec on December 24, 2025 14:01
@kalavt kalavt changed the title Refactor AWS S3 parquet s3_output: add parquet file support Dec 24, 2025
@kalavt kalavt changed the title s3_output: add parquet file support s3_output: add parquet format support Dec 24, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (10)
src/CMakeLists.txt (1)

525-528: Clarify SASL library linking scope.

The comments indicate "KAFKA SPECIFIC" for SASL library linking, but the linking appears in both the shared library and binary targets. This suggests it's a general requirement when FLB_SASL_ENABLED is true, not specific to this Parquet PR.

Consider updating the comment to be more accurate:

🔎 Suggested comment improvement
```diff
-  # KAFKA SPECIFIC - Link SASL library for shared library
+  # Link SASL library when SASL support is enabled (used by Kafka and other plugins)
   if(FLB_SASL_ENABLED)
     target_link_libraries(fluent-bit-shared sasl2)
   endif()
-  # KAFKA SPECIFIC - Link SASL library for binary
+  # Link SASL library when SASL support is enabled (used by Kafka and other plugins)
   if(FLB_SASL_ENABLED)
     target_link_libraries(fluent-bit-bin sasl2)
   endif()
```

Also applies to: 600-603

examples/s3_output/run-s3-parquet-test.sh (1)

38-39: Add error checking for fluent-bit binary and configuration.

The script assumes the Fluent Bit binary and configuration file exist without verification, which could lead to confusing errors.

🔎 Proposed pre-flight checks
```diff
+# Verify fluent-bit binary exists
+if [ ! -f ./build/bin/fluent-bit ]; then
+  echo "Error: fluent-bit binary not found at ./build/bin/fluent-bit"
+  echo "Please build Fluent Bit first:"
+  echo "  cd ../../build && cmake .. && make"
+  exit 1
+fi
+
+# Verify configuration file exists
+if [ ! -f fluent-bit-s3-parquet.conf ]; then
+  echo "Error: Configuration file fluent-bit-s3-parquet.conf not found"
+  exit 1
+fi
+
+# Check AWS credentials
+if [ -z "$AWS_ACCESS_KEY_ID" ] || [ -z "$AWS_SECRET_ACCESS_KEY" ]; then
+  echo "Warning: AWS credentials not set. Fluent Bit may fail to upload to S3."
+  echo "Set credentials with: export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=..."
+  echo ""
+fi
+
 # Run Fluent Bit
 ./build/bin/fluent-bit -c fluent-bit-s3-parquet.conf
```
examples/s3_output/README.md (1)

112-114: Consider adding language identifiers to fenced code blocks.

Static analysis detected some fenced code blocks without language specifiers; adding a language improves syntax highlighting and readability.

🔎 Minor markdown improvements

```diff
-```
+```text
 Upload Trigger = upload_timeout reached OR total_file_size reached
```

```diff
-```
+```text
 s3://s3-bucket-kafka-sink/fluent-bit-logs/
 ├── putobject/app/day=YYYYMMDD/      # Case 1: Small files via PutObject
 └── multipart/access/day=YYYYMMDD/   # Case 2: Large files via Multipart Upload
```

```diff
-```
+```text
 [parquet] Data quality summary for 53206 records:
 [parquet] Missing non-nullable fields (defaults used):
 [parquet]   field='level' count=52506
 [parquet]   field='timestamp' count=52506
```

Also applies to: 177-179, 270-272

examples/s3_output/athena-queries.sql (2)

38-45: Use placeholder bucket names in example documentation.

The hardcoded bucket name `s3-bucket-kafka-sink` appears throughout the file. Consider using a placeholder like `<your-bucket-name>` or `my-fluent-bit-bucket` to make it clearer that users should substitute their own bucket name.

---

99-99: Document the placeholder date format.

The hardcoded date `20251224` is used in all example queries. Consider adding a note at the top of the query section (around line 83) explaining that users should replace this with their actual date, similar to the note at line 346.

cmake/parquet.cmake (2)

151-163: Consider allowing default paths as a fallback.

Using `NO_DEFAULT_PATH` exclusively may miss Arrow/Parquet installations in standard system locations not explicitly listed. Consider removing `NO_DEFAULT_PATH` or adding a second search without it as a fallback to improve detection success on systems with non-standard installations.


🔎 Suggested approach

```diff
   find_path(ARROW_INCLUDE_DIR
     NAMES arrow/api.h
     PATHS ${SEARCH_PATHS}
     PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow
-    NO_DEFAULT_PATH
   )
```

Or add a fallback search without NO_DEFAULT_PATH if the first search fails.
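
For illustration, a minimal sketch of that fallback, reusing the variable names from the snippet above (ARROW_INCLUDE_DIR, SEARCH_PATHS, INCLUDE_SUFFIXES); the same pattern would apply to the library search:

```cmake
# First pass: restrict the search to the explicitly listed prefixes.
find_path(ARROW_INCLUDE_DIR
  NAMES arrow/api.h
  PATHS ${SEARCH_PATHS}
  PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow
  NO_DEFAULT_PATH
)

# Second pass: if nothing was found, let CMake consult its default paths.
if(NOT ARROW_INCLUDE_DIR)
  find_path(ARROW_INCLUDE_DIR
    NAMES arrow/api.h
    PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow
  )
endif()
```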


223-226: Unicode characters in CMake messages may cause issues.

The checkmark (✓) and cross (✗) Unicode characters in status messages may not render correctly on all terminals or CI systems. Consider using ASCII alternatives like [OK] and [MISSING].

🔎 Proposed fix
```diff
 if(ARROW_FOUND AND PARQUET_FOUND)
-  message(STATUS "✓ Arrow found: ${ARROW_LIBRARIES}")
+  message(STATUS "[OK] Arrow found: ${ARROW_LIBRARIES}")
   message(STATUS "  Include dirs: ${ARROW_INCLUDE_DIRS}")
-  message(STATUS "✓ Parquet found: ${PARQUET_LIBRARIES}")
+  message(STATUS "[OK] Parquet found: ${PARQUET_LIBRARIES}")
   message(STATUS "  Include dirs: ${PARQUET_INCLUDE_DIRS}")
 else()
-  message(STATUS "✗ Arrow/Parquet not found")
+  message(STATUS "[MISSING] Arrow/Parquet not found")
```
examples/s3_output/fluent-bit-s3-parquet.conf (1)

145-145: Very long schema_str line reduces readability.

The schema_str value is a single long JSON line that's difficult to read and maintain. Consider adding a comment above it with a formatted version of the schema for documentation purposes, or noting in the README that multi-line values would need escaping.

src/flb_parquet_impl.cpp (1)

250-251: Appending nullptr to BinaryBuilder may cause undefined behavior.

The append_default_value for BINARY type passes nullptr with size 0. While this may work, it's safer to use an empty byte array explicitly.

🔎 Proposed fix
```diff
         case arrow::Type::BINARY:
-            return static_cast<arrow::BinaryBuilder*>(builder)->Append(static_cast<const uint8_t*>(nullptr), 0).ok();
+            return static_cast<arrow::BinaryBuilder*>(builder)->Append(reinterpret_cast<const uint8_t*>(""), 0).ok();
```
plugins/out_s3/s3.c (1)

49-56: Duplicate forward declarations.

These forward declarations at lines 49-56 duplicate those at lines 86-91. Remove the duplicates to avoid confusion.

🔎 Proposed fix
```diff
-/* Forward declarations */
-static struct multipart_upload *get_upload(struct flb_s3 *ctx,
-                                           const char *tag, int tag_len);
-static struct multipart_upload *create_upload(struct flb_s3 *ctx,
-                                              const char *tag, int tag_len,
-                                              time_t file_first_log_time);
-static flb_sds_t flb_pack_msgpack_extract_log_key(void *out_context, const char *data,
-                                                  uint64_t bytes, struct flb_config *config);
-
```
-
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b0c1ae and b4c0a26.

📒 Files selected for processing (19)
  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/parquet.cmake
  • examples/s3_output/README.md
  • examples/s3_output/athena-queries.sql
  • examples/s3_output/fluent-bit-s3-parquet.conf
  • examples/s3_output/run-s3-parquet-test.sh
  • include/fluent-bit/aws/flb_aws_compress.h
  • include/fluent-bit/flb_parquet.h
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • src/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/flb_aws_compress.c
  • src/flb_parquet_impl.cpp
💤 Files with no reviewable changes (5)
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.

Applied to files:

  • examples/s3_output/README.md
  • include/fluent-bit/flb_parquet.h
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • cmake/parquet.cmake
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • src/CMakeLists.txt
  • cmake/parquet.cmake
  • CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • src/CMakeLists.txt
  • cmake/parquet.cmake
  • CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • src/CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • src/CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • src/CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • src/CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • src/CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • src/CMakeLists.txt
  • CMakeLists.txt
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.

Applied to files:

  • src/CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.

Applied to files:

  • plugins/out_s3/s3.c
🧬 Code graph analysis (3)
src/flb_parquet_impl.cpp (1)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
src/aws/flb_aws_compress.c (2)
src/flb_gzip.c (1)
  • flb_gzip_compress (157-252)
src/flb_zstd.c (1)
  • flb_zstd_compress (33-57)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
  • s3_put_object (1663-1800)
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c

[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter

(syntaxError)

🪛 LanguageTool
examples/s3_output/README.md

[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md

112-112: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (14)
include/fluent-bit/flb_parquet.h (1)

82-85: LGTM! Well-documented API with clear ownership semantics.

The function signature and documentation are comprehensive and clear. The API design properly communicates:

  • Memory ownership (caller must free with flb_free)
  • Error handling (NULL on failure)
  • Schema format expectations (Arrow JSON schema)
  • Supported types and examples
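
Based on those semantics, a minimal usage sketch (the schema string and the FLB_AWS_COMPRESS_GZIP codec choice are illustrative; see the header for the exact schema syntax):

```c
#include <fluent-bit/flb_parquet.h>
#include <fluent-bit/flb_mem.h>
#include <fluent-bit/aws/flb_aws_compress.h>

/* msgpack_buf/msgpack_size: raw msgpack records to encode */
static int encode_chunk_to_parquet(const void *msgpack_buf, size_t msgpack_size)
{
    size_t out_size = 0;
    const char *schema = "{\"fields\": [{\"name\": \"log\", \"type\": \"utf8\"}]}";
    void *parquet;

    parquet = flb_msgpack_raw_to_parquet(msgpack_buf, msgpack_size,
                                         schema, FLB_AWS_COMPRESS_GZIP,
                                         &out_size);
    if (parquet == NULL) {
        return -1;    /* conversion failed */
    }

    /* ... hand (parquet, out_size) to the upload path ... */

    flb_free(parquet);    /* caller owns and frees the buffer */
    return 0;
}
```
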
include/fluent-bit/aws/flb_aws_compress.h (2)

33-46: Clear deprecation notice and migration guidance.

The documentation clearly separates true compression algorithms from format converters and provides excellent guidance:

  • ARROW marked as deprecated with rationale
  • PARQUET (5) directs users to use format=parquet instead
  • Clear distinction between compression and format conversion

This will help users migrate away from the deprecated patterns.


30-31: No backward compatibility issue—this renumbering is safe.

The compression constants are internal implementation details used only at runtime. Configuration files use string values (compression=gzip, compression=snappy, compression=zstd) which are mapped to numeric constants via flb_aws_compression_get_type(). Since the numeric constant values are never persisted to storage and users never interact with them directly, renumbering from FLB_AWS_COMPRESS_ZSTD=4 to FLB_AWS_COMPRESS_ZSTD=3 does not break backward compatibility.

Likely an incorrect or invalid review comment.
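
For reference, a hypothetical classic-mode output section using those string keywords (property names are the ones discussed in this PR; bucket name and schema are placeholders):

```ini
[OUTPUT]
    name             s3
    match            *
    bucket           <your-bucket-name>
    region           us-east-1
    format           parquet
    # string keyword, mapped to a numeric constant at startup
    compression      zstd
    schema_str       {"fields": [{"name": "log", "type": "utf8"}]}
    total_file_size  50M
    upload_timeout   60m
```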

CMakeLists.txt (1)

284-286: LGTM! Consistent encoder enablement pattern.

When FLB_ALL is enabled, both Avro and Parquet encoders are automatically enabled, following the existing pattern for optional features. This provides a consistent user experience.

src/CMakeLists.txt (1)

157-166: LGTM! Proper C++ integration for Parquet implementation.

The Parquet encoder support is correctly integrated:

  • Source file added conditionally based on FLB_PARQUET_ENCODER
  • Explicit CXX language property set for the C++ implementation file
  • Follows the existing pattern for optional features like Avro encoder
src/aws/flb_aws_compress.c (2)

31-37: LGTM! Correct adapter pattern for Snappy compression.

The wrapper function properly adapts flb_snappy_compress (which uses char* pointers) to the standard compression interface (which uses void* pointers). This follows the same pattern as the existing flb_gzip_compress and flb_zstd_compress functions shown in the relevant code snippets.
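
For illustration, an adapter of this shape (the actual function name in the PR may differ; the flb_snappy_compress signature with char * pointers is assumed from flb_snappy.h):

```c
#include <fluent-bit/flb_snappy.h>

/* The compression_options table expects a void*-based compress callback,
 * while flb_snappy_compress() works on char* buffers, so a thin cast
 * wrapper bridges the two signatures. */
static int flb_aws_compress_snappy(void *in_data, size_t in_len,
                                   void **out_data, size_t *out_len)
{
    return flb_snappy_compress((char *) in_data, in_len,
                               (char **) out_data, out_len);
}
```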


46-76: Excellent documentation and clean separation of concerns.

The updated comments clearly distinguish:

  1. True compression algorithms (gzip, snappy, zstd)
  2. Deprecated format converters (ARROW, PARQUET)
  3. Guidance to use format=parquet instead of compression=parquet

The removal of Arrow/Parquet compression options from this module aligns with the PR's goal of decoupling format conversion from the compression subsystem.

examples/s3_output/README.md (1)

1-518: Exceptional documentation! Comprehensive and production-ready guidance.

This README provides outstanding documentation that covers:

  • Clear test scenarios with realistic use cases
  • Detailed explanation of upload triggers (upload_timeout vs total_file_size)
  • Schema design best practices with nullable/non-nullable guidance
  • Data quality monitoring and troubleshooting
  • Performance testing methodology
  • Athena integration examples

The documentation will significantly help users understand Parquet encoder usage in production scenarios.

plugins/out_s3/s3.h (1)

234-235: LGTM - API signature updated for SDS-based buffers.

The signature change from char* body to flb_sds_t body aligns with the SDS-based buffering approach used throughout the refactored code, ensuring consistent memory management.

examples/s3_output/fluent-bit-s3-parquet.conf (1)

104-117: LGTM - Well-documented PutObject configuration.

The configuration clearly documents the upload strategy, timeout behavior, and use case. The comments explaining the difference between upload_timeout and total_file_size are helpful for users.

src/flb_parquet_impl.cpp (1)

696-710: LGTM - Exception handling with proper cleanup.

The exception handlers correctly destroy the msgpack unpacked result and free the output buffer if allocated, preventing memory leaks on error paths.

plugins/out_s3/s3.c (3)

752-774: LGTM - Deprecated compression values handled gracefully.

The code properly warns users about deprecated arrow/parquet compression values while still supporting them by mapping to the new format=parquet approach. The build-time guard ensures helpful error messages when Parquet isn't available.


1583-1630: LGTM - Parquet conversion with proper error handling.

The Parquet conversion path validates schema_str is present, handles conversion failures gracefully, and properly manages memory by converting the Parquet buffer to flb_sds_t for consistent lifecycle management.


3951-3962: Storing raw msgpack improves efficiency for Parquet.

Storing raw msgpack data and deferring format conversion to upload time is the right approach for columnar formats like Parquet, which benefit from batching records together for better compression and schema inference.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/out_s3/s3.c (1)

3701-3725: Leak of chunk SDS in unit_test_flush() path.

In unit_test_flush() you now treat chunk as flb_sds_t, but never destroy it:

```c
s3_store_buffer_put(..., chunk, ...);
ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
...
ret = upload_data(..., buffer, ...);
flb_sds_destroy(buffer);
FLB_OUTPUT_RETURN(ret);
```

construct_request_buffer() only reads from chunk and does not take ownership, so unit_test_flush() should flb_sds_destroy(chunk) once that call returns. As written, every test‑mode flush leaks the SDS buffer.

Suggested minimal fix
```diff
 static void unit_test_flush(void *out_context, struct s3_file *upload_file,
                             const char *tag, int tag_len, flb_sds_t chunk,
                             int chunk_size, struct multipart_upload *m_upload_file,
                             time_t file_first_log_time)
 {
     int ret;
     flb_sds_t buffer = NULL;
     size_t buffer_size;
     struct flb_s3 *ctx = out_context;

     s3_store_buffer_put(ctx, upload_file, tag, tag_len,
                         chunk, (size_t) chunk_size, file_first_log_time);
-    ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
+    ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
+    flb_sds_destroy(chunk);
     if (ret < 0) {
         flb_plg_error(ctx->ins, "Could not construct request buffer for %s",
                       upload_file->file_path);
         FLB_OUTPUT_RETURN(FLB_RETRY);
     }

     ret = upload_data(ctx, upload_file, m_upload_file, buffer, buffer_size, tag, tag_len);
     flb_sds_destroy(buffer);

     FLB_OUTPUT_RETURN(ret);
 }
```
♻️ Duplicate comments (7)
examples/s3_output/run-s3-parquet-test.sh (1)

6-16: Logic issue: directories created then immediately cleaned.

The script creates directories (lines 8-11) and then immediately removes their contents (line 15). This sequence is confusing and redundant. If the intent is to ensure clean directories exist, reverse the order or use mkdir -p alone (which is idempotent).

🔎 Proposed fix
```diff
 # Create temporary directories
 STORE_DIR="/tmp/fluent-bit-test"
+
+# Clean old data first (if exists)
+if [ -d "$STORE_DIR" ]; then
+  rm -rf "$STORE_DIR"
+fi
+
+# Create fresh directories
 mkdir -p "$STORE_DIR/parquet-gzip-simple"
 mkdir -p "$STORE_DIR/json-snappy-complex-low"
 mkdir -p "$STORE_DIR/json-gzip-complex-medium"
 mkdir -p "$STORE_DIR/json-zstd-complex-high"
 echo "✓ Created store directories"
-
-# Clean old data
-rm -rf "$STORE_DIR"/*/
-echo "✓ Cleaned old data"
```

This also addresses the safety concern from the previous review about rm -rf on a potentially unset variable, by checking directory existence first.

plugins/out_s3/s3.h (1)

123-124: Unused apply_outer_compression field in struct flb_s3.

apply_outer_compression is not referenced anywhere in plugins/out_s3/s3.c (compression behavior is driven by ctx->format != FLB_S3_FORMAT_PARQUET instead), so this field is dead state and adds confusion. Consider removing it or wiring it into the compression path; otherwise future readers might assume it actually controls outer compression.

CMakeLists.txt (1)

202-202: Update Parquet C++ standard from C++11 to C++17.

The Parquet encoder block enables C++ and forces CMAKE_CXX_STANDARD 11, but current Apache Arrow/Parquet C++ releases require at least C++17. Keeping this at 11 will either fail the build or subtly misconfigure Arrow compilation when FLB_PARQUET_ENCODER is ON. Please bump this to 17 (and consider aligning the earlier simdutf CMAKE_CXX_STANDARD as well).

You should double‑check against the Arrow/Parquet version you target (their build docs list the minimum C++ standard) and confirm that setting CMAKE_CXX_STANDARD 17 is required when FLB_PARQUET_ENCODER is enabled.

Also applies to: 278-287, 1311-1316
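
A minimal sketch of the suggested change, raising the standard only when the Parquet encoder is enabled:

```cmake
if(FLB_PARQUET_ENCODER)
  enable_language(CXX)
  # Current Apache Arrow/Parquet C++ releases require at least C++17
  set(CMAKE_CXX_STANDARD 17)
  set(CMAKE_CXX_STANDARD_REQUIRED ON)
endif()
```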

plugins/out_s3/s3.c (1)

4210-4217: compression config map still writes a string into an int field.

The compression entry is declared as FLB_CONFIG_MAP_STR but its offset points to struct flb_s3::compression (an int):

```c
{
  FLB_CONFIG_MAP_STR, "compression", NULL,
  0, FLB_TRUE, offsetof(struct flb_s3, compression),
  ...
},
```

During flb_output_config_map_set(), this causes the parser to store a char * into an int slot, corrupting compression and adjacent fields. You now parse compression explicitly via flb_output_get_property() in cb_s3_init(), so this map binding is both unsafe and unnecessary.

Recommend either:

  • Removing the struct offset (FLB_FALSE, 0) and treating compression as a pure property, or
  • Introducing a char *compression_str field and mapping the string there.
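
A minimal sketch of the second option; `compression_str` is a hypothetical field name, and flb_aws_compression_get_type() is assumed to return -1 for unknown keywords:

```c
/* config map entry: bind the raw string to a char * member, not the int */
{
 FLB_CONFIG_MAP_STR, "compression", NULL,
 0, FLB_TRUE, offsetof(struct flb_s3, compression_str),
 "Compression type: gzip, snappy or zstd"
},

/* in cb_s3_init(): convert the string once into the numeric constant */
if (ctx->compression_str != NULL) {
    ctx->compression = flb_aws_compression_get_type(ctx->compression_str);
    if (ctx->compression == -1) {
        flb_plg_error(ctx->ins, "unknown compression type: %s",
                      ctx->compression_str);
        return -1;
    }
}
```
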
src/flb_parquet_impl.cpp (3)

138-151: Add overflow handling for uint64 → int64 in convert_to_int64().

MSGPACK_OBJECT_POSITIVE_INTEGER values are cast directly from uint64_t to int64_t:

```cpp
if (obj->type == MSGPACK_OBJECT_POSITIVE_INTEGER) {
    return builder->Append(static_cast<int64_t>(obj->via.u64));
}
```

Values greater than INT64_MAX will silently wrap, corrupting data in the Parquet file. This mirrors the earlier int32 bug you already guarded against.

You should add an overflow check similar to convert_to_int32() and either clamp, append a default, or return an error so the caller can account for it.

Please confirm how Arrow/Parquet is expected to handle out‑of‑range uint64 values (e.g., whether to clamp, treat as error, or promote to unsigned logical type) based on the versions you target, and align this converter with that behavior.
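
A minimal sketch of such a guard, assuming the converter returns the builder's arrow::Status as in the excerpt above; rejecting the value lets the caller fall back to its existing default/null handling:

```cpp
// inside convert_to_int64()
if (obj->type == MSGPACK_OBJECT_POSITIVE_INTEGER) {
    if (obj->via.u64 >
        static_cast<uint64_t>(std::numeric_limits<int64_t>::max())) {
        // would silently wrap if cast blindly; report it instead
        return arrow::Status::Invalid("uint64 value does not fit into int64");
    }
    return builder->Append(static_cast<int64_t>(obj->via.u64));
}
```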


213-226: Timestamp conversion ignores schema time unit.

convert_to_timestamp() always treats the input as seconds and multiplies by 1,000,000 to get microseconds:

```cpp
return builder->Append(static_cast<int64_t>(obj->via.u64) * 1000000);
```

However, when parsing the schema you support "unit": "s"|"ms"|"us"|"ns" and build arrow::timestamp(time_unit) accordingly. The converter never looks at that unit, so a schema declaring ms or ns will still receive values scaled as if they were seconds→µs.

You likely want to:

  • Downcast type (or the builder’s type) to arrow::TimestampType,
  • Inspect its unit(),
  • Apply the appropriate multiplier/divider so the stored values actually match the declared unit.

Double‑check the Arrow TimestampType semantics for units (SECOND, MILLI, MICRO, NANO) and ensure the scaling here matches how your upstream logs encode timestamps (seconds vs ms vs ns).

Also applies to: 386-424, 505-555
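
A sketch of unit-aware scaling, assuming the builder is the arrow::TimestampBuilder used by the converter and that incoming values are Unix epoch seconds (if the producer already emits ms/ns, the factors change accordingly):

```cpp
// inside convert_to_timestamp()
auto ts_type =
    std::static_pointer_cast<arrow::TimestampType>(builder->type());
int64_t seconds = static_cast<int64_t>(obj->via.u64);
int64_t value;

switch (ts_type->unit()) {
case arrow::TimeUnit::SECOND: value = seconds;                break;
case arrow::TimeUnit::MILLI:  value = seconds * 1000;         break;
case arrow::TimeUnit::MICRO:  value = seconds * 1000000;      break;
case arrow::TimeUnit::NANO:   value = seconds * 1000000000LL; break;
default:                      value = seconds;                break;
}
return builder->Append(value);
```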


301-321: Storing msgpack_object instances across msgpack_unpack_next() iterations may rely on msgpack zone semantics.

You collect msgpack_object instances into a std::vector and process them after the unpack loop:

```cpp
std::vector<msgpack_object> records;
...
while (msgpack_unpack_next(&result, ...) == MSGPACK_UNPACK_SUCCESS) {
    ...
    records.push_back(array.ptr[i]);    // or result.data
}
...
for (auto &record : records) {
    ...
}
```

These msgpack_objects contain pointers into the msgpack_unpacked zone. This is fine only if msgpack_unpack_next() never invalidates existing zone allocations while reusing the same msgpack_unpacked across iterations; some msgpack documentation/patterns assume you consume each object immediately.

If the zone is reused or compacted between iterations, records could end up with dangling pointers once more data is unpacked.

Please verify against the msgpack‑c docs for msgpack_unpacked and msgpack_unpack_next() whether it is safe to keep msgpack_object copies after subsequent msgpack_unpack_next() calls, or whether you must either:

  • Process each record inside the loop, or
  • Deep‑copy the record into your own storage (including map/array/string contents) before the next unpack.

Also applies to: 474-555
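
If keeping the copies turns out to be unsafe, a conservative restructuring is to convert each record while its zone data is still live, as in this sketch (process_record is a hypothetical stand-in for the per-record Arrow append logic):

```cpp
msgpack_unpacked result;
msgpack_unpacked_init(&result);

size_t offset = 0;
while (msgpack_unpack_next(&result, static_cast<const char *>(in_buf),
                           in_size, &offset) == MSGPACK_UNPACK_SUCCESS) {
    /* hand the record to the builders immediately instead of stashing
     * msgpack_object copies for a second pass */
    if (result.data.type == MSGPACK_OBJECT_ARRAY &&
        result.data.via.array.size >= 2) {
        process_record(result.data.via.array.ptr[1]);  /* hypothetical helper */
    }
}

msgpack_unpacked_destroy(&result);
```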

🧹 Nitpick comments (9)
examples/s3_output/run-s3-parquet-test.sh (1)

38-39: Relative paths may cause execution failures.

The script uses relative paths for both the binary (./build/bin/fluent-bit) and config file (fluent-bit-s3-parquet.conf). This will fail unless run from the repository root.

Consider adding a working directory check or using $SCRIPT_DIR to make paths relative to the script location.

🔎 Proposed improvement
```diff
+# Get script directory for relative paths
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
 # Run Fluent Bit
-./build/bin/fluent-bit -c fluent-bit-s3-parquet.conf
+"$REPO_ROOT/build/bin/fluent-bit" -c "$SCRIPT_DIR/fluent-bit-s3-parquet.conf"
```
examples/s3_output/athena-queries.sql (2)

38-44: Hardcoded bucket name - consider adding placeholder instructions.

The S3 bucket s3-bucket-kafka-sink is hardcoded throughout. While acceptable for examples, consider adding a comment at the top noting that users should replace this with their own bucket name.


99-101: Hardcoded date filter in example queries.

All queries use day='20251224' which matches "today's date" per the context. For a reusable example, consider using a dynamic date or adding a note about updating this value.

examples/s3_output/README.md (3)

112-114: Add language specifier to fenced code block.

Per markdownlint, fenced code blocks should have a language specified. This block shows a logical expression.

🔎 Proposed fix
```diff
-```
+```text
 Upload Trigger = upload_timeout reached OR total_file_size reached
```

---

177-181: Add language specifier to fenced code block.

Per markdownlint, fenced code blocks should have a language specified for the directory tree.

🔎 Proposed fix

```diff
-```
+```text
 s3://s3-bucket-kafka-sink/fluent-bit-logs/
 ├── putobject/app/day=YYYYMMDD/      # Case 1: Small files via PutObject
 └── multipart/access/day=YYYYMMDD/   # Case 2: Large files via Multipart Upload
```

---

270-275: Add language specifier to fenced code block.

Per markdownlint, fenced code blocks should have a language specified for log output examples.

🔎 Proposed fix

```diff
-```
+```text
 [parquet] Data quality summary for 53206 records:
 [parquet] Missing non-nullable fields (defaults used):
 [parquet]   field='level' count=52506
 [parquet]   field='timestamp' count=52506
```

cmake/parquet.cmake (1)

150-163: Consider adding a default-path fallback for Arrow discovery.

The `NO_DEFAULT_PATH` option prevents CMake from searching standard system paths. This might miss Arrow installations in non-standard but common locations not listed in `SEARCH_PATHS`. Consider removing `NO_DEFAULT_PATH` or adding a secondary search without it as a fallback.

🔎 Proposed improvement

```diff
     find_path(ARROW_INCLUDE_DIR
       NAMES arrow/api.h
       PATHS ${SEARCH_PATHS}
       PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow
-      NO_DEFAULT_PATH
     )

     find_library(ARROW_LIBRARY
       NAMES ${ARROW_LIB_NAMES}
       PATHS ${SEARCH_PATHS}
       PATH_SUFFIXES ${LIB_SUFFIXES}
-      NO_DEFAULT_PATH
     )
```

Alternatively, keep NO_DEFAULT_PATH for the first search and add a fallback search without it if the first fails.

include/fluent-bit/flb_parquet.h (2)

23-25: Consider removing unused include.

The header includes <fluent-bit/flb_sds.h> but the function signature doesn't use flb_sds_t. If this is not used elsewhere in the implementation exposed by this header, consider removing it to reduce compilation dependencies.

```bash
#!/bin/bash
# Check if flb_sds types are used in the parquet implementation or header
rg -n "flb_sds" src/flb_parquet_impl.cpp include/fluent-bit/flb_parquet.h
```

82-85: Missing extern "C" guard for C++ compatibility.

Since this header will be included by C++ code (flb_parquet_impl.cpp), consider adding extern "C" guards to ensure proper C linkage when compiled as C++.

🔎 Proposed fix
```diff
+#ifdef __cplusplus
+extern "C" {
+#endif
+
 void *flb_msgpack_raw_to_parquet(const void *in_buf, size_t in_size,
                                   const char *schema_str,
                                   int compression,
                                   size_t *out_size);

+#ifdef __cplusplus
+}
+#endif
+
 #endif
```
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4c0a26 and a821fec.

📒 Files selected for processing (19)
  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/parquet.cmake
  • examples/s3_output/README.md
  • examples/s3_output/athena-queries.sql
  • examples/s3_output/fluent-bit-s3-parquet.conf
  • examples/s3_output/run-s3-parquet-test.sh
  • include/fluent-bit/aws/flb_aws_compress.h
  • include/fluent-bit/flb_parquet.h
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • src/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/flb_aws_compress.c
  • src/flb_parquet_impl.cpp
💤 Files with no reviewable changes (5)
  • src/aws/compression/arrow/compress.h
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/unit-tests.yaml
  • examples/s3_output/fluent-bit-s3-parquet.conf
🧰 Additional context used
🧠 Learnings (15)
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.

Applied to files:

  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.

Applied to files:

  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • src/flb_parquet_impl.cpp
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.

Applied to files:

  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.

Applied to files:

  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.

Applied to files:

  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.

Applied to files:

  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
  • CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.

Applied to files:

  • src/CMakeLists.txt
  • CMakeLists.txt
  • cmake/parquet.cmake
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.

Applied to files:

  • src/CMakeLists.txt
  • CMakeLists.txt
  • cmake/parquet.cmake
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.

Applied to files:

  • src/CMakeLists.txt
  • CMakeLists.txt
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.

Applied to files:

  • include/fluent-bit/flb_parquet.h
  • examples/s3_output/README.md
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.

Applied to files:

  • plugins/out_s3/s3.c
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.

Applied to files:

  • cmake/parquet.cmake
🧬 Code graph analysis (4)
src/aws/flb_aws_compress.c (2)
src/flb_gzip.c (1)
  • flb_gzip_compress (157-252)
src/flb_zstd.c (1)
  • flb_zstd_compress (33-57)
src/flb_parquet_impl.cpp (1)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
include/fluent-bit/flb_parquet.h (1)
src/flb_parquet_impl.cpp (2)
  • flb_msgpack_raw_to_parquet (265-711)
  • flb_msgpack_raw_to_parquet (265-268)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
  • s3_put_object (1663-1800)
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c

[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter

(syntaxError)

🪛 LanguageTool
examples/s3_output/README.md

[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md

112-112: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (19)
src/aws/flb_aws_compress.c (2)

31-37: LGTM! Wrapper function correctly adapts the signature.

The wrapper properly casts between void* and char* to adapt flb_snappy_compress's signature to the uniform compression interface. This follows the same pattern used for gzip and zstd.


45-76: Good refactoring to clarify compression algorithms vs format converters.

The updated comments clearly distinguish true compression algorithms (gzip, snappy, zstd) from format converters (Arrow/Parquet). Removing ARROW/PARQUET from the runtime compression_options table while keeping deprecated macros in the header is a clean backward-compatible approach.

One minor observation: the comment on line 59 ("FLB_AWS_COMPRESS_NONE which is 0 is reserved for array footer") is now orphaned after the blank line. Consider moving it closer to the footer entry on line 77.

include/fluent-bit/aws/flb_aws_compress.h (2)

33-46: Good deprecation strategy with clear documentation.

The comments clearly explain that ARROW is deprecated and PARQUET should use format=parquet instead. Keeping the macros prevents compilation errors for any downstream code that references them.


28-31: No backward compatibility issue from ZSTD renumbering.

Compression type configuration uses string keywords ("gzip", "snappy", "zstd"), not numeric values. The numeric constants (0, 1, 2, 3, etc.) are internal implementation details used only at runtime for comparisons and branching. The flb_aws_compression_get_type() function converts configuration strings to these numeric types at startup. Since numeric values are never persisted, serialized, or exposed to external systems, renumbering internal constants has no impact on user configurations or backward compatibility.

cmake/parquet.cmake (1)

222-226: Good: Clear status reporting with emoji indicators.

The status messages with ✓ and ✗ symbols provide clear visual feedback during CMake configuration. The inclusion of library paths and include directories aids debugging.

src/CMakeLists.txt (3)

157-166: LGTM! Parquet encoder integration follows existing patterns.

The Parquet encoder source is conditionally added and marked as C++ correctly. This follows the same pattern as the AVRO encoder block above.


525-528: SASL linking additions appear correct for Kafka integration.

The SASL library is conditionally linked when FLB_SASL_ENABLED is set, properly handling both the shared library and binary targets. This follows standard CMake patterns.

Also applies to: 600-605


398-405: The code is correctly guarded. When if(FLB_PARQUET_ENCODER) at line 399 is true, it means the parent CMakeLists.txt verified that both ARROW_FOUND and PARQUET_FOUND were true (lines 1320-1328), and it would have set FLB_PARQUET_ENCODER OFF otherwise. The variables ARROW_LIBRARIES and PARQUET_LIBRARIES are set and exported with PARENT_SCOPE in cmake/parquet.cmake, ensuring they are available in the subdirectory scope.

include/fluent-bit/flb_parquet.h (1)

27-85: Excellent API documentation with comprehensive schema examples.

The documentation thoroughly covers:

  • Parameter descriptions with format specifications
  • Supported type mappings
  • Usage example with proper memory management
  • Links to Arrow documentation

This level of documentation is exemplary for a public API.
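
For readers skimming the thread, a hedged usage sketch; the parameter order and signature below are assumptions for illustration, the authoritative prototype and schema format are in include/fluent-bit/flb_parquet.h:

```c
#include <stddef.h>
#include <fluent-bit/flb_mem.h>

/* Assumed call shape only; see flb_parquet.h for the actual prototype. */
int flb_msgpack_raw_to_parquet(const void *mp_buf, size_t mp_size,
                               const char *schema_json, int compression,
                               void **out_buf, size_t *out_size);

static int encode_chunk(const void *mp_buf, size_t mp_size,
                        const char *schema_json, int codec)
{
    void *parquet_buf = NULL;
    size_t parquet_size = 0;

    if (flb_msgpack_raw_to_parquet(mp_buf, mp_size, schema_json, codec,
                                   &parquet_buf, &parquet_size) != 0) {
        return -1;
    }

    /* ... hand parquet_buf / parquet_size to the uploader ... */

    flb_free(parquet_buf);   /* the caller owns and must release the buffer */
    return 0;
}
```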

plugins/out_s3/s3.h (1)

197-205: Format fields, macros, and s3_put_object prototype look consistent.

format/schema_str fields, FLB_S3_FORMAT_JSON/PARQUET macros, and the updated s3_put_object prototype (using flb_sds_t buffer) all match their usage in plugins/out_s3/s3.c and cleanly expose JSON vs Parquet behavior.

Also applies to: 234-235

plugins/out_s3/s3.c (8)

192-199: Correctly skipping Content-Encoding for Parquet payloads.

Conditioning the Content-Encoding header on (compression == gzip|zstd) && format != FLB_S3_FORMAT_PARQUET ensures Parquet objects are not wrapped in an outer HTTP compression layer while still advertising compression for JSON payloads. This matches the “internal only” compression model for Parquet and looks good.

Also applies to: 227-239
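
A compact sketch of that gate (macro and field names as described in this review):

```c
#include <fluent-bit/aws/flb_aws_compress.h>   /* FLB_AWS_COMPRESS_* */
#include "s3.h"                                /* FLB_S3_FORMAT_*    */

/* Content-Encoding is only meaningful when an outer compression layer is
 * applied, which never happens for Parquet objects. */
static int needs_content_encoding(int compression, int format)
{
    return (compression == FLB_AWS_COMPRESS_GZIP ||
            compression == FLB_AWS_COMPRESS_ZSTD) &&
           format != FLB_S3_FORMAT_PARQUET;
}
```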


803-823: Parquet format path: good separation of internal vs outer compression, but hard‑fails without schema_str.

The Parquet branch in construct_request_buffer() correctly:

  • Avoids outer HTTP compression and uses ctx->compression only as the internal Parquet codec.
  • Requires schema_str and bails out early with a clear error if it’s missing.
  • Converts the Arrow buffer to flb_sds_t and frees the Arrow‑allocated memory.

This behavior is sound: format‑specific requirements are enforced once at conversion time, and buffer ownership is cleanly transferred to Fluent Bit.

Also applies to: 1515-1631
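
The ownership hand-off could look roughly like this (a sketch; the helper name is illustrative, the SDS and memory calls are standard Fluent Bit APIs):

```c
#include <fluent-bit/flb_sds.h>
#include <fluent-bit/flb_mem.h>

/* Copy the encoder's flb_malloc'd output into an SDS and release the
 * original, so downstream code deals with a single buffer type. */
static flb_sds_t take_ownership_as_sds(void *raw, size_t size)
{
    flb_sds_t out;

    out = flb_sds_create_len((const char *) raw, size);
    flb_free(raw);          /* encoder buffer is no longer needed */
    return out;             /* NULL on allocation failure */
}
```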


1217-1277: SDS compression path in upload_data() is generally correct, but relies solely on format to skip outer compression.

The new upload_data() signature and logic look good overall:

  • body is now flb_sds_t, with body_to_send/size_to_send correctly switched to the compressed SDS when compression succeeds.
  • The compression block is gated by ctx->compression != FLB_AWS_COMPRESS_NONE && ctx->format != FLB_S3_FORMAT_PARQUET, so Parquet payloads are never double‑compressed.
  • All error paths that allocate compressed_body also destroy it, and both PutObject and multipart paths clean up compressed_body after use.

No functional issues seen here; the behavior matches the intended “outer compression only for non‑Parquet” design.

Also applies to: 1325-1392


1442-1513: SDS buffer lifecycle around construct_request_buffer() / upload_data() is consistent (except unit tests).

  • put_all_chunks(), send_upload_request(), and cb_s3_upload() all:
    • Call construct_request_buffer() to obtain an flb_sds_t buffer.
    • Pass buffer into s3_put_object() / upload_data().
    • Always flb_sds_destroy(buffer) afterward, regardless of success/failure.

This gives a single clear owner for each formatted buffer and avoids leaks in the main upload paths.

Also applies to: 1991-2015, 3440-3488, 3701-3724


1515-1661: Error handling and locking in construct_request_buffer() look solid.

  • Handles the illegal (new_data == NULL && chunk == NULL) combination explicitly.
  • For file‑backed chunks, locks the s3_file only after a successful read, and reliably unlocks on every error path (realloc failure, Parquet schema error, conversion failure).
  • Frees the intermediate msgpack buffer in all paths (success and failure).
  • Ensures formatted_data/formatted_size are both valid before returning and destroys the SDS on conversion failure.

This is a good, consistent ownership pattern between the file store, msgpack buffer, and formatted SDS.


1663-1800: s3_put_object() SDS refactor is behavior‑preserving.

Switching the body parameter to flb_sds_t integrates cleanly:

  • MD5 calculation and HTTP request still operate on a contiguous buffer (body, body_size).
  • URI/key construction, seq_index updates, error paths, and logging remain unchanged.
  • No new ownership expectations are imposed on callers (they still manage the SDS lifetime).

3953-3963: Raw msgpack buffering in cb_s3_flush() is a good fit for Parquet.

Creating chunk as an SDS copy of the raw msgpack and deferring JSON/Parquet conversion to construct_request_buffer():

  • Avoids double conversion when retries occur.
  • Makes it easy to plug in additional formats later, since the on‑disk representation stays msgpack.
  • Keeps buffer ownership straightforward via buffer_chunk() / send_upload_request().

No issues spotted here.

Also applies to: 4036-4085
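
A sketch of the raw-copy step described above (struct and field names assumed from the usual flush-callback conventions, not quoted from the PR):

```c
#include <fluent-bit/flb_sds.h>
#include <fluent-bit/flb_event.h>   /* struct flb_event_chunk (assumed header) */

/* Keep the stored chunk as raw msgpack; JSON/Parquet conversion is deferred
 * to construct_request_buffer() right before the upload. */
static flb_sds_t copy_event_chunk(struct flb_event_chunk *event_chunk)
{
    return flb_sds_create_len((const char *) event_chunk->data,
                              event_chunk->size);
}
```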


4219-4231: New format, schema_str, upload_timeout, and bearer token options are wired correctly.

  • format is exposed as a documented string option and interpreted via flb_output_get_property("format", ...).
  • schema_str is mapped into struct flb_s3::schema_str and used by the Parquet conversion path.
  • upload_timeout is now a proper FLB_CONFIG_MAP_TIME option and is referenced consistently in timeout checks and timer setup.
  • authorization_endpoint_bearer_token follows the existing auth option patterns.

These mappings and docs align with how the fields are used later in the file.

Also applies to: 4381-4386, 4406-4414
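
For orientation, illustrative config_map entries in the usual plugin layout (descriptions and defaults here are placeholders, not the PR's exact strings):

```c
#include <stddef.h>                       /* offsetof */
#include <fluent-bit/flb_config_map.h>
#include "s3.h"

static struct flb_config_map config_map[] = {
    {
     FLB_CONFIG_MAP_STR, "format", "json",
     0, FLB_FALSE, 0,                     /* parsed manually in cb_s3_init */
     "Output serialization format: 'json' (default) or 'parquet'"
    },
    {
     FLB_CONFIG_MAP_STR, "schema_str", NULL,
     0, FLB_TRUE, offsetof(struct flb_s3, schema_str),
     "Parquet schema as a JSON string (required when format=parquet)"
    },
    /* EOF */
    {0}
};
```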

src/flb_parquet_impl.cpp (1)

265-299: Overall Parquet conversion pipeline and memory ownership look good.

  • Parameter validation, compression mapping, schema parsing (via cJSON), and Arrow schema construction are all checked with clear error logs.
  • Builders are created per field, and all failure paths free cJSON resources and destroy the msgpack zone before returning.
  • Record conversion tracks data‑quality stats (missing fields, type failures, int32 overflow, float→int clamps) and emits a concise summary only when issues exist.
  • Parquet writing uses BufferOutputStream (with auto‑growing buffer), closes the writer, then copies the final Arrow buffer into flb_malloc‑managed memory (output_buffer), which is freed on all exception paths.

The high‑level control flow and ownership model are solid; the remaining issues are mainly around numeric edge cases and msgpack object lifetime.

Also applies to: 331-446, 449-611, 616-692

@kalavt kalavt force-pushed the feature/aws-s3-parquet branch from a821fec to 5e868fc Compare December 24, 2025 14:26

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
CMakeLists.txt (1)

1311-1316: Parquet encoder should request C++17, not C++11

Arrow/Parquet C++ now require at least C++17; forcing CMAKE_CXX_STANDARD 11 in the Parquet block will fail with current Arrow toolchains or silently misconfigure builds. Please bump this to 17 (and keep it consistent with any other CMAKE_CXX_STANDARD overrides).

if(FLB_PARQUET_ENCODER)
  enable_language(CXX)
  set(CMAKE_CXX_STANDARD 17)
  include(cmake/parquet.cmake)
  ...
endif()
Verification script:

#!/bin/bash
# Verify Arrow C++ minimum standard in your environment
rg -n "CMAKE_CXX_STANDARD" CMakeLists.txt cmake -S
cmake/parquet.cmake (1)

149-157: Fix Arrow include search (include/arrow suffix still produces wrong include dir)

In the manual search:

find_path(ARROW_INCLUDE_DIR
  NAMES arrow/api.h
  PATHS ${SEARCH_PATHS}
  PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow
  NO_DEFAULT_PATH
)

using include/arrow as a suffix causes ARROW_INCLUDE_DIR to resolve to ${prefix}/include/arrow, so the later version check:

if(EXISTS "${ARROW_INCLUDE_DIRS}/arrow/util/config.h")

looks for ${prefix}/include/arrow/arrow/util/config.h, which is incorrect for standard Arrow layouts (${prefix}/include/arrow/...).

Drop the include/arrow suffix and rely on ${INCLUDE_SUFFIXES} (typically just include), e.g.:

find_path(ARROW_INCLUDE_DIR
  NAMES arrow/api.h
  PATHS ${SEARCH_PATHS}
  PATH_SUFFIXES ${INCLUDE_SUFFIXES}
  NO_DEFAULT_PATH
)

The Parquet search’s include/parquet suffix is fine and should remain.

Also applies to: 198-216

plugins/out_s3/s3.c (1)

744-775: Backward compatibility broken: compression=arrow|parquet now fails instead of mapping to format=parquet

The check at line 752 can never be true. When flb_aws_compression_get_type(tmp) is called with "arrow" or "parquet", it returns -1 because these strings are not in the compression_options table in flb_aws_compress.c (which only includes gzip, snappy, and zstd). This causes the code to hit the "Unknown compression type" error path instead of handling these deprecated aliases.

The constants FLB_AWS_COMPRESS_ARROW (4) and FLB_AWS_COMPRESS_PARQUET (5) are defined but unreachable, contradicting the documented deprecated behavior and the inline comments claiming support.

Handle these strings explicitly before calling flb_aws_compression_get_type:

tmp = flb_output_get_property("compression", ins);
if (tmp) {
    if (strcasecmp(tmp, "arrow") == 0 || strcasecmp(tmp, "parquet") == 0) {
        flb_plg_warn(ctx->ins,
                     "DEPRECATED: compression=%s is deprecated. Use format=parquet instead.", tmp);
#ifdef FLB_HAVE_PARQUET_ENCODER
        ctx->format = FLB_S3_FORMAT_PARQUET;
#else
        flb_plg_error(ctx->ins,
                      "Parquet format is not supported in this build. "
                      "Rebuild with -DFLB_PARQUET_ENCODER=On.");
        return -1;
#endif
    }
    else {
        ret = flb_aws_compression_get_type(tmp);
        if (ret == -1) {
            flb_plg_error(ctx->ins, "Unknown compression type: %s", tmp);
            return -1;
        }
        ctx->compression = ret;
    }
}
🧹 Nitpick comments (3)
plugins/out_s3/s3.c (2)

49-57: Remove duplicate static declarations for get_upload / create_upload

Both functions are declared twice (once near the top as forward declarations and again before their definitions). This is harmless but noisy and can confuse readers; you only need one forward declaration (or none if you reorder definitions).

Also applies to: 86-92


1218-1277: SDS-based outer compression flow is sound, but pre/post compression logging is now misleading

The refactor in upload_data to:

  • Use body_to_send / size_to_send (possibly compressed) while leaving the original body / body_size unchanged, and
  • Wrap the compressed payload in an flb_sds_t (compressed_body) with consistent cleanup,

looks correct and avoids raw malloc/free handling in callers.

However, the informational log in the "too small, using PutObject" branch still prints:

flb_plg_info(ctx->ins,
    "Pre-compression upload_chunk_size= %zu, After compression, chunk is only %zu bytes, ...",
    preCompress_size, body_size);

while the actual payload on the wire is size_to_send, not body_size. For Parquet, this message can also trigger even though no outer compression runs.

Consider:

  • Basing the "after compression" size on size_to_send, and
  • Gating this log on ctx->format != FLB_S3_FORMAT_PARQUET to avoid implying any outer compression there.

Also applies to: 1300-1314, 1330-1392
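
One way the adjusted log could look (a sketch; the field names follow the description above rather than the PR's exact code):

```c
#include <fluent-bit/flb_output_plugin.h>    /* flb_plg_info */
#include <fluent-bit/aws/flb_aws_compress.h>
#include "s3.h"

/* Report the size actually sent and stay silent for Parquet, where no
 * outer compression runs. */
static void log_small_chunk(struct flb_s3 *ctx,
                            size_t pre_size, size_t size_to_send)
{
    if (ctx->use_put_object == FLB_FALSE &&
        ctx->format != FLB_S3_FORMAT_PARQUET &&
        (ctx->compression == FLB_AWS_COMPRESS_GZIP ||
         ctx->compression == FLB_AWS_COMPRESS_ZSTD)) {
        flb_plg_info(ctx->ins,
                     "Pre-compression upload_chunk_size=%zu, after compression "
                     "chunk is only %zu bytes, using PutObject to upload",
                     pre_size, size_to_send);
    }
}
```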

examples/s3_output/README.md (1)

63-72: Add language identifiers to fenced code blocks for better tooling support

The README content is solid and matches the new S3/Parquet behavior, but several fenced blocks (config snippets, shell commands, JSON schemas, SQL) are missing language tags. Adding them will quiet markdownlint and improve rendering/IDE support, e.g.:

```ini
[OUTPUT]
    Name s3
    ...
```

and likewise for the shell commands (`cmake ..`, `make -j8`), JSON schema snippets (`{ "fields": [ ... ] }`), and SQL examples (`SELECT ...;`).

Also applies to: 85-95, 144-148, 152-156, 160-164, 178-181, 209-218, 241-251, 296-314, 341-349, 361-365, 384-426, 432-438, 442-456, 460-466, 484-495

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a821fec and 5e868fc.

📒 Files selected for processing (17)
  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/parquet.cmake
  • examples/s3_output/README.md
  • examples/s3_output/athena-queries.sql
  • examples/s3_output/fluent-bit-s3-parquet.conf
  • examples/s3_output/run-s3-parquet-test.sh
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • src/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/flb_aws_compress.c
💤 Files with no reviewable changes (5)
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • examples/s3_output/athena-queries.sql
  • examples/s3_output/fluent-bit-s3-parquet.conf
  • examples/s3_output/run-s3-parquet-test.sh
  • .github/workflows/unit-tests.yaml
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.

Applied to files:

  • src/CMakeLists.txt
  • cmake/parquet.cmake
  • CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.

Applied to files:

  • src/CMakeLists.txt
  • cmake/parquet.cmake
  • CMakeLists.txt
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.

Applied to files:

  • src/CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.

Applied to files:

  • src/CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.

Applied to files:

  • src/CMakeLists.txt
  • CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.

Applied to files:

  • src/CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.

Applied to files:

  • src/CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.

Applied to files:

  • src/CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.

Applied to files:

  • src/CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.

Applied to files:

  • examples/s3_output/README.md
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.

Applied to files:

  • cmake/parquet.cmake
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.

Applied to files:

  • plugins/out_s3/s3.c
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c

[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter

(syntaxError)

🪛 LanguageTool
examples/s3_output/README.md

[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md

112-112: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (11)
plugins/out_s3/s3.h (1)

123-199: New format/schema fields and S3 format macros look consistent with implementation

Appending format, schema_str, the JSON/PARQUET macros, and the updated s3_put_object prototype matches how s3.c now routes JSON vs Parquet and SDS bodies. No structural or API consistency issues here.

Also applies to: 201-204, 233-234

plugins/out_s3/s3.c (7)

192-239: Header logic correctly skips outer Content-Encoding for Parquet

The updated create_headers now omits Content-Encoding when ctx->format == FLB_S3_FORMAT_PARQUET, matching the fact that Parquet uses its own internal compression. This aligns headers with the new upload semantics.


777-823: Format selection and Parquet encoder guards are consistent with new design

The new format parsing and #ifdef FLB_HAVE_PARQUET_ENCODER checks correctly:

  • Default to JSON unless Parquet was set (including via legacy compression mapping once fixed).
  • Reject format=parquet builds that lack FLB_PARQUET_ENCODER.
  • For Parquet builds, keep ctx->compression for internal Parquet compression only and skip outer compression.

This matches the intended separation of JSON vs Parquet and internal-vs-outer compression.


1663-1800: Updated s3_put_object signature and SDS handling are consistent

The new s3_put_object taking flb_sds_t body, size_t body_size is wired correctly:

  • Callers pass SDS buffers and later destroy them (or the owning wrappers).
  • MD5 computation and HTTP request use the SDS pointer as a regular char *.
  • Error paths free uri and preserve seq_index semantics via the decrement_index label.

No memory lifetime issues are apparent around the new body type.


1991-2015: SDS buffer ownership in send_upload_request and unit_test_flush is now explicit

Both paths:

  • Create an SDS buffer via construct_request_buffer,
  • Pass it into upload_data,
  • Always flb_sds_destroy(buffer) after the call,

which removes the previous mixed ownership around raw buffers and avoids leaks in error paths.

Also applies to: 3706-3724


3953-4089: Flush path correctly preserves raw MsgPack and defers formatting

cb_s3_flush now:

  • Stores raw MsgPack into an SDS (chunk),
  • Uses s3_store_file_get/buffer_chunk to persist raw MsgPack,
  • Only converts to JSON/Parquet inside construct_request_buffer right before upload.

This is the right abstraction for supporting multiple output formats (JSON vs Parquet) without duplicating buffer logic.


4210-4231: Compression/format/schema/upload_timeout config docs largely match behavior, but note legacy compression caveat

The updated config map entries:

  • Document compression as none|gzip|snappy|zstd with legacy arrow/parquet values deprecated,
  • Introduce format and schema_str (required for format=parquet),
  • Add a documented upload_timeout parameter, and
  • Add authorization_endpoint_bearer_token,

which aligns with the new logic in cb_s3_init and the Parquet encoder path—once the legacy compression=arrow|parquet handling is fixed as noted above.

Also applies to: 4381-4413


1450-1660: Raw MsgPack buffering + deferred JSON/Parquet conversion is correctly implemented

The new construct_request_buffer / put_all_chunks flow:

  • Reads raw MsgPack from s3_store_file_read,
  • Optionally appends in-memory MsgPack (new_data) via flb_realloc + memcpy,
  • Converts once per upload to either:
    • log_key projection,
    • JSON lines, or
    • Parquet via flb_msgpack_raw_to_parquet (with schema_str and ctx->compression),
  • Properly cleans up all internal allocations (msgpack buffer, intermediate formatted data) on all error paths before returning,
  • Never sets *out_buf on error, ensuring callers receive only valid or NULL pointers, and
  • Returns an flb_sds_t which all four callers properly destroy on the success path.

Chunk locking/unlocking on error paths is balanced throughout. Memory management is sound.

src/CMakeLists.txt (1)

157-166: Parquet encoder and SASL library wiring in src/CMakeLists.txt look correct

  • flb_parquet_impl.cpp is only built when FLB_PARQUET_ENCODER is enabled and is explicitly marked as C++, matching the Arrow/Parquet dependency.
  • ARROW_LIBRARIES / PARQUET_LIBRARIES are only appended to FLB_DEPS under the same flag, consistent with cmake/parquet.cmake.
  • SASL is linked into both fluent-bit-shared and fluent-bit-bin only when FLB_SASL_ENABLED is set, so non-Kafka or non-SASL builds remain unaffected.

Also applies to: 398-405, 525-528, 600-603

include/fluent-bit/aws/flb_aws_compress.h (1)

23-47: Compression enum layout is fine; ensure callers don’t rely on ARROW/PARQUET in flb_aws_compression_get_type

The new assignments (NONE=0, GZIP=1, SNAPPY=2, ZSTD=3, ARROW=4 [deprecated], PARQUET=5 [“use format=parquet instead”]) are internally consistent and safe as long as all code uses the macros, not raw integers.

Given flb_aws_compress.c now only recognizes gzip|snappy|zstd, ARROW/PARQUET are effectively compile-time constants for legacy conditionals; callers must not expect flb_aws_compression_get_type() to ever return these values anymore. The S3 plugin currently still does (and should be fixed as noted in s3.c).

src/aws/flb_aws_compress.c (1)

24-38: SNAPPY support is wired correctly; note that ARROW/PARQUET are no longer valid compression types

  • flb_snappy_compress_wrapper correctly adapts flb_snappy_compress to the common (void **out_data, size_t *out_len) signature and is registered under "snappy" in compression_options.
  • The options table now only contains true compression algorithms (gzip/snappy/zstd); ARROW/PARQUET have been removed, so flb_aws_compression_get_type("arrow") / "parquet" will now return -1.

This matches the new separation between compression and format conversion, but requires callers (notably the S3 plugin) to treat "arrow" / "parquet" as legacy format aliases rather than compression types—see the earlier comment in s3.c.

Also applies to: 57-78

@kalavt kalavt force-pushed the feature/aws-s3-parquet branch from 5e868fc to 2555471 Compare December 24, 2025 15:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
examples/s3_output/athena-queries.sql (1)

168-181: Replace PERCENTILE with approx_percentile for Athena compatibility.

Athena does not support the PERCENTILE(duration_ms, 0.95) syntax. Use approx_percentile(duration_ms, 0.95) instead.

🔎 Proposed fix
 SELECT
     path,
     method,
     COUNT(*) as request_count,
     AVG(duration_ms) as avg_duration_ms,
-    PERCENTILE(duration_ms, 0.95) as p95_duration_ms,
+    approx_percentile(duration_ms, 0.95) as p95_duration_ms,
     MAX(duration_ms) as max_duration_ms
 FROM access_logs_parquet
cmake/parquet.cmake (1)

149-170: Remove the include/arrow suffix from Arrow header search.

The PATH_SUFFIXES include/arrow in the find_path call for Arrow headers is incorrect. When searching for arrow/api.h, CMake would look for ${prefix}/include/arrow/arrow/api.h, which does not exist in standard Arrow layouts. The correct include directory is ${prefix}/include, so that #include <arrow/api.h> resolves correctly.

🔎 Proposed fix
   # Search for Arrow
   if(NOT ARROW_FOUND)
     find_path(ARROW_INCLUDE_DIR
       NAMES arrow/api.h
       PATHS ${SEARCH_PATHS}
-      PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow
+      PATH_SUFFIXES ${INCLUDE_SUFFIXES}
       NO_DEFAULT_PATH
     )
plugins/out_s3/s3.c (1)

744-824: Legacy compression path should explicitly default to GZIP.

The comment at line 764 says "Keep default gzip compression" but ctx->compression remains FLB_AWS_COMPRESS_NONE (set at line 745). This contradicts user expectations for backward compatibility.

🔎 Suggested fix
 #ifndef FLB_HAVE_PARQUET_ENCODER
             flb_plg_error(ctx->ins,
                          "Parquet format is not supported in this build. "
                          "Rebuild with -DFLB_PARQUET_ENCODER=On.");
             return -1;
 #else
             ctx->format = FLB_S3_FORMAT_PARQUET;
-            /* Keep default gzip compression */
+            /* Default to gzip compression for backward compatibility */
+            if (ctx->compression == FLB_AWS_COMPRESS_NONE) {
+                ctx->compression = FLB_AWS_COMPRESS_GZIP;
+            }
 #endif
🧹 Nitpick comments (3)
examples/s3_output/README.md (2)

112-114: Add language specifier to fenced code block.

The code block at line 112 is missing a language specifier, which affects syntax highlighting and linting.

🔎 Proposed fix
-```
+```text
 Upload Trigger = upload_timeout reached OR total_file_size reached


177-181: Add language specifier to fenced code block.

The code block showing S3 upload paths is missing a language specifier.

🔎 Proposed fix
-```
+```text
 s3://s3-bucket-kafka-sink/fluent-bit-logs/
 ├── putobject/app/day=YYYYMMDD/      # Case 1: Small files via PutObject
 └── multipart/access/day=YYYYMMDD/   # Case 2: Large files via Multipart Upload

src/flb_parquet_impl.cpp (1)

708-736: Memory management is correct, but ownership should be documented.

The function allocates output_buffer with flb_malloc (line 710) and copies Arrow data into it (line 716). Exception handlers properly clean up on error (line 726).

Suggestion: Add a comment in the function documentation (or in the header file) explicitly stating that the caller must free the returned buffer with flb_free().
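
For illustration, the ownership note could read along these lines (the wording is a suggestion, not the PR's actual text):

```c
/*
 * flb_msgpack_raw_to_parquet() returns 0 on success and stores a newly
 * allocated buffer in *out_buf and its size in *out_size. The buffer is
 * allocated with flb_malloc() and the CALLER is responsible for releasing
 * it with flb_free(). On failure no buffer is returned and nothing needs
 * to be freed.
 */
```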

<details>
<summary>📜 Review details</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 5e868fceaf3521954dee257936901981215314ea and 2555471660ff1c95ba1231130e46af6cf36d9017.

</details>

<details>
<summary>📒 Files selected for processing (19)</summary>

* `.github/workflows/unit-tests.yaml`
* `CMakeLists.txt`
* `cmake/parquet.cmake`
* `examples/s3_output/README.md`
* `examples/s3_output/athena-queries.sql`
* `examples/s3_output/fluent-bit-s3-parquet.conf`
* `examples/s3_output/run-s3-parquet-test.sh`
* `include/fluent-bit/aws/flb_aws_compress.h`
* `include/fluent-bit/flb_parquet.h`
* `plugins/out_s3/s3.c`
* `plugins/out_s3/s3.h`
* `src/CMakeLists.txt`
* `src/aws/CMakeLists.txt`
* `src/aws/compression/CMakeLists.txt`
* `src/aws/compression/arrow/CMakeLists.txt`
* `src/aws/compression/arrow/compress.c`
* `src/aws/compression/arrow/compress.h`
* `src/aws/flb_aws_compress.c`
* `src/flb_parquet_impl.cpp`

</details>

<details>
<summary>💤 Files with no reviewable changes (5)</summary>

* src/aws/compression/arrow/compress.h
* src/aws/CMakeLists.txt
* src/aws/compression/CMakeLists.txt
* src/aws/compression/arrow/CMakeLists.txt
* src/aws/compression/arrow/compress.c

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (4)</summary>

* include/fluent-bit/flb_parquet.h
* src/aws/flb_aws_compress.c
* examples/s3_output/run-s3-parquet-test.sh
* examples/s3_output/fluent-bit-s3-parquet.conf

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🧠 Learnings (16)</summary>

<details>
<summary>📓 Common learnings</summary>

Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.


Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like #ifdef FLB_HAVE_ZSTD. Unlike some other optional components such as ARROW/PARQUET (which use #ifdef FLB_HAVE_ARROW guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.


</details>
<details>
<summary>📚 Learning: 2025-08-07T10:15:46.187Z</summary>

Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.


**Applied to files:**
- `cmake/parquet.cmake`
- `.github/workflows/unit-tests.yaml`

</details>
<details>
<summary>📚 Learning: 2025-08-31T12:46:11.940Z</summary>

Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.


**Applied to files:**
- `cmake/parquet.cmake`
- `.github/workflows/unit-tests.yaml`
- `src/CMakeLists.txt`
- `CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-08-31T12:46:11.940Z</summary>

Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is FLB_PREFER_SYSTEM_LIB_KAFKA=ON.


**Applied to files:**
- `cmake/parquet.cmake`
- `.github/workflows/unit-tests.yaml`
- `src/CMakeLists.txt`
- `CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-08-29T06:25:27.250Z</summary>

Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like #ifdef FLB_HAVE_ZSTD. Unlike some other optional components such as ARROW/PARQUET (which use #ifdef FLB_HAVE_ARROW guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.


**Applied to files:**
- `.github/workflows/unit-tests.yaml`
- `include/fluent-bit/aws/flb_aws_compress.h`
- `plugins/out_s3/s3.c`
- `src/CMakeLists.txt`
- `CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-08-29T06:24:55.855Z</summary>

Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.


**Applied to files:**
- `.github/workflows/unit-tests.yaml`
- `src/flb_parquet_impl.cpp`
- `include/fluent-bit/aws/flb_aws_compress.h`
- `plugins/out_s3/s3.c`
- `src/CMakeLists.txt`
- `CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-08-29T06:25:27.250Z</summary>

Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like #ifdef FLB_HAVE_ZSTD. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.


**Applied to files:**
- `.github/workflows/unit-tests.yaml`
- `include/fluent-bit/aws/flb_aws_compress.h`
- `plugins/out_s3/s3.c`
- `src/CMakeLists.txt`
- `CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-08-29T06:24:26.170Z</summary>

Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.


**Applied to files:**
- `.github/workflows/unit-tests.yaml`
- `include/fluent-bit/aws/flb_aws_compress.h`
- `plugins/out_s3/s3.c`
- `src/CMakeLists.txt`
- `CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-08-29T06:24:44.797Z</summary>

Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.


**Applied to files:**
- `.github/workflows/unit-tests.yaml`
- `src/CMakeLists.txt`
- `CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-09-08T11:21:33.975Z</summary>

Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.


**Applied to files:**
- `.github/workflows/unit-tests.yaml`
- `src/CMakeLists.txt`
- `CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-08-29T06:25:02.561Z</summary>

Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at lib/zstd-1.5.7 and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like FLB_HAVE_ZSTD and is always available. Headers like <fluent-bit/flb_zstd.h> can be included directly without guards.


**Applied to files:**
- `include/fluent-bit/aws/flb_aws_compress.h`
- `plugins/out_s3/s3.c`
- `src/CMakeLists.txt`
- `CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-09-04T12:35:22.872Z</summary>

Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.


**Applied to files:**
- `plugins/out_s3/s3.c`

</details>
<details>
<summary>📚 Learning: 2025-12-22T05:39:02.291Z</summary>

Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.


**Applied to files:**
- `src/CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-09-14T09:46:09.531Z</summary>

Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.


**Applied to files:**
- `src/CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-09-04T07:28:37.083Z</summary>

Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.


**Applied to files:**
- `src/CMakeLists.txt`

</details>
<details>
<summary>📚 Learning: 2025-11-21T06:23:29.770Z</summary>

Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.


**Applied to files:**
- `examples/s3_output/README.md`

</details>

</details><details>
<summary>🧬 Code graph analysis (2)</summary>

<details>
<summary>src/flb_parquet_impl.cpp (1)</summary><blockquote>

<details>
<summary>include/fluent-bit/flb_mem.h (1)</summary>

* `flb_free` (126-128)

</details>

</blockquote></details>
<details>
<summary>plugins/out_s3/s3.h (1)</summary><blockquote>

<details>
<summary>plugins/out_s3/s3.c (1)</summary>

* `s3_put_object` (1663-1800)

</details>

</blockquote></details>

</details><details>
<summary>🪛 Cppcheck (2.19.0)</summary>

<details>
<summary>plugins/out_s3/s3.c</summary>

[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter

(syntaxError)

</details>

</details>
<details>
<summary>🪛 LanguageTool</summary>

<details>
<summary>examples/s3_output/README.md</summary>

[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

---

[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout)  **Verification Steps:**  1. R...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

</details>

</details>
<details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>examples/s3_output/README.md</summary>

112-112: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

</details>

<details>
<summary>🔇 Additional comments (28)</summary><blockquote>

<details>
<summary>.github/workflows/unit-tests.yaml (2)</summary><blockquote>

`56-79`: **CI matrix configuration for Parquet encoder looks correct.**

The Parquet encoder test configuration properly:
- Uses the new `FLB_PARQUET_ENCODER` flag
- Excludes Clang compiler (consistent with other specialized build options like coverage)
- Aligns with the broader transition from Arrow GLib to C++ Arrow

---

`117-125`: **Apache Arrow setup uses correct packages and official distribution source.**

The configuration correctly:
- Uses `packages.apache.org` as the official distribution source
- Installs `libarrow-dev` and `libparquet-dev` (C++ libraries) instead of the GLib variants
- Aligns the step name and condition with the new `FLB_PARQUET_ENCODER` flag

</blockquote></details>
<details>
<summary>include/fluent-bit/aws/flb_aws_compress.h (2)</summary><blockquote>

`33-46`: **Good documentation distinguishing compression from format conversion.**

The comment block clearly explains:
- The distinction between actual compression algorithms and file format conversions
- Arrow's deprecation status with guidance not to use in new code
- That Parquet should be used via `format=parquet` rather than `compression=parquet`

This helps prevent user confusion about the difference between compression and format conversion.

---

`24-46`: The concern about backward compatibility from renumbering constants is not applicable. The compression type configuration uses string-based names ("gzip", "snappy", "zstd", "parquet") via an internal lookup table, not numeric values. The enum constants are purely internal implementation details that never get serialized, persisted, or exposed to external code. Renumbering these internal values has no backward compatibility impact.

</blockquote></details>
<details>
<summary>examples/s3_output/athena-queries.sql (1)</summary><blockquote>

`1-80`: **Well-structured Athena table definitions and query examples.**

The SQL file provides comprehensive examples covering:
- Partition projection for automatic partition discovery
- Both JSON and Parquet format table definitions
- Realistic query patterns for log analysis

This is valuable documentation for users implementing the new Parquet format support.

</blockquote></details>
<details>
<summary>examples/s3_output/README.md (1)</summary><blockquote>

`1-80`: **Excellent documentation for Parquet format support.**

The README provides comprehensive coverage of:
- Upload methods (PutObject vs Multipart)
- Format and compression options
- Schema definition and type handling with best practices
- Troubleshooting guidance
- Performance testing scenarios

This is high-quality documentation that will help users adopt the new Parquet format feature.

</blockquote></details>
<details>
<summary>plugins/out_s3/s3.h (2)</summary><blockquote>

`195-203`: **Format and schema fields properly added for Parquet support.**

The new struct members and constants are well-organized:
- `format` field for output format selection
- `schema_str` for user-defined Parquet schema
- Clear format type constants with JSON=0 as default
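
A rough sketch of those additions (names taken from the descriptions in this review; exact values and layout may differ):

```c
#include <fluent-bit/flb_sds.h>

/* Output serialization selector; JSON stays the default. */
#define FLB_S3_FORMAT_JSON     0
#define FLB_S3_FORMAT_PARQUET  1

struct flb_s3 {
    /* ... existing members ... */
    int format;              /* FLB_S3_FORMAT_JSON or FLB_S3_FORMAT_PARQUET */
    flb_sds_t schema_str;    /* user-supplied Parquet schema (JSON string)  */
};
```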

---

`233-234`: **Updated `s3_put_object` signature uses SDS buffer type.**

The function signature change from raw buffer to `flb_sds_t` aligns with the SDS-based buffer handling throughout the S3 upload flow.

</blockquote></details>
<details>
<summary>src/CMakeLists.txt (3)</summary><blockquote>

`157-166`: **Parquet encoder C++ integration correctly configured.**

The Parquet encoder integration:
- Adds `flb_parquet_impl.cpp` under `FLB_PARQUET_ENCODER` guard
- Properly marks the file as C++ using `set_source_files_properties`
- Follows the existing pattern for conditional source inclusion

---

`398-405`: **Parquet library dependencies properly linked.**

The `ARROW_LIBRARIES` and `PARQUET_LIBRARIES` are correctly added to `FLB_DEPS` when `FLB_PARQUET_ENCODER` is enabled, ensuring proper linkage with the detected Arrow/Parquet libraries from `cmake/parquet.cmake`.

---

`525-528`: **Verify SASL linking changes are intentionally part of this PR.**

The SASL library linking for Kafka support (`sasl2`) is added here but appears unrelated to the Parquet encoder feature. Please confirm this change is intentional for this PR or if it should be in a separate commit.




Also applies to: 600-605

</blockquote></details>
<details>
<summary>cmake/parquet.cmake (1)</summary><blockquote>

`1-83`: **Well-designed cross-platform library detection module.**

The parquet.cmake module provides robust library detection with:
- Three fallback methods (pkg-config → CMake config → manual search)
- Platform-specific path handling for Windows, macOS (including Apple Silicon), and Linux
- Proper handling of both shared and static library targets
- Clear status messaging

</blockquote></details>
<details>
<summary>CMakeLists.txt (3)</summary><blockquote>

`202-202`: **New `FLB_PARQUET_ENCODER` option properly defined.**

The option follows the existing naming convention for encoders (`FLB_AVRO_ENCODER`) and is disabled by default, which is appropriate for an optional feature requiring external dependencies.

---

`284-286`: **`FLB_ALL` properly enables both encoders.**

The "Enable all features" mode now consistently enables both Avro and Parquet encoders.

---

`1311-1329`: **Parquet encoder integration is well-structured.**

The configuration block:
- Enables C++ language support as required by Arrow/Parquet
- Uses C++17 standard (addressing past review comment about Arrow 10.0+ requiring C++17)
- Integrates with `cmake/parquet.cmake` for library detection
- Gracefully disables the feature with warnings when dependencies are missing

</blockquote></details>
<details>
<summary>src/flb_parquet_impl.cpp (6)</summary><blockquote>

`1-40`: **LGTM: Headers and includes are properly organized.**

The extern "C" wrapper correctly isolates C headers from C++ code, and all necessary dependencies are included.

---

`41-94`: **LGTM: Converter class design is clean and comprehensive.**

The statistics tracking for overflows and type conversions will help users identify data quality issues.

---

`139-158`: **Overflow handling correctly implemented.**

The uint64 to int64 overflow check at lines 142-146 properly addresses the previous review concern. Values exceeding INT64_MAX are clamped and tracked in statistics.
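
The clamp in question is conceptually just this (a sketch, not the PR's exact code):

```c
#include <stdint.h>
#include <stddef.h>

/* Clamp an unsigned msgpack integer into Arrow's signed int64 range and
 * count the event so it appears in the conversion summary. */
static int64_t clamp_uint64_to_int64(uint64_t v, size_t *overflow_count)
{
    if (v > (uint64_t) INT64_MAX) {
        (*overflow_count)++;
        return INT64_MAX;
    }
    return (int64_t) v;
}
```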

---

`220-257`: **Timestamp unit handling correctly implemented.**

The function now properly reads the target time unit from the builder's type (lines 221-223) and applies the appropriate multiplier. This addresses the previous review concern.

**Note:** The code assumes input timestamps are in seconds (line 225). Ensure this is documented in the public API or consider accepting a parameter for input unit.
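
Conceptually the scaling looks like this (a sketch; the real implementation reads the unit from the Arrow timestamp builder's type rather than a local enum):

```c
#include <stdint.h>

/* Stand-in for arrow::TimeUnit, used here only to keep the sketch C. */
enum ts_unit { TS_SECOND, TS_MILLI, TS_MICRO, TS_NANO };

/* Scale a seconds-based input timestamp to the schema's target unit. */
static int64_t scale_seconds(int64_t seconds, enum ts_unit unit)
{
    switch (unit) {
    case TS_MILLI:  return seconds * 1000LL;
    case TS_MICRO:  return seconds * 1000000LL;
    case TS_NANO:   return seconds * 1000000000LL;
    case TS_SECOND:
    default:        return seconds;
    }
}
```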

---

`458-547`: **Excellent: Immediate record processing avoids dangling pointer issue.**

The lambda processes each msgpack record immediately within the unpack loop (lines 469-546), which correctly addresses the past review comment about storing msgpack_object pointers that become invalid. The use of string_view for field lookup (line 493) also avoids unnecessary string copies.

---

`332-443`: **Schema parsing is robust with good error messages.**

The JSON schema parsing properly validates required fields (lines 365-377) and provides helpful error messages that include the problematic schema content (lines 337-338, 347-348). Timestamp unit parsing (lines 410-420) correctly handles all standard Arrow time units.

</blockquote></details>
<details>
<summary>plugins/out_s3/s3.c (7)</summary><blockquote>

`42-57`: **LGTM: Proper conditional compilation for Parquet support.**

The FLB_HAVE_PARQUET_ENCODER guard ensures the header is only included when Parquet support is enabled.

---

`195-229`: **Correct: Content-Encoding header skipped for Parquet.**

Lines 196-197 and 228-229 properly skip the Content-Encoding header when `format=parquet`, since Parquet handles compression internally within the file format. This avoids double-compression overhead.

---

`1244-1277`: **LGTM: Compression properly skipped for Parquet with correct memory management.**

The check at line 1245 ensures Parquet format bypasses outer compression. The compressed data is properly converted to flb_sds_t (lines 1260-1272) for consistent memory management, and cleanup is handled in all code paths (lines 1333-1335, 1362-1364, 1377-1379, 1388-1391).

---

`1582-1630`: **Parquet conversion properly integrated with good error handling.**

The Parquet conversion block (lines 1582-1630) correctly:
- Validates schema_str is provided (lines 1588-1596)
- Calls the Parquet encoder with compression setting (lines 1598-1603)
- Converts the result to flb_sds_t for consistent memory management (lines 1615-1627)
- Frees the intermediate parquet_buffer (line 1617)
- Handles all error paths with proper cleanup




Verify that the schema_str requirement is documented in user-facing documentation and error messages are clear enough for users to understand the schema format.

---

`1663-1664`: **LGTM: Signature change to flb_sds_t improves consistency.**

The s3_put_object signature change to accept flb_sds_t (line 1664) aligns with the overall refactoring for consistent SDS-based buffer management.

---

`3953-3962`: **Excellent architectural decision with clear documentation.**

The comment at lines 3953-3956 clearly explains the design choice to store raw msgpack and defer format conversion until upload. This is particularly beneficial for Parquet, which can batch multiple records into optimized columnar storage.

---

`4210-4231`: **Config map correctly defined with comprehensive documentation.**

The configuration parameters are properly defined:
- `compression` and `format` use offset 0 (lines 4211, 4220), indicating manual parsing in cb_s3_init, which is correct
- `schema_str` is mapped to `ctx->schema_str` (line 4228)
- Descriptions clearly document supported values, defaults, and usage (lines 4212-4217, 4221-4225, 4229-4231)

**Note:** The past review comment about compression type mismatch is not applicable here since offset 0 means no automatic field mapping.

</blockquote></details>

</blockquote></details>

</details>


@kalavt kalavt force-pushed the feature/aws-s3-parquet branch from 664cb4e to eff671c Compare December 24, 2025 16:00

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
plugins/out_s3/s3.c (2)

1217-1324: Use size_to_send (compressed size) for routing decisions and logging

Inside upload_data(), after optional compression you introduce:

flb_sds_t body_to_send = body;
size_t size_to_send = body_size;
...
preCompress_size = body_size;
body_to_send = compressed_body;
size_to_send = temp_size;

but the subsequent logic still uses the original body_size for decisions and logging:

  • Timeout/size checks for whether to goto put_object vs multipart.
  • The “Pre-compression ... After compression ...” log still uses body_size for the “after” value.
  • Existing upload path compares body_size to MIN_CHUNKED_UPLOAD_SIZE.

With compression enabled, this means routing decisions and logs are based on the uncompressed size, which diverges from the actual payload size now stored in size_to_send. It also breaks the intent of the informational log (both values end up identical).

Switching those uses of body_size to size_to_send preserves the old semantics (decisions based on the payload actually sent) and fixes the log.

Suggested diff
-        else if (body_size >= ctx->file_size) {
+        else if (size_to_send >= ctx->file_size) {
@@
-        else if(body_size > MIN_CHUNKED_UPLOAD_SIZE) {
+        else if(size_to_send > MIN_CHUNKED_UPLOAD_SIZE) {
@@
-        else {
-            if ((ctx->use_put_object == FLB_FALSE && (ctx->compression == FLB_AWS_COMPRESS_GZIP || ctx->compression == FLB_AWS_COMPRESS_ZSTD))) {
-                flb_plg_info(ctx->ins, "Pre-compression upload_chunk_size= %zu, After compression, chunk is only %zu bytes, "
-                                       "the chunk was too small, using PutObject to upload", preCompress_size, body_size);
-            }
+        else {
+            if (ctx->use_put_object == FLB_FALSE &&
+                (ctx->compression == FLB_AWS_COMPRESS_GZIP ||
+                 ctx->compression == FLB_AWS_COMPRESS_ZSTD)) {
+                flb_plg_info(ctx->ins,
+                             "Pre-compression upload_chunk_size= %zu, "
+                             "After compression, chunk is only %zu bytes, "
+                             "the chunk was too small, using PutObject to upload",
+                             preCompress_size, size_to_send);
+            }
@@
-        if (body_size < MIN_CHUNKED_UPLOAD_SIZE) {
+        if (size_to_send < MIN_CHUNKED_UPLOAD_SIZE) {

1450-1500: SDS lifetime is mostly correct; fix leak in unit_test_flush

The SDS refactor is generally consistent:

  • construct_request_buffer() now always returns flb_sds_t and callers destroy buffers with flb_sds_destroy().
  • send_upload_request(), put_all_chunks(), and cb_s3_upload() correctly treat the returned buffer as an SDS and free it once upload_data() completes.
  • upload_data() cleans up the optional compressed_body SDS in all exit paths.

One leak remains in unit_test_flush():

s3_store_buffer_put(ctx, upload_file, tag, tag_len,
                    chunk, (size_t) chunk_size, file_first_log_time);
ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
...
ret = upload_data(ctx, upload_file, m_upload_file, buffer, buffer_size, tag, tag_len);
/* buffer destroyed, but chunk is never freed */

Unlike buffer_chunk() and other call sites, chunk is never destroyed here, so every unit‑test flush leaks one SDS buffer when FLB_S3_PLUGIN_UNDER_TEST is set.

You can mirror buffer_chunk() and free chunk immediately after construct_request_buffer():

Suggested diff
-    s3_store_buffer_put(ctx, upload_file, tag, tag_len,
-                        chunk, (size_t) chunk_size, file_first_log_time);
-    ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
-    if (ret < 0) {
+    s3_store_buffer_put(ctx, upload_file, tag, tag_len,
+                        chunk, (size_t) chunk_size, file_first_log_time);
+    ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
+    flb_sds_destroy(chunk);
+    if (ret < 0) {
         flb_plg_error(ctx->ins, "Could not construct request buffer for %s",
                       upload_file->file_path);
         FLB_OUTPUT_RETURN(FLB_RETRY);
     }

Also applies to: 1991-2015, 3701-3724

🧹 Nitpick comments (9)
examples/s3_output/run-s3-parquet-test.sh (2)

1-5: Consider adding shell script safety options.

For test scripts, adding safety options helps catch errors early:

🔎 Suggested improvement
 #!/bin/bash
+set -euo pipefail
 
 echo "=== Fluent Bit S3 Comprehensive Test ==="
 echo ""

37-39: Cleanup section appears incomplete.

The script echoes "Cleaning up..." but doesn't perform any actual cleanup. If cleanup is intentionally omitted (e.g., to allow inspection of test artifacts), consider adding a comment explaining this. Otherwise, add the cleanup logic:

🔎 Suggested completion
 # Cleanup
 echo ""
 echo "Cleaning up..."
+# Uncomment below to remove test artifacts after run
+# rm -rf "$STORE_DIR"
+echo "Test artifacts preserved in $STORE_DIR for inspection"
examples/s3_output/README.md (3)

112-114: Add language specifier to fenced code block.

Per markdownlint MD040, fenced code blocks should have a language specified for proper syntax highlighting.

🔎 Suggested fix
-```
+```text
 Upload Trigger = upload_timeout reached OR total_file_size reached

---

`177-181`: **Add language specifier to fenced code block.**

🔎 Suggested fix

-```
+```text
 s3://s3-bucket-kafka-sink/fluent-bit-logs/
 ├── putobject/app/day=YYYYMMDD/      # Case 1: Small files via PutObject
 └── multipart/access/day=YYYYMMDD/   # Case 2: Large files via Multipart Upload

---

`270-275`: **Add language specifier to fenced code block.**

🔎 Suggested fix

-```
+```text
 [parquet] Data quality summary for 53206 records:
 [parquet] Missing non-nullable fields (defaults used):
 [parquet]   field='level' count=52506
 [parquet]   field='timestamp' count=52506

cmake/parquet.cmake (1)

`149-171`: **Manual include-path search is fine but extra `parquet` suffix is probably unnecessary**

The manual search for Parquet uses:

PATH_SUFFIXES ${INCLUDE_SUFFIXES} parquet

Given INCLUDE_SUFFIXES is already include, this will check both ${prefix}/include/parquet/... (via include + the parquet/api/... name) and ${prefix}/parquet/..., which is a bit unusual. It’s not harmful, but simplifying to just ${INCLUDE_SUFFIXES} would avoid probing odd ${prefix}/parquet trees and be more in line with the Arrow search just above.

Also applies to: 173-195

CMakeLists.txt (1)

679-687: Consider unifying the C++ standard to avoid C++11/C++17 mix

The simdutf block sets CMAKE_CXX_STANDARD 11, while the Parquet block later sets it to 17. CMake will honor the last assignment (17 when Parquet is enabled), but this mix of settings is slightly confusing and could surprise future refactors.

You might want to:

  • Drop the explicit set(CMAKE_CXX_STANDARD 11) in the simdutf block, or
  • Set it to 17 there as well when Parquet support is enabled.

Functionally it works today; this is mainly about clarity and avoiding subtle build surprises.

Also applies to: 1311-1316

plugins/out_s3/s3.c (2)

42-57: Avoid duplicate forward declarations for get_upload / create_upload

get_upload and create_upload are forward‑declared twice (once in the new “Forward declarations” block and again just before their definitions). This is harmless but unnecessary noise.

You can keep the new declaration for flb_pack_msgpack_extract_log_key and drop the duplicated get_upload / create_upload prototypes to reduce clutter.

Also applies to: 86-91


744-823: Format/compression config flow is solid; consider early schema_str validation

The three‑step flow in cb_s3_init:

  • Defaults compression to NONE, parses the compression property, and treats arrow|parquet as legacy values that map to format=parquet (with compile‑time checks around FLB_HAVE_PARQUET_ENCODER).
  • Adds an explicit format option (json/parquet) that overrides the legacy compression mapping.
  • Validates that format=parquet is not allowed when Parquet support isn’t compiled in.

This is a good cleanup and keeps legacy configs working.

However, schema_str is only validated later inside construct_request_buffer() and flb_msgpack_raw_to_parquet, which means a misconfigured format=parquet without schema_str will only surface on first flush, failing every chunk.

To fail fast and give clearer feedback, you could add an init‑time check like:

#ifdef FLB_HAVE_PARQUET_ENCODER
if (ctx->format == FLB_S3_FORMAT_PARQUET && ctx->schema_str == NULL) {
    flb_plg_error(ctx->ins,
                  "format=parquet requires 'schema_str' to be set");
    return -1;
}
#endif

right after the format/validation block.

Also applies to: 4210-4231

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 664cb4e and eff671c.

📒 Files selected for processing (18)
  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/parquet.cmake
  • examples/s3_output/README.md
  • examples/s3_output/athena-queries.sql
  • examples/s3_output/fluent-bit-s3-parquet.conf
  • examples/s3_output/run-s3-parquet-test.sh
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_kinesis_firehose/firehose.c
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • src/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/flb_aws_compress.c
💤 Files with no reviewable changes (5)
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/s3_output/athena-queries.sql
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • cmake/parquet.cmake
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/parquet.cmake
  • src/CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/parquet.cmake
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • plugins/out_kinesis_firehose/firehose.c
  • src/CMakeLists.txt
  • plugins/out_s3/s3.c
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • plugins/out_kinesis_firehose/firehose.c
  • src/CMakeLists.txt
  • plugins/out_s3/s3.c
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • plugins/out_kinesis_firehose/firehose.c
  • src/CMakeLists.txt
  • plugins/out_s3/s3.c
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • plugins/out_kinesis_firehose/firehose.c
  • src/CMakeLists.txt
  • plugins/out_s3/s3.c
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • src/CMakeLists.txt
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.

Applied to files:

  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • src/CMakeLists.txt
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.

Applied to files:

  • examples/s3_output/README.md
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.

Applied to files:

  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
  • src/CMakeLists.txt
  • plugins/out_s3/s3.c
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.

Applied to files:

  • plugins/out_s3/s3.c
🧬 Code graph analysis (2)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
  • s3_put_object (1663-1800)
plugins/out_s3/s3.c (6)
src/flb_output.c (1)
  • flb_output_get_property (1108-1111)
src/aws/flb_aws_compress.c (2)
  • flb_aws_compression_get_type (80-97)
  • flb_aws_compression_compress (99-116)
src/flb_sds.c (2)
  • flb_sds_create_len (58-76)
  • flb_sds_destroy (389-399)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
plugins/out_s3/s3_multipart.c (1)
  • upload_part (685-799)
src/flb_pack.c (1)
  • flb_pack_msgpack_to_json_format (1169-1450)
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c

[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter

(syntaxError)

🪛 LanguageTool
examples/s3_output/README.md

[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md

112-112: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (18)
plugins/out_kinesis_firehose/firehose.c (1)

496-501: LGTM! Documentation updated to reflect new compression options.

The help text correctly lists the now-supported compression algorithms: gzip, snappy, and zstd. This aligns with the changes in flb_aws_compress.c that add Snappy support and remove Arrow/Parquet compression options.

src/aws/flb_aws_compress.c (2)

31-37: LGTM! Snappy wrapper correctly adapts the function signature.

The wrapper properly casts void * to char * as required by flb_snappy_compress, and the return value is passed through directly.
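The pattern under review is essentially the following (sketch; the real wrapper lives in src/aws/flb_aws_compress.c):

/* Adapt flb_snappy_compress(), which works on char pointers, to the
 * void-pointer callback signature used by the compression options table. */
static int flb_snappy_compress_wrapper(void *in_data, size_t in_len,
                                       void **out_data, size_t *out_len)
{
    return flb_snappy_compress((char *) in_data, in_len,
                               (char **) out_data, out_len);
}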


57-78: Well-structured separation of compression algorithms from format converters.

The updated compression_options table cleanly includes only true compression algorithms (gzip, snappy, zstd), with clear documentation explaining that Arrow/Parquet are format converters handled separately via the format parameter.

include/fluent-bit/aws/flb_aws_compress.h (1)

24-46: Well-documented deprecation and backward compatibility approach.

The header cleanly separates compression algorithms from format converters, retains deprecated values for backward compatibility, and provides clear guidance to use format=parquet instead of compression=parquet. Based on learnings, this follows the established pattern where Arrow/Parquet use conditional compilation guards unlike ZSTD which is always available.
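For reference, the value layout these constants follow (mirrored by tests/internal/parquet.c) is:

/* Compression algorithms applied to the output payload */
#define FLB_AWS_COMPRESS_NONE    0
#define FLB_AWS_COMPRESS_GZIP    1
#define FLB_AWS_COMPRESS_SNAPPY  2
#define FLB_AWS_COMPRESS_ZSTD    3

/* 'arrow' and 'parquet' remain only as deprecated aliases that the S3
 * plugin maps to format=parquet at init time (sketch of the intent, not a
 * verbatim copy of the header). */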

examples/s3_output/fluent-bit-s3-parquet.conf (2)

143-145: Parquet schema correctly matches the access log data structure.

The schema_str defines fields that align with the dummy access log inputs: timestamp, level, message, method, path, status, duration_ms, client_ip, and user_agent. Using int64 for timestamp and status, and float64 for duration_ms, is an appropriate choice for avoiding overflow issues, as noted in the README.


1-117: Comprehensive and well-documented test configuration.

The configuration provides excellent coverage of both upload scenarios (PutObject vs Multipart) with realistic traffic patterns. The inline comments clearly explain the rationale for each setting, making this a valuable reference for users.

.github/workflows/unit-tests.yaml (2)

56-57: LGTM! Build flag renamed to reflect Parquet encoder focus.

The change from -DFLB_ARROW=On to -DFLB_PARQUET_ENCODER=On correctly reflects the shift from Arrow-based compression to Parquet encoder support.


117-125: Apache Arrow C++ package names are correct.

The packages libarrow-dev and libparquet-dev are the correct C++ development packages for Apache Arrow and Parquet, providing the required headers and libraries for the flb_parquet_impl.cpp implementation.

plugins/out_s3/s3.h (3)

196-199: New Parquet format fields properly integrated into struct.

The format and schema_str fields provide clean support for runtime format selection and user-defined Parquet schemas. The placement at the end of the struct with clear section comments improves readability.


201-203: LGTM! Format type macros clearly defined.

The FLB_S3_FORMAT_JSON (0) and FLB_S3_FORMAT_PARQUET (1) macros provide a clean enum-like interface for format selection.
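Taken together, the additions described in these two comments amount to roughly the following (a sketch based on the descriptions above; consult plugins/out_s3/s3.h for the exact member types):

/* Output format selection for the S3 plugin */
#define FLB_S3_FORMAT_JSON     0
#define FLB_S3_FORMAT_PARQUET  1

struct flb_s3 {
    /* ... existing members ... */

    /* Parquet output support */
    int format;             /* FLB_S3_FORMAT_JSON or FLB_S3_FORMAT_PARQUET */
    flb_sds_t schema_str;   /* user-provided schema; exact type assumed here */
};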


233-235: Function signature change verified as consistent across the codebase.

All s3_put_object call sites (s3.c:1330 and s3.c:1494) have been properly updated to use the new flb_sds_t buffer parameter. The declaration in s3.h and implementation in s3.c both align with the new signature.

src/CMakeLists.txt (2)

157-166: Parquet encoder source and dependency wiring look consistent

flb_parquet_impl.cpp is only added when FLB_PARQUET_ENCODER is enabled and is explicitly marked as C++ via set_source_files_properties, with Arrow/Parquet libs added to FLB_DEPS under the same flag. This keeps Parquet support fully optional and avoids impacting non-Parquet builds.

Also applies to: 398-405


525-528: SASL linkage for Kafka/MSK is correctly scoped to SASL-enabled builds

Conditionally linking sasl2 into fluent-bit-shared and fluent-bit-bin under FLB_SASL_ENABLED matches how Kafka/MSK SASL is detected and avoids dragging SASL into builds that don’t need it.

Also applies to: 600-606

CMakeLists.txt (1)

200-202: Parquet encoder option and C++17 enabling are wired correctly

FLB_PARQUET_ENCODER is now a dedicated option, enabled by FLB_ALL, and the Parquet block:

  • Calls enable_language(CXX) and sets CMAKE_CXX_STANDARD 17 when the encoder is requested.
  • Delegates Arrow/Parquet discovery to cmake/parquet.cmake.
  • Only defines FLB_HAVE_PARQUET_ENCODER and adds include directories when both Arrow and Parquet are found, otherwise it warns and flips FLB_PARQUET_ENCODER back to OFF.

This keeps Parquet strictly opt‑in and prevents partial configuration when dependencies are missing.

Also applies to: 278-287, 1311-1329

plugins/out_s3/s3.c (3)

97-121: Header handling for compression and Parquet looks correct

The new zstd Content-Encoding header and the guards in create_headers() to skip Content-Encoding when ctx->format == FLB_S3_FORMAT_PARQUET align with the intended behavior:

  • GZIP/ZSTD still get the appropriate header for JSON output.
  • Parquet relies on its internal codec only, without an outer content‑encoding header.

Also applies to: 183-239


1516-1661: SDS-based request/flush pipeline is a nice cleanup

The move to flb_sds_t for all request buffers (construct_request_buffer(), upload_data(), send_upload_request(), put_all_chunks(), cb_s3_upload(), cb_s3_flush()) significantly simplifies memory management:

  • All conversions (JSON / log_key / Parquet) produce SDS and are freed in a single place per call chain.
  • construct_request_buffer() now cleanly owns and frees the intermediate msgpack buffer and unlocks chunks on all error paths.
  • Callers always destroy the returned SDS (buffer) after upload_data()/s3_put_object() regardless of success or failure, avoiding mixed free() vs flb_sds_destroy() semantics.

This should make future format additions (or tweaks to Parquet encoding) much less error‑prone.

Also applies to: 1943-2015, 3440-3488, 3953-4068


4210-4231: Config map docs for compression/format/schema/upload_timeout/auth look coherent

The updated config map entries:

  • Clarify that compression supports none|gzip|snappy|zstd and that arrow|parquet are deprecated legacy values.
  • Introduce an explicit format option (json / parquet) and schema_str for Parquet, with docs that match the runtime behavior.
  • Document upload_timeout semantics more clearly and add authorization_endpoint_bearer_token.

These descriptions match the new logic in cb_s3_init() and construct_request_buffer().

Also applies to: 4382-4413

cmake/parquet.cmake (1)

249-258: No action needed—the URLs in the code already match the official Apache Arrow documentation.

The URLs currently use https://packages.apache.org/artifactory/arrow/…, which is the correct format per official Apache Arrow installation guides. The /artifactory/arrow/ segment is not obsolete; it is the official repository path for both APT and RPM packages. Removing it would break the URLs.

Likely an incorrect or invalid review comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
examples/s3_output/README.md (1)

112-114: Consider adding language specifiers to fenced code blocks.

Several code blocks in this documentation lack language specifiers, which can affect syntax highlighting. Consider adding appropriate language identifiers (e.g., text, ini, json) for better readability.

tests/internal/parquet.c (1)

27-31: Consider including the header instead of redefining compression constants.

These compression type constants are duplicated from flb_aws_compress.h. If the values change in the source header, these tests will silently use incorrect values.

🔎 Proposed fix
-/* Compression types from flb_aws_compress.h */
-#define FLB_AWS_COMPRESS_NONE    0
-#define FLB_AWS_COMPRESS_GZIP    1
-#define FLB_AWS_COMPRESS_SNAPPY  2
-#define FLB_AWS_COMPRESS_ZSTD    3
+#include <fluent-bit/aws/flb_aws_compress.h>

If the header cannot be included due to dependencies, consider adding a comment noting that these values must be kept in sync with the source file.

src/flb_parquet_impl.cpp (2)

178-197: Unnecessary try/catch block around strtol.

The try/catch block on lines 195-197 is unnecessary since strtol is a C function that does not throw exceptions. The errno and endptr checks already handle all error cases. The same pattern appears in other string parsing functions.

🔎 Proposed fix
         else if (obj->type == MSGPACK_OBJECT_STR) {
-            try {
                 std::string str(obj->via.str.ptr, obj->via.str.size);
                 char* endptr;
                 errno = 0;
                 long val = std::strtol(str.c_str(), &endptr, 10);

                 if (endptr == str.c_str() || *endptr != '\0' || errno == ERANGE) {
                     return arrow::Status::Invalid("Cannot parse string to int32");
                 }

                 if (val > INT32_MAX || val < INT32_MIN) {
                     int32_overflow_by_field[field_name]++;
                     return builder->Append(val > INT32_MAX ? INT32_MAX : INT32_MIN);
                 }

                 string_parsed_to_int_by_field[field_name]++;
                 return builder->Append(static_cast<int32_t>(val));
-            } catch (...) {
-                return arrow::Status::Invalid("Cannot parse string to int32");
-            }
         }

Note: The std::string constructor can throw std::bad_alloc, so if you want to be defensive against memory allocation failures, you could keep a narrower catch or let it propagate to the outer try/catch in flb_msgpack_raw_to_parquet.


619-622: Prefer ValueUnsafe() or check status before ValueOrDie().

ValueOrDie() will terminate the program if the status is not OK, but here the status is already checked on line 618. Using ValueUnsafe() or MoveValueUnsafe() would be more semantically correct since you've already verified success.

🔎 Proposed fix
         if (!schema_result.ok()) {
             flb_error("[parquet] Failed to parse schema: %s\n", schema_result.status().ToString().c_str());
             return NULL;
         }
-        auto schema = schema_result.ValueOrDie();
+        auto schema = std::move(schema_result).ValueUnsafe();

This pattern should be applied to other ValueOrDie() calls where the status is already checked (lines 635, 859, 877, 899, 922).

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eff671c and c99a15c.

📒 Files selected for processing (8)
  • CMakeLists.txt
  • cmake/kafka.cmake
  • examples/s3_output/README.md
  • lib/avro/CMakeLists.txt
  • lib/jansson-e23f558/CMakeLists.txt
  • src/flb_parquet_impl.cpp
  • tests/internal/CMakeLists.txt
  • tests/internal/parquet.c
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.

Applied to files:

  • examples/s3_output/README.md
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.

Applied to files:

  • CMakeLists.txt
  • cmake/kafka.cmake
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.

Applied to files:

  • CMakeLists.txt
  • cmake/kafka.cmake
  • src/flb_parquet_impl.cpp
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.

Applied to files:

  • CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.

Applied to files:

  • CMakeLists.txt
  • cmake/kafka.cmake
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.

Applied to files:

  • CMakeLists.txt
  • cmake/kafka.cmake
  • src/flb_parquet_impl.cpp
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.

Applied to files:

  • CMakeLists.txt
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.

Applied to files:

  • CMakeLists.txt
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.

Applied to files:

  • CMakeLists.txt
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.

Applied to files:

  • CMakeLists.txt
🧬 Code graph analysis (1)
tests/internal/parquet.c (1)
src/flb_parquet_impl.cpp (2)
  • flb_msgpack_raw_to_parquet (579-952)
  • flb_msgpack_raw_to_parquet (579-582)
🪛 LanguageTool
examples/s3_output/README.md

[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~737-~737: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md

112-112: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


286-286: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


294-294: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


305-305: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


315-315: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


328-328: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


344-344: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


362-362: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


436-436: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


447-447: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


542-542: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (10)
lib/avro/CMakeLists.txt (1)

19-19: CMake version bump aligns with project-wide requirements.

The minimum CMake version increase to 3.12 is consistent with modern CMake practices and aligns with the Parquet/Arrow integration requirements introduced in this PR.

lib/jansson-e23f558/CMakeLists.txt (1)

567-569: Test target disabled with documented rationale.

Commenting out the check target to avoid conflicts with other libraries is acceptable. The comment explains the rationale clearly.

cmake/kafka.cmake (1)

35-42: SASL configuration logic is well-structured.

The changes correctly enable built-in SASL features (PLAIN/SCRAM/OAUTHBEARER) unconditionally while making Cyrus-SASL optional for GSSAPI/Kerberos. The comments clearly explain the rationale for each setting.

tests/internal/parquet.c (1)

74-103: Comprehensive test coverage with good structure.

The test suite is well-organized into logical categories with good coverage of edge cases, type conversions, and error handling. The consistent pattern of setup/execute/verify/cleanup is followed throughout.
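An individual case in that suite looks roughly like the sketch below (the encoder signature, the schema string format, and the packing helpers are assumptions for illustration; it also relies on the test file's usual includes such as msgpack.h, fluent-bit/flb_parquet.h, and the unit test harness):

static void test_parquet_basic_string_field(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer pck;
    void *out_buf = NULL;
    size_t out_size = 0;

    /* setup: pack a single {"level": "info"} record */
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pck, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_map(&pck, 1);
    msgpack_pack_str(&pck, 5);
    msgpack_pack_str_body(&pck, "level", 5);
    msgpack_pack_str(&pck, 4);
    msgpack_pack_str_body(&pck, "info", 4);

    /* execute: convert to Parquet without outer compression
     * (signature and schema format assumed for illustration) */
    out_buf = flb_msgpack_raw_to_parquet(sbuf.data, sbuf.size,
                                         "{\"fields\": [{\"name\": \"level\", \"type\": \"string\"}]}",
                                         FLB_AWS_COMPRESS_NONE, &out_size);

    /* verify: a non-empty Parquet buffer was produced */
    TEST_CHECK(out_buf != NULL);
    TEST_CHECK(out_size > 0);

    /* cleanup */
    flb_free(out_buf);
    msgpack_sbuffer_destroy(&sbuf);
}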

CMakeLists.txt (2)

284-287: Good documentation for FLB_PARQUET_ENCODER exclusion from FLB_ALL.

The comment explaining that Parquet encoder requires explicit enabling due to external dependencies is helpful for users and maintainers.


1337-1355: Parquet encoder integration is well-structured.

The Parquet encoder block properly:

  • Enables C++ with C++17 standard (required for Arrow/Parquet)
  • Delegates discovery to cmake/parquet.cmake, which comprehensively handles detection across platforms with three fallback methods (pkg-config, CMake Config, manual search)
  • Gracefully disables the feature with clear warnings if dependencies are not found

The cmake/parquet.cmake module correctly exports all required variables (ARROW_FOUND, PARQUET_FOUND, ARROW_INCLUDE_DIRS, PARQUET_INCLUDE_DIRS, ARROW_LIBRARIES, PARQUET_LIBRARIES) to the parent scope, with robust platform-specific library search paths, version validation, and helpful installation instructions.

src/flb_parquet_impl.cpp (4)

39-42: Forward declaration approach is reasonable.

The forward declaration of flb_msgpack_to_json_str to avoid C++ compilation conflicts with mpack.h is a pragmatic solution. Consider adding a brief comment explaining why the full header cannot be included.


650-727: Record processing logic is well-designed.

The on-the-fly processing approach addresses the previous review comment about dangling pointers. The lambda properly handles:

  • Non-map records by appending nulls
  • Missing fields with nullable vs non-nullable handling
  • Conversion failures with appropriate fallbacks
  • Statistics tracking for data quality monitoring

939-951: Exception handling properly cleans up resources.

The catch blocks correctly free output_buffer if it was allocated before the exception. Note that output_buffer is only assigned on line 926, so the null check is appropriate; since the variable is already initialized to nullptr on line 586, the cleanup path stays well-defined even when an exception is thrown before allocation.


863-865: RecordBatch creation uses correct row count.

The record_count variable correctly tracks the number of processed records and is used when creating the RecordBatch, ensuring consistency between the arrays and the batch metadata.

@kalavt kalavt force-pushed the feature/aws-s3-parquet branch 2 times, most recently from a409e9f to 37a602a on December 25, 2025 07:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
examples/s3_output/README.md (1)

486-486: Typo already flagged in previous review.

This issue was previously identified: "2020-203c" should be "2020-2030".

tests/internal/CMakeLists.txt (1)

136-141: Missing Parquet library linkage will cause linker errors.

The Parquet test file is correctly added to UNIT_TESTS_FILES, but the prepare_unit_tests function (lines 234-236) only links Avro libraries when FLB_AVRO_ENCODER is enabled. There's no corresponding linkage for ARROW_LIBRARIES and PARQUET_LIBRARIES when FLB_PARQUET_ENCODER is enabled.

🔎 Proposed fix in prepare_unit_tests function

Add the following after line 236:

       if(FLB_AVRO_ENCODER)
         target_link_libraries(${source_file_we} avro-static jansson)
       endif()
+
+      if(FLB_PARQUET_ENCODER)
+        target_link_libraries(${source_file_we} ${ARROW_LIBRARIES} ${PARQUET_LIBRARIES})
+      endif()

       add_test(NAME ${source_file_we}
🧹 Nitpick comments (5)
examples/s3_output/README.md (2)

465-465: Fix spacing in time reference.

Line 465 uses "60min" which should be "60-minute" or "60 minutes" for consistency with similar references elsewhere in the document (e.g., "5 minutes" at line 127).

🔎 Proposed fix
- **Fallback**: `upload_timeout=60m` (acts as safety net)
+ **Fallback**: `upload_timeout=60m` (acts as safety net for 60 minutes)

112-112: Add language identifiers to fenced code blocks.

Markdown linting expects language specifications on all code blocks for syntax highlighting and consistency. Add language identifiers to the three blocks without them:

  • Line 112: Plain-text block showing a formula/logic
  • Line 177: S3 file path tree structure
  • Line 270: Log output example
🔎 Proposed fixes

At line 112, specify a language (or use plain):

-```
+```text
 Upload Trigger = upload_timeout reached OR total_file_size reached
-```
+```

At line 177, specify a language for the tree structure:

-```
+```text
 s3://s3-bucket-kafka-sink/fluent-bit-logs/
 ├── putobject/app/day=YYYYMMDD/      # Case 1: Small files via PutObject
 └── multipart/access/day=YYYYMMDD/   # Case 2: Large files via Multipart Upload
-```
+```

At line 270, specify a language for log output:

-```
+```log
 [parquet] Data quality summary for 53206 records:
 [parquet] Missing non-nullable fields (defaults used):
 [parquet]   field='level' count=52506
 [parquet]   field='timestamp' count=52506
-```
+```

Also applies to: 177-177, 270-270

include/fluent-bit/flb_parquet.h (1)

23-25: Consider removing unused includes.

The flb_sds.h include appears unnecessary since the API uses void* and size_t rather than flb_sds_t. The msgpack.h include is also not required in the public header since input is typed as const void*. Only stdlib.h (for size_t) is strictly needed.

Removing unused includes reduces header dependencies and compile times.

🔎 Proposed fix
-#include <fluent-bit/flb_sds.h>
-#include <msgpack.h>
 #include <stdlib.h>
cmake/parquet.cmake (1)

222-228: Unicode symbols may not render on all terminals.

The checkmarks (✓, ✗) in status messages are visually helpful but may not display correctly on terminals without UTF-8 support or in CI logs. Consider using ASCII alternatives like [OK] / [FAIL] for broader compatibility.

🔎 Alternative using ASCII
 if(ARROW_FOUND AND PARQUET_FOUND)
-  message(STATUS "✓ Arrow found: ${ARROW_LIBRARIES}")
+  message(STATUS "[OK] Arrow found: ${ARROW_LIBRARIES}")
   message(STATUS "  Include dirs: ${ARROW_INCLUDE_DIRS}")
-  message(STATUS "✓ Parquet found: ${PARQUET_LIBRARIES}")
+  message(STATUS "[OK] Parquet found: ${PARQUET_LIBRARIES}")
   message(STATUS "  Include dirs: ${PARQUET_INCLUDE_DIRS}")
 else()
-  message(STATUS "✗ Arrow/Parquet not found")
+  message(STATUS "[NOT FOUND] Arrow/Parquet not found")
src/flb_parquet_impl.cpp (1)

51-61: Consider documenting the buffer size parameter.

The 256 parameter in flb_msgpack_to_json_str appears to be an initial buffer size hint. Adding a brief comment or using a named constant would improve code clarity.

🔎 Suggested improvement
 std::string msgpack_object_to_json_string(const msgpack_object* obj) {
     /* Use Fluent Bit's built-in msgpack to JSON converter */
-    char *json_str = flb_msgpack_to_json_str(256, obj, FLB_FALSE);
+    /* Initial buffer size hint for JSON conversion */
+    char *json_str = flb_msgpack_to_json_str(256, obj, FLB_FALSE);
     if (!json_str) {
         return "{}";
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c99a15c and a409e9f.

📒 Files selected for processing (25)
  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/kafka.cmake
  • cmake/parquet.cmake
  • examples/s3_output/README.md
  • examples/s3_output/athena-queries.sql
  • examples/s3_output/fluent-bit-s3-parquet.conf
  • examples/s3_output/run-s3-parquet-test.sh
  • include/fluent-bit/aws/flb_aws_compress.h
  • include/fluent-bit/flb_parquet.h
  • lib/avro/CMakeLists.txt
  • lib/jansson-e23f558/CMakeLists.txt
  • plugins/out_kinesis_firehose/firehose.c
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • src/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/flb_aws_compress.c
  • src/flb_parquet_impl.cpp
  • tests/internal/CMakeLists.txt
  • tests/internal/parquet.c
💤 Files with no reviewable changes (5)
  • src/aws/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/compression/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (6)
  • examples/s3_output/fluent-bit-s3-parquet.conf
  • .github/workflows/unit-tests.yaml
  • examples/s3_output/athena-queries.sql
  • plugins/out_kinesis_firehose/firehose.c
  • src/CMakeLists.txt
  • examples/s3_output/run-s3-parquet-test.sh
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.

Applied to files:

  • include/fluent-bit/flb_parquet.h
  • examples/s3_output/README.md
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.

Applied to files:

  • src/aws/flb_aws_compress.c
  • include/fluent-bit/aws/flb_aws_compress.h
  • cmake/kafka.cmake
  • plugins/out_s3/s3.c
  • CMakeLists.txt
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.

Applied to files:

  • src/aws/flb_aws_compress.c
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/flb_parquet_impl.cpp
  • plugins/out_s3/s3.c
  • CMakeLists.txt
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.

Applied to files:

  • src/aws/flb_aws_compress.c
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.

Applied to files:

  • src/aws/flb_aws_compress.c
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • CMakeLists.txt
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.

Applied to files:

  • src/aws/flb_aws_compress.c
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • CMakeLists.txt
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.

Applied to files:

  • src/aws/flb_aws_compress.c
  • CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.

Applied to files:

  • cmake/kafka.cmake
  • CMakeLists.txt
  • cmake/parquet.cmake
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.

Applied to files:

  • cmake/kafka.cmake
  • CMakeLists.txt
  • cmake/parquet.cmake
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.

Applied to files:

  • plugins/out_s3/s3.c
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.

Applied to files:

  • CMakeLists.txt
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.

Applied to files:

  • cmake/parquet.cmake
🧬 Code graph analysis (6)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
  • s3_put_object (1663-1800)
include/fluent-bit/flb_parquet.h (1)
src/flb_parquet_impl.cpp (2)
  • flb_msgpack_raw_to_parquet (579-952)
  • flb_msgpack_raw_to_parquet (579-582)
src/aws/flb_aws_compress.c (2)
src/flb_gzip.c (1)
  • flb_gzip_compress (157-252)
src/flb_zstd.c (1)
  • flb_zstd_compress (33-57)
src/flb_parquet_impl.cpp (1)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
plugins/out_s3/s3.c (5)
src/aws/flb_aws_compress.c (2)
  • flb_aws_compression_get_type (80-97)
  • flb_aws_compression_compress (99-116)
src/flb_sds.c (2)
  • flb_sds_create_len (58-76)
  • flb_sds_destroy (389-399)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
plugins/out_s3/s3_multipart.c (1)
  • upload_part (685-799)
src/flb_pack.c (1)
  • flb_pack_msgpack_to_json_format (1169-1450)
tests/internal/parquet.c (2)
src/flb_parquet_impl.cpp (2)
  • flb_msgpack_raw_to_parquet (579-952)
  • flb_msgpack_raw_to_parquet (579-582)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c

[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter

(syntaxError)

🪛 LanguageTool
examples/s3_output/README.md

[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md

112-112: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (23)
examples/s3_output/README.md (1)

1-512: Excellent comprehensive documentation for S3 Parquet testing.

This new README provides a thorough, well-structured guide covering realistic test scenarios (two input sources, two output configurations), upload trigger mechanics, schema design best practices, data quality considerations, and Athena integration. The pedagogical tone helps users understand why decisions matter (e.g., nullable vs non-nullable fields, PutObject vs Multipart). The inclusion of expected behavior verification steps and troubleshooting guidance adds practical value. This fulfills the docs-required label requirement.

src/aws/flb_aws_compress.c (3)

31-37: LGTM!

The wrapper function correctly adapts the flb_snappy_compress signature to match the generic compression interface. The pointer casts are safe and the return value is properly propagated.
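For illustration, an adapter of this shape does the job (a minimal sketch with a hypothetical name; the actual wrapper is flb_snappy_compress_wrapper in src/aws/flb_aws_compress.c, and the void*-based generic signature is assumed from the gzip/zstd entries):

#include <fluent-bit/flb_snappy.h>

/* Sketch: the generic compression table uses void* buffers, while
 * flb_snappy_compress() works on char*. The casts only reinterpret the
 * pointer type; the byte layout is identical, so this is safe. */
static int snappy_compress_adapter(void *in_data, size_t in_len,
                                   void **out_data, size_t *out_len)
{
    return flb_snappy_compress((char *) in_data, in_len,
                               (char **) out_data, out_len);
}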


66-70: The FLB_AWS_COMPRESS_SNAPPY constant is properly defined.

The constant is defined in include/fluent-bit/aws/flb_aws_compress.h:30 with value 2, and the flb_snappy_compress_wrapper function is correctly defined in src/aws/flb_aws_compress.c:32. The array entry is correctly integrated.


26-26: No action needed. The include statement is correct as written.

Snappy is a core bundled dependency (located at lib/snappy-fef67ac/) built unconditionally in CMakeLists.txt. Unlike Arrow/Parquet which required #ifdef FLB_HAVE_ARROW guards as optional components, Snappy has no FLB_HAVE_SNAPPY conditional guards anywhere in the codebase and is included unconditionally across 12 existing files (http_server, flb_snappy.c, multiple plugins, etc.). The include at line 26 follows the established pattern.

lib/avro/CMakeLists.txt (1)

19-19: CMake version bump looks appropriate.

Bumping to CMake 3.12 aligns with the broader build system changes in this PR and enables modern CMake features. This is a reasonable minimum version for 2024+ projects.

lib/jansson-e23f558/CMakeLists.txt (2)

1-1: CMake version bump consistent with the PR.

Aligns with the 3.12 minimum version used across other CMake files in this PR.


567-569: Reasonable approach to avoid target name collisions.

Disabling the check target prevents conflicts with other libraries (e.g., nghttp2) that may define the same target name. The comment adequately explains the rationale.

plugins/out_s3/s3.h (2)

195-203: Clean addition of format support.

The new format field and schema_str pointer cleanly separate output format concerns from compression. The FLB_S3_FORMAT_JSON / FLB_S3_FORMAT_PARQUET constants follow a clear naming convention consistent with Fluent Bit patterns.


233-234: Function prototype aligns with SDS-based buffer handling.

The use of flb_sds_t buffer is consistent with Fluent Bit's string handling patterns. Note that the parameter names differ slightly from the implementation in s3.c (create_time vs file_first_log_time, buffer vs body), but this is cosmetic and doesn't affect functionality.

include/fluent-bit/flb_parquet.h (1)

27-85: Excellent documentation and clean API design.

The header provides comprehensive documentation including:

  • Clear parameter descriptions
  • Schema format specification with supported types
  • Usage examples
  • Memory ownership semantics (caller frees with flb_free)

This level of documentation is valuable for a public API.
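As a rough illustration of how a caller might drive this API (a sketch only — the parameter order, the 0-on-success return value, and the schema keys shown here are assumptions; the authoritative signature and schema grammar live in flb_parquet.h):

#include <msgpack.h>
#include <fluent-bit/flb_parquet.h>
#include <fluent-bit/flb_mem.h>
#include <fluent-bit/aws/flb_aws_compress.h>

/* Illustrative schema string; the real grammar is documented in the header */
static const char *schema =
    "{\"fields\": ["
    "  {\"name\": \"message\", \"type\": \"string\"},"
    "  {\"name\": \"code\",    \"type\": \"int64\"}"
    "]}";

static int encode_one_record(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer  pck;
    void  *out = NULL;
    size_t out_size = 0;
    int    ret;

    /* Pack one record: {"message": "hello", "code": 200} */
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pck, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_map(&pck, 2);
    msgpack_pack_str(&pck, 7);
    msgpack_pack_str_body(&pck, "message", 7);
    msgpack_pack_str(&pck, 5);
    msgpack_pack_str_body(&pck, "hello", 5);
    msgpack_pack_str(&pck, 4);
    msgpack_pack_str_body(&pck, "code", 4);
    msgpack_pack_int64(&pck, 200);

    /* Assumed parameter order and 0-on-success convention */
    ret = flb_msgpack_raw_to_parquet(sbuf.data, sbuf.size, schema,
                                     FLB_AWS_COMPRESS_GZIP,
                                     &out, &out_size);
    if (ret == 0) {
        /* `out` now holds a complete Parquet file; the caller owns it
         * and releases it with flb_free() as the header documents */
        flb_free(out);
    }
    msgpack_sbuffer_destroy(&sbuf);
    return ret;
}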

cmake/parquet.cmake (1)

269-278: Variable exports look correct.

All discovered variables are properly exported to the parent scope, including conditional export of ARROW_VERSION. This ensures the discovery results are available to the main CMakeLists.txt.

cmake/kafka.cmake (2)

9-48: Clean simplification of SASL configuration.

The refactoring correctly recognizes that librdkafka has built-in support for SASL/PLAIN, SASL/SCRAM, and SASL/OAUTHBEARER without external dependencies. Only SASL/GSSAPI (Kerberos) requires cyrus-sasl.

Key improvements:

  • FLB_SASL_ENABLED is always ON (correct for built-in mechanisms)
  • FLB_SASL_OAUTHBEARER_ENABLED is always ON (built into librdkafka)
  • FLB_KAFKA_MSK_IAM_ENABLED is always ON (depends on OAuth Bearer)
  • WITH_SASL_CYRUS is conditional on finding cyrus-sasl (for Kerberos only)

This removes unnecessary complexity while maintaining full functionality.


77-81: Helpful feature summary for build diagnostics.

The summary block clearly shows which Kafka features are enabled, aiding troubleshooting of build configurations.

include/fluent-bit/aws/flb_aws_compress.h (1)

24-46: Well-documented enum reorganization with clear deprecation guidance.

The separation of compression algorithms (NONE=0, GZIP=1, SNAPPY=2, ZSTD=3) from format conversions (ARROW=4, PARQUET=5) improves semantic clarity. The deprecation comments provide clear migration paths.

The codebase exclusively uses symbolic constants (FLB_AWS_COMPRESS_*) rather than hardcoded numeric values, so the enum value changes are transparent to all callers and pose no breaking change risk.

Likely an incorrect or invalid review comment.

CMakeLists.txt (2)

1091-1111: Well-designed Homebrew libyaml detection for macOS.

The automatic detection of libyaml via Homebrew (brew --prefix libyaml) is a helpful developer experience improvement for macOS users. The graceful fallback sequence (pkg-config → Homebrew → compile test) ensures broad compatibility.


1337-1355: Parquet encoder integration looks correct.

The Parquet encoder setup properly:

  • Enables C++ and sets C++17 standard (addressing past review feedback)
  • Includes cmake/parquet.cmake for library detection
  • Guards all configuration behind FLB_PARQUET_ENCODER
  • Provides clear warnings when dependencies are missing
tests/internal/parquet.c (2)

33-66: Excellent test organization and coverage.

The test suite provides comprehensive coverage across 38 tests in 9 well-documented categories. The structured organization with clear category headers makes the test suite maintainable and easy to navigate.


74-103: Consistent and correct test pattern.

All tests follow a clean pattern: construct msgpack input, invoke the conversion function, assert success, and cleanup resources. The consistent use of flb_free for output buffers ensures no memory leaks.

src/flb_parquet_impl.cpp (3)

649-727: Immediate record processing correctly avoids dangling pointers.

Processing each record immediately during the msgpack unpack loop (rather than storing msgpack_object* pointers for later) correctly addresses the lifetime issue flagged in previous reviews. The msgpack_unpacked zone is reused on each iteration, so this approach ensures data integrity.
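A minimal C sketch of the pattern being praised here (illustrative only — the real implementation uses a C++ lambda feeding Arrow builders, and process_record is a hypothetical handler): each record is handled inside the unpack loop, before the next msgpack_unpack_next() call can recycle the unpacking zone.

#include <msgpack.h>

static int walk_records(const char *buf, size_t len)
{
    msgpack_unpacked result;
    size_t off = 0;

    msgpack_unpacked_init(&result);
    while (msgpack_unpack_next(&result, buf, len, &off) == MSGPACK_UNPACK_SUCCESS) {
        msgpack_object *rec = &result.data;

        /* Process the record immediately; do NOT store `rec` for later.
         * The next msgpack_unpack_next() call may reuse the unpacked
         * zone, leaving any saved pointers dangling. */
        process_record(rec);   /* hypothetical per-record handler */
    }
    msgpack_unpacked_destroy(&result);
    return 0;
}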


579-952: Robust implementation with comprehensive error handling.

The main conversion function demonstrates solid engineering:

  • Input validation and compression mapping
  • Proper Arrow/Parquet API usage with status checking
  • Immediate record processing to avoid lifetime issues
  • Statistics tracking for data quality monitoring
  • Clean exception handling with resource cleanup
  • Correct memory management between Arrow and Fluent Bit allocators

775-847: Fix missing newlines in debug log statements.

All flb_debug calls in the statistics output section are missing trailing \n characters, which will cause log messages to run together incorrectly. Each flb_debug call should end with a newline.

🔎 Proposed fix for debug log formatting
         if (has_issues) {
-            flb_debug("[parquet] Data quality summary for %zu records:", record_count);
+            flb_debug("[parquet] Data quality summary for %zu records:\n", record_count);

             if (!converter.complex_to_string_by_field.empty()) {
-                flb_debug("[parquet] Complex types serialized to JSON string:");
+                flb_debug("[parquet] Complex types serialized to JSON string:\n");
                 for (const auto& pair : converter.complex_to_string_by_field) {
-                    flb_debug("[parquet]   field='%s' count=%zu",
-                            pair.first.c_str(), pair.second);
+                    flb_debug("[parquet]   field='%s' count=%zu\n",
+                            pair.first.c_str(), pair.second);
                 }
             }

             if (!converter.string_parsed_to_int_by_field.empty()) {
-                flb_debug("[parquet] Strings parsed to integers:");
+                flb_debug("[parquet] Strings parsed to integers:\n");
                 for (const auto& pair : converter.string_parsed_to_int_by_field) {
-                    flb_debug("[parquet]   field='%s' count=%zu",
-                            pair.first.c_str(), pair.second);
+                    flb_debug("[parquet]   field='%s' count=%zu\n",
+                            pair.first.c_str(), pair.second);
                 }
             }

             /* Apply similar fixes to remaining flb_debug calls in this section */

Apply this pattern to all remaining flb_debug calls through line 847.

Likely an incorrect or invalid review comment.

plugins/out_s3/s3.c (2)

744-824: Verify legacy Parquet compression behavior.

The legacy compression=arrow/parquet path sets format=parquet but leaves ctx->compression = FLB_AWS_COMPRESS_NONE (initialized at line 745). A previous review comment suggested this should default to GZIP for backward compatibility. The comment at line 764 says "Keep default no compression," but if the old Arrow/Parquet path applied GZIP by default, this represents a breaking change.

Please verify whether the legacy compression path should default to GZIP when format=parquet is set via the deprecated compression=arrow/parquet values.

Based on the previous review comment at lines 744-824.


42-44: LGTM: Parquet integration is well-implemented.

The Parquet format support is thoughtfully designed:

  1. Clean separation: Parquet uses internal compression (passed to flb_msgpack_raw_to_parquet), while outer compression is correctly skipped for Parquet format throughout the codebase.

  2. Efficient buffering: The architectural change to store raw msgpack data (lines 3953-3962) instead of pre-converting to JSON is a smart optimization that:

    • Avoids repeated conversions when buffering
    • Enables efficient batching for columnar formats
    • Defers format conversion until upload time (line 1598)
  3. Proper memory management: All SDS buffers are consistently managed with flb_sds_create_len and flb_sds_destroy. Arrow-allocated Parquet buffers are correctly freed with flb_free (line 1617) after copying to SDS.

  4. Good error handling: Schema validation (lines 1588-1596), conversion error checking (lines 1605-1613), and cleanup on all error paths are properly implemented.

Also applies to: 1520-1661, 3953-3962

@kalavt kalavt force-pushed the feature/aws-s3-parquet branch from 37a602a to 7f32dd4 Compare December 25, 2025 07:32
@kalavt kalavt force-pushed the feature/aws-s3-parquet branch from 7f32dd4 to fba95e4 Compare December 25, 2025 07:37

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
src/aws/flb_aws_compress.c (1)

46-56: Inconsistent comment about ARROW/PARQUET removal.

Line 53 states "ARROW: DEPRECATED - Arrow is not a proper file format for S3, kept only for compilation", but ARROW has been removed from the compression_options array (it's no longer present in the code). The comment should reflect that ARROW has been removed entirely, not "kept."

🔎 Suggested fix
  * IMPORTANT NOTES:
  * 1. True compression algorithms: none, gzip, snappy, zstd
  * 2. Format converters:
- *    - ARROW: DEPRECATED - Arrow is not a proper file format for S3, kept only for compilation
- *    - PARQUET: Valid file format converter (deprecated: use format=parquet instead)
+ *    - ARROW: Removed (was not a proper file format for S3)
+ *    - PARQUET: Removed as compression option (use format=parquet instead)
  * 3. Supported S3 output formats: json (FLB_S3_FORMAT_JSON), parquet (FLB_S3_FORMAT_PARQUET)
src/flb_parquet_impl.cpp (1)

638-756: On-the-fly record processing correctly addresses the msgpack_object lifetime issue.

The implementation now processes records immediately via the process_record lambda instead of storing msgpack_object pointers. This fixes the previous dangling pointer concern where msgpack_unpack_next would invalidate earlier objects.

plugins/out_s3/s3.c (1)

744-824: Legacy compression handling comment is misleading.

The comment at line 764 states "Keep default gzip compression", but the code does not set ctx->compression when the deprecated compression=arrow|parquet values are used. This leaves ctx->compression at its default value of FLB_AWS_COMPRESS_NONE (line 745), resulting in uncompressed Parquet output rather than the gzip compression implied by the comment.

If the intent is to maintain backward compatibility with gzip compression for legacy users, update the code to explicitly set the compression:

#ifdef FLB_HAVE_PARQUET_ENCODER
    ctx->format = FLB_S3_FORMAT_PARQUET;
    if (ctx->compression == FLB_AWS_COMPRESS_NONE) {
        ctx->compression = FLB_AWS_COMPRESS_GZIP;
    }
#endif

Otherwise, update the comment to accurately reflect that no compression is applied by default.


🧹 Nitpick comments (3)
tests/internal/parquet.c (1)

27-31: Consider including the actual header instead of redefining compression constants.

These compression constants are duplicated from flb_aws_compress.h. If the values in the header change, these local definitions will become out of sync, causing tests to use incorrect compression types.

🔎 Proposed fix
-/* Compression types from flb_aws_compress.h */
-#define FLB_AWS_COMPRESS_NONE    0
-#define FLB_AWS_COMPRESS_GZIP    1
-#define FLB_AWS_COMPRESS_SNAPPY  2
-#define FLB_AWS_COMPRESS_ZSTD    3
+#include <fluent-bit/aws/flb_aws_compress.h>

If including the header causes issues due to dependencies, consider adding a comment noting the values must stay synchronized with the source header.

src/flb_parquet_impl.cpp (1)

219-220: Float to int64 conversion may lose precision for large values.

When converting float64 to int64, very large float values (beyond int64 range) could produce undefined behavior or unexpected results. Consider adding bounds checking similar to convert_to_int32.

🔎 Proposed fix
     } else if (obj->type == MSGPACK_OBJECT_FLOAT32 || obj->type == MSGPACK_OBJECT_FLOAT64) {
-        return builder->Append(static_cast<int64_t>(obj->via.f64));
+        /* Check bounds for float to int64 conversion */
+        if (obj->via.f64 > static_cast<double>(INT64_MAX) || 
+            obj->via.f64 < static_cast<double>(INT64_MIN)) {
+            float_to_int_clamp_by_field[field_name]++;
+            return builder->Append(obj->via.f64 > 0 ? INT64_MAX : INT64_MIN);
+        }
+        return builder->Append(static_cast<int64_t>(obj->via.f64));
     }
plugins/out_s3/s3.c (1)

4219-4231: Consider validating schema_str during initialization.

The new format and schema_str configuration parameters are well-documented. However, validation that schema_str is provided when format=parquet currently happens at flush time (lines 1588-1596 in construct_request_buffer). Consider adding this validation to cb_s3_init to fail fast rather than discovering the configuration error when the first chunk is processed:

/* In cb_s3_init, after format parsing */
#ifdef FLB_HAVE_PARQUET_ENCODER
    if (ctx->format == FLB_S3_FORMAT_PARQUET && !ctx->schema_str) {
        flb_plg_error(ctx->ins, "schema_str is required when format=parquet");
        return -1;
    }
#endif

This would improve the user experience by providing immediate feedback on configuration errors.


📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a409e9f and 7f32dd4.

📒 Files selected for processing (25)
  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/kafka.cmake
  • cmake/parquet.cmake
  • examples/s3_output/README.md
  • examples/s3_output/athena-queries.sql
  • examples/s3_output/fluent-bit-s3-parquet.conf
  • examples/s3_output/run-s3-parquet-test.sh
  • include/fluent-bit/aws/flb_aws_compress.h
  • include/fluent-bit/flb_parquet.h
  • lib/avro/CMakeLists.txt
  • lib/jansson-e23f558/CMakeLists.txt
  • plugins/out_kinesis_firehose/firehose.c
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • src/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/flb_aws_compress.c
  • src/flb_parquet_impl.cpp
  • tests/internal/CMakeLists.txt
  • tests/internal/parquet.c
💤 Files with no reviewable changes (5)
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/compress.h
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (7)
  • .github/workflows/unit-tests.yaml
  • plugins/out_kinesis_firehose/firehose.c
  • lib/jansson-e23f558/CMakeLists.txt
  • examples/s3_output/run-s3-parquet-test.sh
  • include/fluent-bit/flb_parquet.h
  • examples/s3_output/fluent-bit-s3-parquet.conf
  • lib/avro/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (15)
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.

Applied to files:

  • examples/s3_output/README.md
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.

Applied to files:

  • cmake/parquet.cmake
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.

Applied to files:

  • cmake/parquet.cmake
  • src/CMakeLists.txt
  • CMakeLists.txt
  • cmake/kafka.cmake
  • tests/internal/CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.

Applied to files:

  • cmake/parquet.cmake
  • src/CMakeLists.txt
  • CMakeLists.txt
  • cmake/kafka.cmake
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.

Applied to files:

  • src/aws/flb_aws_compress.c
  • plugins/out_s3/s3.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • cmake/kafka.cmake
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.

Applied to files:

  • src/aws/flb_aws_compress.c
  • plugins/out_s3/s3.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • src/flb_parquet_impl.cpp
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.

Applied to files:

  • src/aws/flb_aws_compress.c
  • plugins/out_s3/s3.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.

Applied to files:

  • src/aws/flb_aws_compress.c
  • plugins/out_s3/s3.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.

Applied to files:

  • src/aws/flb_aws_compress.c
  • plugins/out_s3/s3.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.

Applied to files:

  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.

Applied to files:

  • plugins/out_s3/s3.c
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.

Applied to files:

  • src/CMakeLists.txt
  • CMakeLists.txt
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.

Applied to files:

  • src/CMakeLists.txt
🧬 Code graph analysis (3)
src/aws/flb_aws_compress.c (2)
src/flb_gzip.c (1)
  • flb_gzip_compress (157-252)
src/flb_zstd.c (1)
  • flb_zstd_compress (33-57)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
  • s3_put_object (1663-1800)
src/flb_parquet_impl.cpp (1)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c

[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter

(syntaxError)

🪛 LanguageTool
examples/s3_output/README.md

[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md

112-112: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (29)
src/aws/flb_aws_compress.c (2)

31-37: LGTM! Clean adapter implementation.

The wrapper function correctly adapts the flb_snappy_compress signature to match the compression interface, safely casting between void* and char* pointer types.


26-26: No issues found. SNAPPY support is built unconditionally (flb_snappy.c is listed directly in CMakeLists.txt without guards) and its headers are included without conditional compilation guards across 12 files, consistent with how ZSTD and GZIP are handled as core compression libraries.

examples/s3_output/athena-queries.sql (1)

15-80: LGTM! Well-structured Athena table definitions.

Both external table definitions follow Athena best practices with proper partition projection configuration, appropriate column types, and correct SerDe/storage format specifications.

examples/s3_output/README.md (1)

351-364: LGTM! Clear and accurate Parquet compression documentation.

The compression section correctly documents Parquet's internal compression behavior, default values, and the available compression algorithms (GZIP, SNAPPY, ZSTD, NONE) with appropriate examples.

include/fluent-bit/aws/flb_aws_compress.h (1)

24-46: LGTM! Clear documentation of compression vs format conversion.

The updated comments clearly distinguish between true compression algorithms (gzip, snappy, zstd) and file format converters (deprecated ARROW, PARQUET), with appropriate deprecation notices and migration guidance to use format=parquet instead of compression=parquet.

plugins/out_s3/s3.h (1)

196-203: LGTM! Clean public API additions for format support.

The new format field and schema_str field, along with the format type constants (FLB_S3_FORMAT_JSON, FLB_S3_FORMAT_PARQUET), provide a clear API for specifying output formats and Parquet schemas. The inline comments appropriately document the purpose of each field.

tests/internal/CMakeLists.txt (1)

136-141: LGTM! Parquet test integration properly implemented.

The Parquet test file is correctly added under the FLB_PARQUET_ENCODER guard and properly linked with ARROW_LIBRARIES and PARQUET_LIBRARIES, mirroring the existing Avro encoder pattern. The previous issue with missing linkage has been addressed.

Also applies to: 238-240

src/CMakeLists.txt (2)

157-166: LGTM! Proper C++ source integration for Parquet encoder.

The Parquet implementation file is correctly added under the FLB_PARQUET_ENCODER guard with the LANGUAGE CXX property explicitly set, ensuring proper C++ compilation for the Arrow/Parquet integration.


398-405: LGTM! Correct library linkage for Parquet encoder.

The Arrow and Parquet libraries are properly linked when FLB_PARQUET_ENCODER is enabled, providing the necessary dependencies for the Parquet encoding functionality.

cmake/parquet.cmake (5)

1-38: LGTM! Well-structured multi-platform detection module.

The module header clearly documents exported variables and implements a logical three-method detection strategy (pkg-config, CMake Config, manual search) with appropriate platform detection. The initial pkg-config approach is clean and efficient for Linux/macOS systems.


88-147: LGTM! Comprehensive platform-specific search path configuration.

The manual search section properly configures platform-specific paths for Windows (vcpkg, chocolatey), macOS (Homebrew with Apple Silicon detection, MacPorts), and Linux (standard system paths with 64/32-bit lib directory handling). The search suffixes and library name variants are appropriate for each platform.


201-217: LGTM! Robust Arrow version detection.

The version extraction from arrow/util/config.h using regex is clean and includes a helpful minimum version check (10.0.0) with a warning for older versions. The optional nature of version detection (with the existence check) is appropriate.


222-264: LGTM! Comprehensive installation guidance.

The installation hints provide clear, platform-specific instructions for Windows (vcpkg), macOS (Homebrew, MacPorts), and Linux (Ubuntu/Debian, RHEL/CentOS) with proper commands and links. The conditional display based on FLB_PARQUET_ENCODER and the option to disable with -DFLB_PARQUET_ENCODER=Off is helpful.


173-195: This review comment is incorrect. The current code at line 178 in cmake/parquet.cmake uses PATH_SUFFIXES ${INCLUDE_SUFFIXES} without the extra parquet suffix you referenced. The Parquet header search correctly follows the same pattern as the Arrow search above it, searching only in standard include paths without doubling the parquet directory component. No action is needed.

Likely an incorrect or invalid review comment.

cmake/kafka.cmake (2)

9-48: Well-structured SASL configuration with clear separation of concerns.

The refactored SASL handling correctly distinguishes between librdkafka's built-in SASL mechanisms (PLAIN/SCRAM/OAUTHBEARER) and external Cyrus-SASL (for GSSAPI/Kerberos). The fallback detection path for environments without pkg-config is a good addition for broader compatibility.


76-81: Helpful feature summary for build diagnostics.

The summary block provides clear visibility into what Kafka features are enabled, which aids debugging configuration issues.

tests/internal/parquet.c (2)

33-66: Excellent test organization and comprehensive coverage.

The test suite is well-documented with clear categories covering basic functionality, data types, schema variations, type conversions, boundary values, nullable handling, complex types, compression, and error handling. The 38 tests provide thorough coverage of the Parquet encoder.


1255-1311: Test list is complete and properly organized.

All 38 tests are registered with descriptive names matching their categories. The NULL terminator is correctly placed.

src/flb_parquet_impl.cpp (5)

39-42: Forward declaration is a reasonable workaround for mpack.h compatibility.

The comment clearly explains why flb_msgpack_to_json_str is forward-declared instead of including flb_pack.h. This avoids the C function overloading conflict with C++ compilation.


210-216: Int64 overflow handling correctly addresses previous review feedback.

The overflow check for uint64 → int64 conversion with clamping to INT64_MAX properly handles the edge case where obj->via.u64 > INT64_MAX.
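The clamping described here boils down to a check of this shape (sketch of the logic only; the real code appends the value through an Arrow Int64Builder):

#include <stdint.h>

/* Clamp an unsigned msgpack value into a signed int64 column. Values
 * above INT64_MAX cannot be represented, so they are pinned to INT64_MAX
 * instead of wrapping around to a negative number on the cast. */
static int64_t clamp_u64_to_i64(uint64_t v)
{
    if (v > (uint64_t) INT64_MAX) {
        return INT64_MAX;
    }
    return (int64_t) v;
}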


364-368: Clear documentation of timestamp unit handling semantics.

The comment clearly explains that the implementation performs type conversion only, not unit conversion. This addresses the previous review concern and properly documents the user's responsibility.


409-541: Schema parser is well-implemented with comprehensive type support.

The parser correctly handles both string and object notation for types, provides clear error messages with schema context, and properly frees yyjson resources on all code paths.


939-951: Exception handling is correctly structured.

The catch blocks properly handle both Parquet-specific and general exceptions, with appropriate cleanup of output_buffer when allocated.

CMakeLists.txt (4)

202-202: New FLB_PARQUET_ENCODER option correctly added.

The option defaults to No, requiring explicit enablement, which is appropriate for a feature with external dependencies.


284-288: Appropriate exclusion of Parquet encoder from FLB_ALL.

The comment clearly explains that FLB_PARQUET_ENCODER requires external dependencies and must be explicitly enabled. This prevents build failures when Arrow/Parquet libraries are not installed.


1091-1111: Homebrew detection improves macOS developer experience.

The automatic detection of libyaml via Homebrew on macOS reduces friction for developers who have installed dependencies through Homebrew.


1337-1355: Parquet encoder configuration is well-structured.

The configuration correctly:

  1. Enables C++ and sets C++17 standard (required by Arrow/Parquet)
  2. Uses a dedicated cmake/parquet.cmake module for library detection
  3. Gracefully disables the feature with warnings if dependencies are missing

This addresses the previous review concern about C++ standard version.

plugins/out_s3/s3.c (2)

1582-1631: Parquet conversion implementation looks solid.

The Parquet conversion path properly:

  • Validates required schema_str parameter (lines 1588-1596)
  • Converts msgpack to Parquet using the Arrow encoder (lines 1598-1603)
  • Manages memory consistently with SDS buffers (lines 1615-1627)
  • Handles errors and increments failure counters appropriately

The ctx->compression value passed to flb_msgpack_raw_to_parquet (line 1601) correctly delegates compression to the internal Parquet encoder, avoiding redundant outer compression layers for the Parquet format.
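The buffer hand-off described above follows a common Fluent Bit pattern (sketch, assuming the encoder returns a heap buffer freed with flb_free as documented): copy the encoder output into an flb_sds_t, release the raw allocation, and let the upload path own the SDS buffer.

#include <fluent-bit/flb_sds.h>
#include <fluent-bit/flb_mem.h>

/* `out` / `out_size` come from flb_msgpack_raw_to_parquet() */
static flb_sds_t wrap_parquet_output(void *out, size_t out_size)
{
    flb_sds_t body;

    body = flb_sds_create_len(out, out_size);  /* copy into an SDS buffer  */
    flb_free(out);                             /* raw output no longer needed */
    return body;                               /* NULL on allocation failure  */
}

The caller later releases the returned buffer with flb_sds_destroy(), matching the cleanup paths noted in this review.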



1244-1277: LGTM: Compression handling with SDS buffers.

The compression implementation correctly:

  • Skips outer compression for Parquet format, which handles compression internally (line 1245)
  • Converts compressed data to flb_sds_t for consistent memory management (lines 1260-1263)
  • Frees temporary compression buffer immediately after conversion (line 1262)
  • Handles errors appropriately with proper chunk unlock and failure tracking

The multiple cleanup points throughout the function (lines 1333-1335, 1361-1364, 1376-1379, 1387-1391) ensure compressed_body is destroyed in all code paths.


@kalavt
Author

kalavt commented Dec 25, 2025

@patrick-stephens @edsiper @cosmo0920 can we have package test here?

Implement Parquet format writer with Apache Arrow C++ library for
efficient columnar storage. This core implementation provides the foundation
for Parquet output support in fluent-bit.

Signed-off-by: Arbin Cheng <[email protected]>
Signed-off-by: Arbin <[email protected]>
@kalavt kalavt force-pushed the feature/aws-s3-parquet branch from fba95e4 to 37db121 Compare December 29, 2025 15:35

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (6)
cmake/kafka.cmake (2)

46-50: Minor: Inconsistent casing in WITH_SSL option.

Line 48 uses On while other options use ON. Consider using consistent casing for maintainability.

🔎 Proposed fix
-FLB_OPTION(WITH_SSL On)                                        # SSL support
+FLB_OPTION(WITH_SSL ON)                                        # SSL support

52-65: Dead code: else branches are unreachable.

Since FLB_SASL_ENABLED (line 38) and FLB_SASL_OAUTHBEARER_ENABLED (line 41) are unconditionally set to ON, the else branches at lines 56-58 and 63-65 will never execute. Consider removing them or, if you want to preserve the option for future conditional builds, use a CMake option that can be overridden.

🔎 Proposed simplification
 # Export compile-time definitions using FLB_DEFINITION macro
-if(FLB_SASL_ENABLED)
-  FLB_DEFINITION(FLB_HAVE_KAFKA_SASL)
-  message(STATUS "Kafka SASL authentication: ENABLED")
-else()
-  message(STATUS "Kafka SASL authentication: DISABLED")
-endif()
+# SASL is always enabled (built-in support)
+FLB_DEFINITION(FLB_HAVE_KAFKA_SASL)
+message(STATUS "Kafka SASL authentication: ENABLED")

-if(FLB_SASL_OAUTHBEARER_ENABLED)
-  FLB_DEFINITION(FLB_HAVE_KAFKA_OAUTHBEARER)
-  message(STATUS "Kafka OAuth Bearer: ENABLED")
-else()
-  message(STATUS "Kafka OAuth Bearer: DISABLED")
-endif()
+# OAuth Bearer is always enabled (built-in support)
+FLB_DEFINITION(FLB_HAVE_KAFKA_OAUTHBEARER)
+message(STATUS "Kafka OAuth Bearer: ENABLED")
include/fluent-bit/flb_parquet.h (1)

23-25: Consider removing unused include.

flb_sds.h is included but neither flb_sds_t nor any SDS-related types appear in the function signature or the header. This include appears unnecessary for this header.

🔎 Proposed fix
-#include <fluent-bit/flb_sds.h>
 #include <msgpack.h>
 #include <stdlib.h>
cmake/parquet.cmake (1)

222-228: Unicode characters in CMake messages may cause issues.

The checkmarks (✓/✗) in status messages may not render correctly on all terminals or CI systems. Consider using ASCII alternatives like [OK] and [MISSING] for broader compatibility.

🔎 Proposed fix
 if(ARROW_FOUND AND PARQUET_FOUND)
-  message(STATUS "✓ Arrow found: ${ARROW_LIBRARIES}")
+  message(STATUS "[OK] Arrow found: ${ARROW_LIBRARIES}")
   message(STATUS "  Include dirs: ${ARROW_INCLUDE_DIRS}")
-  message(STATUS "✓ Parquet found: ${PARQUET_LIBRARIES}")
+  message(STATUS "[OK] Parquet found: ${PARQUET_LIBRARIES}")
   message(STATUS "  Include dirs: ${PARQUET_INCLUDE_DIRS}")
 else()
-  message(STATUS "✗ Arrow/Parquet not found")
+  message(STATUS "[MISSING] Arrow/Parquet not found")
tests/internal/parquet.c (1)

27-31: Duplicated compression constants may drift out of sync.

These constants duplicate values from flb_aws_compress.h. If the upstream values change, these tests would silently use incorrect values.

🔎 Proposed fix

Consider including the header directly or adding a compile-time assertion:

+#include <fluent-bit/aws/flb_aws_compress.h>
+
 /* Compression types from flb_aws_compress.h */
-#define FLB_AWS_COMPRESS_NONE    0
-#define FLB_AWS_COMPRESS_GZIP    1
-#define FLB_AWS_COMPRESS_SNAPPY  2
-#define FLB_AWS_COMPRESS_ZSTD    3
+/* Using values from flb_aws_compress.h */

Or if the header cannot be included due to dependencies, add a static assertion in the implementation to ensure sync.

plugins/out_s3/s3.c (1)

1290-1290: Consider clarifying compression behavior for different formats.

The compression parameter has different behavior depending on format:

  • format=json: Outer compression applied to entire JSON file (line 1290, 1294)
  • format=parquet: Internal compression within Parquet file (line 1290 skips outer, line 1509 passes to encoder)

While this design is architecturally sound (Parquet handles compression natively), the configuration documentation at lines 4077-4079 could explicitly mention this distinction to avoid user confusion:

"Compression type for S3 objects. Supported values: `none`, `gzip`, `snappy`, `zstd`. Default: `none`. 
When format=json, compression is applied to the entire output file. 
When format=parquet, compression is applied internally within the Parquet file format."

Also applies to: 1509-1509, 4077-4079

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f32dd4 and 37db121.

📒 Files selected for processing (26)
  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/kafka.cmake
  • cmake/parquet.cmake
  • examples/s3_output/README.md
  • examples/s3_output/fluent-bit-direct-s3.conf
  • examples/s3_output/fluent-bit-kafka-sink-s3.conf
  • include/fluent-bit/aws/flb_aws_compress.h
  • include/fluent-bit/flb_parquet.h
  • lib/avro/CMakeLists.txt
  • lib/jansson-e23f558/CMakeLists.txt
  • plugins/out_kinesis_firehose/firehose.c
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • plugins/out_s3/s3_multipart.c
  • src/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/flb_aws_compress.c
  • src/flb_parquet_impl.cpp
  • tests/internal/CMakeLists.txt
  • tests/internal/parquet.c
  • tests/runtime/out_s3.c
💤 Files with no reviewable changes (5)
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/arrow/compress.h
🚧 Files skipped from review as they are similar to previous changes (5)
  • .github/workflows/unit-tests.yaml
  • tests/internal/CMakeLists.txt
  • plugins/out_kinesis_firehose/firehose.c
  • lib/avro/CMakeLists.txt
  • lib/jansson-e23f558/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (17)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.

Applied to files:

  • tests/runtime/out_s3.c
  • examples/s3_output/fluent-bit-direct-s3.conf
  • plugins/out_s3/s3.c
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.

Applied to files:

  • cmake/parquet.cmake
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.

Applied to files:

  • cmake/parquet.cmake
  • CMakeLists.txt
  • cmake/kafka.cmake
  • src/CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.

Applied to files:

  • cmake/parquet.cmake
  • CMakeLists.txt
  • cmake/kafka.cmake
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.

Applied to files:

  • include/fluent-bit/aws/flb_aws_compress.h
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • cmake/kafka.cmake
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.

Applied to files:

  • include/fluent-bit/aws/flb_aws_compress.h
  • include/fluent-bit/flb_parquet.h
  • src/flb_parquet_impl.cpp
  • CMakeLists.txt
  • plugins/out_s3/s3.c
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.

Applied to files:

  • include/fluent-bit/aws/flb_aws_compress.h
  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.

Applied to files:

  • include/fluent-bit/aws/flb_aws_compress.h
  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.

Applied to files:

  • include/fluent-bit/aws/flb_aws_compress.h
  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.

Applied to files:

  • include/fluent-bit/flb_parquet.h
  • examples/s3_output/README.md
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.

Applied to files:

  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.

Applied to files:

  • CMakeLists.txt
  • src/CMakeLists.txt
📚 Learning: 2025-09-04T12:32:46.030Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:0-0
Timestamp: 2025-09-04T12:32:46.030Z
Learning: In Fluent Bit plugins, avoid creating plugin-local config_map entries that duplicate core functionality. The core already provides Retry_Limit parsed into ins->retry_limit, so plugins should use that directly rather than defining their own retry_limit configuration option.

Applied to files:

  • plugins/out_s3/s3.c
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.

Applied to files:

  • src/CMakeLists.txt
🧬 Code graph analysis (4)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
  • s3_put_object (1571-1708)
include/fluent-bit/flb_parquet.h (1)
src/flb_parquet_impl.cpp (2)
  • flb_msgpack_raw_to_parquet (580-958)
  • flb_msgpack_raw_to_parquet (580-583)
src/flb_parquet_impl.cpp (1)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
src/aws/flb_aws_compress.c (2)
src/flb_gzip.c (1)
  • flb_gzip_compress (157-252)
src/flb_zstd.c (1)
  • flb_zstd_compress (33-57)
🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md

7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


91-91: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


130-130: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


135-135: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


140-140: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


149-149: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


162-162: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


292-292: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


814-814: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (50)
cmake/kafka.cmake (4)

9-14: Clear documentation of SASL capabilities.

Good addition of comments explaining librdkafka's built-in SASL mechanisms. This helps maintainers understand why cyrus-sasl is only needed for GSSAPI/Kerberos.


15-35: Cyrus-SASL detection logic looks correct.

The fallback detection when pkg-config is unavailable properly sets SASL_LIBRARIES and SASL_INCLUDE_DIRS, which are used later in line 79-81.


78-82: SASL library linking logic is correct.

The conditional properly guards against appending undefined SASL_LIBRARIES and only links cyrus-sasl when it was actually found.


84-89: Useful feature summary for build diagnostics.

This summary helps quickly verify which Kafka features are enabled during the build.

tests/runtime/out_s3.c (1)

38-38: LGTM! Test configurations now explicitly specify upload modes.

The addition of use_put_object settings and size parameters (total_file_size, upload_chunk_size) makes test intentions clearer:

  • Multipart tests explicitly set use_put_object=false
  • PutObject tests set use_put_object=true with appropriate size boundaries

This improves test maintainability and reduces ambiguity about which upload path is being tested.

Also applies to: 74-75, 112-113, 151-151, 187-187, 223-223

examples/s3_output/README.md (1)

1-868: Comprehensive documentation for S3 output best practices.

This extensive guide covers production deployment scenarios, format selection, cost optimization, and operational considerations. The documentation quality is excellent and provides valuable guidance for users.

Note: Static analysis tools flagged missing language identifiers on some fenced code blocks (lines 7, 91, 149, 162, 292, 814). While these are valid markdown formatting suggestions, they are minor presentation issues and don't affect the technical accuracy or usability of the documentation.

include/fluent-bit/aws/flb_aws_compress.h (1)

24-46: Clear deprecation strategy with comprehensive documentation.

The changes effectively:

  1. Add SNAPPY (2) and ZSTD (3) as true compression algorithms
  2. Clearly document ARROW (4) as deprecated and PARQUET (5) as a format converter
  3. Provide detailed comments explaining the distinction between compression algorithms and file format conversion

The deprecation markers maintain backward compatibility while guiding users toward the correct usage (format=parquet instead of compression=parquet).

Based on learnings, ZSTD compression is always available in Fluent Bit and doesn't require conditional compilation guards, which aligns with this implementation.
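For quick reference, the constants discussed in this comment are (values as stated above; whether they are macros or enum members is an assumption, and the deprecated entries remain only for source compatibility):

/* include/fluent-bit/aws/flb_aws_compress.h -- values as described above */
#define FLB_AWS_COMPRESS_NONE     0   /* no compression                    */
#define FLB_AWS_COMPRESS_GZIP     1   /* gzip                              */
#define FLB_AWS_COMPRESS_SNAPPY   2   /* snappy                            */
#define FLB_AWS_COMPRESS_ZSTD     3   /* zstd                              */
#define FLB_AWS_COMPRESS_ARROW    4   /* deprecated format converter       */
#define FLB_AWS_COMPRESS_PARQUET  5   /* deprecated: use format=parquet    */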

src/aws/flb_aws_compress.c (2)

31-37: Snappy compression support added with proper signature adaptation.

The wrapper function correctly adapts flb_snappy_compress signature (char* based) to match the compression option interface (void* based), following the same pattern as gzip and zstd implementations.
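
A minimal sketch of that adapter pattern, assuming flb_snappy_compress() keeps the char *-based signature described above; the wrapper name and the exact option-table signature are illustrative:

```c
#include <stddef.h>
#include <fluent-bit/flb_snappy.h>

/* Illustrative wrapper: the compression option table works with void *
 * byte buffers, while flb_snappy_compress() is declared in terms of char *.
 * The casts are safe because both describe the same raw bytes. */
static int snappy_compress_wrapper(void *in_data, size_t in_len,
                                   void **out_data, size_t *out_len)
{
    return flb_snappy_compress((char *) in_data, in_len,
                               (char **) out_data, out_len);
}
```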


45-78: Compression options cleaned up to include only true compression algorithms.

The updated comments and implementation:

  • Clearly distinguish compression algorithms (gzip, snappy, zstd) from format converters
  • Remove ARROW and PARQUET from the compression_options array
  • Document that ARROW support has been removed and PARQUET should use format=parquet

This aligns with the deprecation strategy in the header file and improves clarity.

Based on learnings, ZSTD compression is always available in Fluent Bit and doesn't require conditional compilation guards.

plugins/out_s3/s3_multipart.c (1)

465-466: Improved logging verbosity by using debug level for detailed operations.

The logging changes reduce verbosity in production environments:

  • Routine multipart operations (create, upload part) now use debug level
  • Final completion remains at info level for visibility
  • Sensitive metadata (UploadId, ETag) moved to debug level or removed

This improves the signal-to-noise ratio in logs while retaining important information at appropriate levels.

Also applies to: 621-623, 753-754, 766-768, 776-777

examples/s3_output/fluent-bit-direct-s3.conf (1)

1-272: Comprehensive example configuration demonstrating S3 direct upload patterns.

This configuration file provides excellent guidance for users with:

  • Multiple input source examples (Kubernetes, systemd, application logs, TCP)
  • Proper filter configuration for metadata enrichment
  • Four distinct output scenarios (production with buffering, minimal buffering, high-throughput, multi-destination)
  • Clear documentation of store_dir behavior and recommendations
  • Parquet format usage with schema definitions

The inline comments effectively explain configuration choices and trade-offs, making this a valuable reference for users.

plugins/out_s3/s3.h (3)

44-45: AWS S3 PutObject 5GB limit properly documented.

The constant MAX_FILE_SIZE_PUT_OBJECT is set to 5000000000 (5GB) with a clear comment, matching the AWS S3 PutObject API limitation. This provides a clear boundary for the use_put_object mode.


193-200: Format configuration fields added for JSON and Parquet support.

The new fields enable format selection:

  • format: Specifies output format (JSON or PARQUET)
  • schema_str: Stores JSON schema string for Parquet format configuration

The format constants (FLB_S3_FORMAT_JSON = 0, FLB_S3_FORMAT_PARQUET = 1) provide clear type-safe values for format selection.
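
Taken together, the header additions summarized in the comments above would look roughly like the following sketch; field placement and comments are assumptions, while the names and values come from the review:

```c
/* Sketch of the s3.h additions summarized above (not the literal diff). */
#define FLB_S3_FORMAT_JSON        0
#define FLB_S3_FORMAT_PARQUET     1

/* AWS S3 PutObject accepts single objects up to 5 GB */
#define MAX_FILE_SIZE_PUT_OBJECT  5000000000

struct flb_s3 {
    /* ... existing members ... */
    int       format;       /* FLB_S3_FORMAT_JSON or FLB_S3_FORMAT_PARQUET */
    flb_sds_t schema_str;   /* JSON schema string, required for Parquet    */
};
```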


230-231: Function signature updated to use SDS buffer type.

The s3_put_object function now uses flb_sds_t buffer instead of char *body, aligning with Fluent Bit's SDS (Simple Dynamic String) memory management pattern. This change provides better memory safety and integrates with the broader SDS-based refactoring in the S3 plugin.
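
A hypothetical caller sketch of the SDS lifecycle this implies; the helper name and the non-buffer arguments of s3_put_object() are placeholders for illustration:

```c
#include <time.h>
#include <fluent-bit/flb_sds.h>
#include "s3.h"   /* plugin header providing struct flb_s3 and s3_put_object() */

/* Illustration only: raw_data/raw_len stand in for the chunk bytes, and the
 * argument list other than the SDS buffer is an assumption. */
static int upload_with_sds(struct flb_s3 *ctx, const char *tag,
                           time_t create_time,
                           const char *raw_data, size_t raw_len)
{
    int ret;
    flb_sds_t body;

    body = flb_sds_create_len(raw_data, raw_len);   /* copy bytes into an SDS buffer */
    if (body == NULL) {
        return -1;
    }

    ret = s3_put_object(ctx, tag, create_time, body, flb_sds_len(body));
    flb_sds_destroy(body);                          /* caller owns and releases it */
    return ret;
}
```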

examples/s3_output/fluent-bit-kafka-sink-s3.conf (3)

117-165: Critical guidance on Kafka message processing correctly documented.

The FILTER section documentation clearly explains:

  • Why nest filter (not parser) must be used when Kafka INPUT uses Format json
  • The structure of Kafka messages with nested payload objects
  • The three-step processing pipeline (lift, remove metadata, rename timestamp)
  • Common mistakes to avoid (using parser filter, setting Add_prefix)

This prevents a common pitfall where users might try to parse an already-parsed object.


253-270: High-traffic Parquet configuration demonstrates best practices.

The container logs output configuration shows:

  • Size-based upload trigger (500M) with time-based fallback (60m)
  • Parquet format for high-volume scenarios
  • Properly defined schema with appropriate types and nullability
  • use_put_object=true with clear guidance about when to use multipart
  • preserve_data_ordering=false for better performance in analytics scenarios

The inline comments explain the rationale and expected behavior, making this a valuable reference for production deployments.


304-356: Comprehensive configuration guidelines provide operational clarity.

The guidelines section covers all critical aspects:

  • Bucket configuration and IAM permissions
  • Authentication options (MSK IAM, AWS credentials)
  • Buffer sizing strategy and disk monitoring
  • Upload strategy selection based on traffic patterns
  • Format and compression selection criteria
  • Monitoring and testing recommendations
  • Cost optimization considerations

This makes the configuration file self-documenting and reduces the need for external documentation references.

include/fluent-bit/flb_parquet.h (1)

82-85: Well-documented public API.

The function signature is clean with clear ownership semantics. The extensive documentation covering schema format, supported types, and example usage is excellent for API discoverability.

src/CMakeLists.txt (2)

157-166: LGTM - Parquet encoder source wiring.

The conditional inclusion of flb_parquet_impl.cpp with explicit C++ language setting follows the established pattern for optional encoders in this codebase.


403-410: LGTM - Parquet library dependencies.

The dependency linking pattern mirrors the existing AVRO_ENCODER block (lines 395-401), maintaining consistency with the codebase conventions.

cmake/parquet.cmake (2)

1-38: Well-structured library detection module.

The three-method detection approach (pkg-config → CMake Config → manual search) provides robust cross-platform support with sensible fallbacks.


266-279: Variables properly exported to parent scope.

All discovery results are correctly propagated to the parent scope, enabling downstream CMakeLists.txt files to use the detected libraries.

tests/internal/parquet.c (3)

74-103: LGTM - Basic conversion test.

The test correctly initializes msgpack, calls the API, validates the result, and properly cleans up resources. The pattern is well-structured.


1157-1167: Error handling test validates NULL input rejection.

Good coverage of the NULL input parameter case.


1255-1311: Comprehensive test list registration.

All 38 tests are properly registered with descriptive names organized by category. The NULL sentinel terminates the list correctly.

CMakeLists.txt (3)

202-202: New public option for Parquet encoder.

The option follows the existing naming convention (FLB_*_ENCODER) and defaults to No, which is appropriate for a feature with external dependencies.


284-288: Good decision to exclude Parquet from FLB_ALL.

Requiring explicit -DFLB_PARQUET_ENCODER=On is the right choice since Arrow/Parquet are external dependencies that may not be available on all systems. The comment clearly documents this decision.


1337-1355: Parquet encoder wiring looks correct.

C++17 is properly enabled (required for modern Arrow), and the cmake/parquet.cmake module is included for library detection. The graceful fallback to disable the encoder when libraries aren't found is appropriate.

src/flb_parquet_impl.cpp (8)

39-42: Forward declaration avoids C++ compilation conflicts.

Good solution to avoid including flb_pack.h which transitively includes mpack.h that has C function overloading incompatible with C++ compilation.


179-198: Robust string-to-int32 parsing with overflow handling.

The implementation correctly uses strtol with full error checking (errno, endptr) and handles overflow by clamping to INT32_MIN/INT32_MAX with statistics tracking.
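
A standalone sketch of that parsing pattern in plain C; the function name and the behaviour for non-numeric input are illustrative, not the exact code in flb_parquet_impl.cpp:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal string into int32, clamping on overflow instead of
 * wrapping. Real code would also bump a statistics counter when clamping. */
static int32_t parse_int32_clamped(const char *str)
{
    char *endptr = NULL;
    long  value;

    errno = 0;
    value = strtol(str, &endptr, 10);

    if (endptr == str) {
        return 0;                /* no digits at all (illustrative fallback) */
    }
    if (errno == ERANGE || value > INT32_MAX) {
        return (value < 0) ? INT32_MIN : INT32_MAX;
    }
    if (value < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t) value;
}
```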


211-217: Overflow handling for uint64 → int64 conversion addressed.

The previous review concern about potential overflow when converting large uint64 values to int64 has been properly addressed with a range check and clamping to INT64_MAX.
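
The check itself is small; a sketch in plain C for clarity:

```c
#include <stdint.h>

/* msgpack positive integers arrive as uint64_t; an Arrow int64 column can
 * only represent values up to INT64_MAX, so larger inputs are clamped. */
static int64_t uint64_to_int64_clamped(uint64_t value)
{
    if (value > (uint64_t) INT64_MAX) {
        return INT64_MAX;        /* clamp instead of wrapping negative */
    }
    return (int64_t) value;
}
```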


365-407: Timestamp handling correctly documents the contract.

The comment clearly states that the schema's unit specifies the output unit, and the input must already be in that unit. This is a reasonable design choice that puts the responsibility on the user to provide correctly-scaled timestamps.


639-728: On-the-fly record processing avoids dangling pointer issue.

The previous review flagged that storing msgpack_object pointers was unsafe. This implementation now processes each record immediately within the unpack loop, correctly avoiding the dangling pointer issue.


730-760: Fluent Bit internal format [timestamp, record] correctly handled.

The code properly detects the two-element array format and processes only the record portion (index 1), which aligns with Fluent Bit's internal data representation.
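
A minimal illustration of that detection using the msgpack-c object API; the helper name is illustrative:

```c
#include <msgpack.h>

/* Fluent Bit's internal representation wraps each record as
 * [timestamp, record]; when a two-element array is seen, only the record
 * portion (index 1) is handed to the Arrow builders. */
static const msgpack_object *select_record(const msgpack_object *obj)
{
    if (obj->type == MSGPACK_OBJECT_ARRAY && obj->via.array.size == 2) {
        return &obj->via.array.ptr[1];   /* the record map */
    }
    return obj;                          /* already a bare record */
}
```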


945-957: Exception handling with proper memory cleanup.

Both parquet::ParquetException and std::exception are caught, with output_buffer being freed if allocated. This ensures no memory leaks on error paths.


51-61: No action required — function signature is correct. The call to flb_msgpack_to_json_str(256, obj, FLB_FALSE) matches the actual function signature declared in include/fluent-bit/flb_pack.h: char* flb_msgpack_to_json_str(size_t size, const msgpack_object *obj, int escape_unicode). All parameters are correctly typed and compatible.

Likely an incorrect or invalid review comment.
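
For context, a small C-side usage sketch of that signature; the helper name is illustrative, and the Parquet implementation itself forward-declares the function instead of including flb_pack.h, as noted above:

```c
#include <fluent-bit/flb_pack.h>
#include <fluent-bit/flb_mem.h>
#include <fluent-bit/flb_macros.h>

/* flb_msgpack_to_json_str() returns a heap-allocated JSON string (or NULL);
 * the caller is responsible for releasing it. */
static void log_record_as_json(const msgpack_object *obj)
{
    char *json;

    json = flb_msgpack_to_json_str(256, obj, FLB_FALSE);
    if (json != NULL) {
        /* ... attach the JSON string to a log or error message ... */
        flb_free(json);
    }
}
```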

plugins/out_s3/s3.c (14)

42-44: LGTM: Parquet header properly guarded.

The conditional include is correctly guarded with FLB_HAVE_PARQUET_ENCODER, consistent with the codebase pattern for optional features.


49-51: LGTM: Forward declaration properly added.

Forward declaration is necessary and signature matches the implementation at line 3355.


1137-1272: LGTM: Comprehensive multipart upload implementation with proper error handling.

The refactored s3_multipart_upload_complete function demonstrates excellent error handling:

  • Heap allocation prevents stack overflow for large structures
  • All allocations have corresponding cleanup paths
  • Sequence index updates include rollback on failure
  • Upload failures trigger abort and cleanup
  • Memory leaks are properly avoided

The function correctly receives pre-compressed data and handles multipart upload logic cleanly.


1274-1343: LGTM: Unified upload interface with clear decision logic.

The function properly:

  • Applies compression once at the entry point (avoiding duplicate compression)
  • Skips outer compression for Parquet format (which has internal compression)
  • Respects user's use_put_object configuration
  • Provides helpful warnings for potentially inefficient usage
  • Manages memory consistently with SDS buffers

The warning at lines 1327-1332 correctly identifies that multipart upload for files < 5MB is inefficient but will still work (as a single-part multipart upload).


1351-1430: LGTM: Buffer management updated to use SDS consistently.

The refactored put_all_chunks function:

  • Correctly initializes buffer to NULL for each chunk
  • Uses the unified s3_upload_file interface
  • Properly destroys SDS buffers after use
  • Handles return code conversion appropriately

Memory management is correct throughout.


1571-1572: LGTM: Signature updated for SDS buffer consistency.

The parameter type change from char *body to flb_sds_t body is consistent with the codebase refactoring to use SDS buffers throughout. The implementation correctly handles the SDS type.


1810-1858: LGTM: Upload request properly refactored with unified interface.

The refactored send_upload_request function:

  • Correctly determines file_first_log_time from upload_file or current time
  • Uses the unified s3_upload_file interface consistently
  • Properly manages SDS buffer lifecycle
  • Handles success/failure paths with appropriate cleanup
  • Returns correct Fluent Bit return codes (FLB_OK/FLB_RETRY)

3278-3353: LGTM: Timer callback updated with improved logging and unified upload interface.

The cb_s3_upload callback improvements:

  • Added informative log message when upload_timeout is reached (line 3310)
  • Checks retry limit before attempting upload (line 3314)
  • Uses unified s3_upload_file interface consistently
  • Properly manages SDS buffer lifecycle
  • Handles success/failure cases with appropriate cleanup and logging
  • Correctly nullifies buffer pointer after destruction (line 3337)

3510-3554: LGTM: Unit test flush function updated consistently.

The unit_test_flush function updates:

  • Uses unified s3_upload_file interface
  • Properly manages SDS buffer lifecycle
  • Handles success/failure with appropriate cleanup
  • Consistent with the refactoring throughout the file

3780-3789: LGTM: Excellent design decision to defer format conversion.

Storing raw msgpack data and deferring format conversion until upload time (line 3780-3789) is a smart optimization:

  • Avoids repeated conversions when data is buffered
  • Enables better batching for columnar formats like Parquet
  • Simplifies the buffering logic
  • Reduces CPU overhead during high-throughput scenarios

The comment clearly explains the rationale.


3894-3951: LGTM: New upload path provides cleaner direct upload flow.

The new preserve_data_ordering=false path (lines 3894-3951):

  • Provides a simpler, more efficient upload flow
  • Buffers the final chunk, constructs the complete buffer, and uploads immediately
  • Eliminates queue overhead when ordering is not required
  • Includes excellent user feedback with specific logging for timeout vs size thresholds
  • Properly manages memory with SDS buffers

This is a significant improvement over the queue-based approach when ordering guarantees aren't needed.


4075-4090: LGTM: Configuration map entries properly structured.

The new configuration entries:

  • compression: Correctly uses offset 0 for manual parsing (lines 689-717), avoiding the type mismatch issue
  • format: Provides clear documentation distinguishing json vs parquet with appropriate defaults
  • schema_str: Properly stores directly to context with helpful example in description

The descriptions are clear and provide sufficient guidance for users. The manual parsing approach for compression and format is appropriate given the complex validation logic.
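
For illustration, the three entries would take roughly this shape in the plugin's config map; descriptions are abbreviated and the exact wording and defaults in s3.c may differ:

```c
#include <stddef.h>
#include <fluent-bit/flb_output_plugin.h>
#include "s3.h"

static struct flb_config_map config_map[] = {
    {
     FLB_CONFIG_MAP_STR, "compression", NULL,
     0, FLB_FALSE, 0,                      /* offset 0: parsed manually at init */
     "Compression algorithm: gzip, snappy or zstd"
    },
    {
     FLB_CONFIG_MAP_STR, "format", "json",
     0, FLB_FALSE, 0,                      /* offset 0: parsed manually at init */
     "Output format: json (default) or parquet"
    },
    {
     FLB_CONFIG_MAP_STR, "schema_str", NULL,
     0, FLB_TRUE, offsetof(struct flb_s3, schema_str),
     "JSON schema string, required when format=parquet"
    },
    /* ... remaining entries ... */
    {0}
};
```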


4099-4105: LGTM: Configuration documentation significantly improved.

The updated descriptions provide much clearer guidance:

  • store_dir (4099-4105): Explains the buffering behavior difference between PutObject and multipart modes
  • use_put_object (4149-4152): Clearly recommends PutObject as the default, explains the 5GB limit, and advises when to use multipart

These improvements will help users configure the plugin correctly for their use cases.

Also applies to: 4149-4152


4241-4246: LGTM: Upload timeout configuration properly documented.

The upload_timeout entry:

  • Has a reasonable default of 30 minutes
  • Correctly describes behavior for both PutObject and multipart modes
  • Aligns with the unified upload interface implementation

Signed-off-by: Arbin Cheng <[email protected]>
Signed-off-by: Arbin <[email protected]>
@kalavt kalavt force-pushed the feature/aws-s3-parquet branch from 37db121 to 69598e3 on December 29, 2025 16:32
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
examples/s3_output/README.md (1)

1-834: LGTM: Comprehensive production documentation.

This documentation provides extensive best practices, architecture guidance, configuration examples, and troubleshooting information for S3 output with Parquet support. The examples cover different traffic profiles (low/medium/high) with appropriate format and compression choices.

For enhanced readability, consider adding language specifiers to fenced code blocks (e.g., `bash`, `sql`, `yaml`) where appropriate, as noted by the static analysis tool.

plugins/out_s3/s3.c (2)

1145-1280: Consider the performance implications of synchronous multipart uploads.

The multipart upload implementation at lines 1217-1245 uploads parts sequentially in a synchronous loop. For files approaching the maximum size (5GB), this will block the worker thread for an extended period.

While the synchronous design is intentional (see line 1026 comment about S3 requiring sync mode), consider:

  • Adding progress logging for large uploads (e.g., every N parts)
  • Documenting the expected behavior for large file uploads in user-facing documentation

This is not a critical issue but could impact user experience with large files.


1498-1546: Verify necessity of runtime schema_str validation.

Lines 1504-1512 perform a runtime check for schema_str when format=parquet, but initialization already validates this requirement at lines 749-752 (fail-fast during init).

This defensive check:

  • Is safe but redundant since invalid config cannot reach runtime
  • Increments chunk->failures counter which may be unnecessary
  • Could cause confusion during debugging

Consider removing the runtime check or converting it to an assertion/debug check since this condition should never occur in production.

src/aws/flb_aws_compress.c (1)

53-54: Clarify that PARQUET has also been removed from compression options.

Line 54 states "PARQUET: Valid file format converter (deprecated: use format=parquet instead)" but PARQUET is not present in the compression_options array below (lines 57-78). This is inconsistent with the implementation.

Since PARQUET has been removed from the compression pathway (just like ARROW), consider updating the comment to:

*    - PARQUET: REMOVED from compression options (use format=parquet instead for file format conversion)

This clarifies that while Parquet is still supported as an output format via the format parameter, it's no longer available through the compression API.

🔎 Suggested comment clarification
  * 2. Format converters:
  *    - ARROW: REMOVED - Arrow support has been removed as it was not a proper file format for S3
- *    - PARQUET: Valid file format converter (deprecated: use format=parquet instead)
+ *    - PARQUET: REMOVED from compression options (use format=parquet instead for file format conversion)
  * 3. Supported S3 output formats: json (FLB_S3_FORMAT_JSON), parquet (FLB_S3_FORMAT_PARQUET)
include/fluent-bit/aws/flb_aws_compress.h (1)

33-46: Consider clarifying the backward compatibility behavior of deprecated values.

The documentation states "Valid file format: PARQUET (5)" and keeps both ARROW=4 and PARQUET=5 for backward compatibility. However, these values are no longer present in the compression_options array in flb_aws_compress.c, which means calling flb_aws_compression_compress() with these values will fail.

While keeping these enum values prevents compilation errors, calling the compression API with compression_type=4 or 5 will result in a runtime error. Consider adding a note about this runtime behavior to prevent confusion:

/*
 * File format conversion (NOT compression algorithms)
 *
 * DEPRECATED: FLB_AWS_COMPRESS_ARROW (4)
 *   - Arrow is not a proper file format for S3
 *   - This value is kept only for backward compatibility to avoid compilation errors
 *   - DO NOT USE in new code - will fail at runtime if passed to compression functions
 *
 * DEPRECATED: FLB_AWS_COMPRESS_PARQUET (5)
 *   - Use format=parquet instead of compression=parquet (deprecated usage)
 *   - This value is kept for backward compatibility but will fail if passed to compression functions
 *   - Supported S3 output formats: json (FLB_S3_FORMAT_JSON), parquet (FLB_S3_FORMAT_PARQUET)
 */
🔎 Suggested documentation improvement
  * File format conversion (NOT compression algorithms)
  *
  * DEPRECATED: FLB_AWS_COMPRESS_ARROW (4)
  *   - Arrow is not a proper file format for S3
  *   - This value is kept only for backward compatibility to avoid compilation errors
- *   - DO NOT USE in new code
+ *   - DO NOT USE in new code - will fail at runtime if passed to compression functions
  *
- * Valid file format: PARQUET (5)
+ * DEPRECATED: FLB_AWS_COMPRESS_PARQUET (5)
  *   - Use format=parquet instead of compression=parquet (deprecated usage)
+ *   - This value is kept for backward compatibility but will fail if passed to compression functions
  *   - Supported S3 output formats: json (FLB_S3_FORMAT_JSON), parquet (FLB_S3_FORMAT_PARQUET)
  */
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37db121 and 69598e3.

📒 Files selected for processing (18)
  • .github/workflows/unit-tests.yaml
  • CMakeLists.txt
  • cmake/parquet.cmake
  • examples/s3_output/README.md
  • examples/s3_output/fluent-bit-direct-s3.conf
  • examples/s3_output/fluent-bit-kafka-sink-s3.conf
  • include/fluent-bit/aws/flb_aws_compress.h
  • plugins/out_s3/s3.c
  • plugins/out_s3/s3.h
  • plugins/out_s3/s3_multipart.c
  • src/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/CMakeLists.txt
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • src/aws/compression/arrow/compress.h
  • src/aws/flb_aws_compress.c
  • tests/runtime/out_s3.c
💤 Files with no reviewable changes (5)
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.h
  • src/aws/compression/CMakeLists.txt
  • src/aws/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
🚧 Files skipped from review as they are similar to previous changes (3)
  • .github/workflows/unit-tests.yaml
  • cmake/parquet.cmake
  • examples/s3_output/fluent-bit-direct-s3.conf
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.

Applied to files:

  • CMakeLists.txt
  • src/CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.

Applied to files:

  • CMakeLists.txt
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.

Applied to files:

  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
  • plugins/out_s3/s3.c
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.

Applied to files:

  • CMakeLists.txt
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.

Applied to files:

  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
  • plugins/out_s3/s3.c
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.

Applied to files:

  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.

Applied to files:

  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.

Applied to files:

  • CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/flb_aws_compress.c
  • src/CMakeLists.txt
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.

Applied to files:

  • CMakeLists.txt
  • src/CMakeLists.txt
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.

Applied to files:

  • tests/runtime/out_s3.c
  • plugins/out_s3/s3.c
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.

Applied to files:

  • src/CMakeLists.txt
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.

Applied to files:

  • examples/s3_output/README.md
📚 Learning: 2025-09-04T12:35:36.904Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:3275-3282
Timestamp: 2025-09-04T12:35:36.904Z
Learning: The out_s3 plugin intentionally uses a simple numeric comparison for retry_limit (chunk->failures >= ctx->ins->retry_limit) rather than the standard Fluent Bit pattern that checks for FLB_OUT_RETRY_UNLIMITED (-1). The maintainer wants to keep this current behavior for consistency within the plugin.

Applied to files:

  • plugins/out_s3/s3.c
📚 Learning: 2025-09-04T12:32:46.030Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:0-0
Timestamp: 2025-09-04T12:32:46.030Z
Learning: In Fluent Bit plugins, avoid creating plugin-local config_map entries that duplicate core functionality. The core already provides Retry_Limit parsed into ins->retry_limit, so plugins should use that directly rather than defining their own retry_limit configuration option.

Applied to files:

  • plugins/out_s3/s3.c
🧬 Code graph analysis (2)
tests/runtime/out_s3.c (1)
src/flb_lib.c (1)
  • flb_output_set (520-551)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
  • s3_put_object (1579-1716)
🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md

7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


91-91: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


130-130: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


135-135: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


140-140: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


149-149: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


162-162: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


292-292: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


814-814: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (23)
plugins/out_s3/s3_multipart.c (1)

465-466: LGTM: Logging adjustments reduce production noise.

The logging level changes move successful operation messages from info to debug level, which is appropriate for reducing verbosity in production environments while maintaining detailed logs for troubleshooting.

Also applies to: 621-623, 753-754, 766-768, 776-777

examples/s3_output/fluent-bit-kafka-sink-s3.conf (1)

1-356: LGTM: Well-documented example configuration.

The configuration file provides clear examples for different traffic profiles (low/medium/high) with appropriate format choices (JSON vs. Parquet), buffering strategies, and comprehensive inline documentation. The filter chain for Kafka payload processing is correctly structured.

tests/runtime/out_s3.c (1)

38-38: LGTM: Test configurations updated for explicit upload mode selection.

The test updates explicitly configure use_put_object to distinguish between multipart upload tests (false) and PutObject tests (true). The increased total_file_size (10M) and added upload_chunk_size (5M) align with the updated API and constants in s3.h.

Also applies to: 74-75, 112-113, 151-151, 187-187, 223-223

plugins/out_s3/s3.h (3)

44-45: Correct AWS S3 PutObject limit updated to 5GB.

The updated constant correctly reflects AWS S3's PutObject API maximum object size of 5GB. Objects larger than this require multipart upload.


193-200: LGTM: Format selection infrastructure added.

The new format field and FLB_S3_FORMAT_JSON/FLB_S3_FORMAT_PARQUET macros provide clean format selection for S3 outputs. The schema_str field stores the user-defined JSON schema for Parquet format as documented in the README.


230-231: s3_put_object signature updated for SDS-based buffers.

The signature change from char *buffer to flb_sds_t buffer with explicit size_t size aligns with the SDS-based buffer management introduced in this PR. This is consistent with the implementation in s3.c (lines 1578-1715).

src/CMakeLists.txt (2)

157-166: LGTM: Parquet encoder C++ source properly configured.

The Parquet encoder integration correctly:

  • Guards the C++ implementation with FLB_PARQUET_ENCODER
  • Explicitly marks flb_parquet_impl.cpp as CXX source
  • Includes clear comments about C++ requirement

403-410: LGTM: Arrow/Parquet libraries linked conditionally.

The dependency wiring correctly adds ${ARROW_LIBRARIES} and ${PARQUET_LIBRARIES} to FLB_DEPS only when FLB_PARQUET_ENCODER is enabled, consistent with the detection logic in cmake/parquet.cmake.

CMakeLists.txt (2)

202-202: LGTM: Parquet encoder properly integrated with C++17.

The Parquet encoder integration correctly:

  • Adds FLB_PARQUET_ENCODER as an opt-in option (not enabled by FLB_ALL)
  • Enables C++ language support and sets C++17 standard (required by Arrow/Parquet)
  • Includes cmake/parquet.cmake for library detection
  • Defines FLB_HAVE_PARQUET_ENCODER only when both Arrow and Parquet are found
  • Gracefully disables the feature with warnings if dependencies are missing

The C++17 requirement addresses the previous review comment about Apache Arrow/Parquet compatibility.

Also applies to: 1337-1355


1077-1134: Improved libyaml detection for macOS Homebrew installations.

The enhanced detection logic adds automatic discovery of Homebrew-installed libyaml on macOS by querying brew --prefix libyaml. This improves the developer experience on macOS where Homebrew is commonly used for dependency management.

The fallback chain is well-structured:

  1. Try pkg-config first
  2. Check user-specified FLB_LIBYAML_DIR
  3. Auto-detect Homebrew installation on macOS
  4. Fall back to compile test
  5. Fail with clear error message if not found
plugins/out_s3/s3.c (9)

42-51: LGTM: Parquet includes and forward declarations are properly guarded.

The conditional Parquet include and forward declaration are correctly structured.


686-768: LGTM: Compression and format configuration handling is well-implemented.

The configuration logic correctly:

  • Handles legacy compression=arrow|parquet values with deprecation warnings and defaults to GZIP compression for Parquet
  • Validates that schema_str is required for Parquet format during initialization (fail-fast)
  • Provides helpful guidance when Parquet is used without compression
  • Includes clear error messages for unavailable formats based on build configuration

Based on learnings, this addresses previous concerns about compression defaults and early validation.


1282-1351: LGTM: Unified upload interface correctly handles compression and upload method selection.

The implementation correctly:

  • Applies compression once at the entry point for both PutObject and Multipart uploads (efficient)
  • Skips outer compression for Parquet format since it uses internal compression (line 1298)
  • Provides helpful warnings for inefficient configurations (lines 1335-1340)
  • Manages memory consistently with proper SDS buffer cleanup
  • Respects user configuration for upload method selection

1514-1543: LGTM: Parquet buffer memory management is correct.

The Parquet conversion properly handles memory ownership:

  1. Arrow/Parquet library allocates parquet_buffer (line 1514)
  2. Creates SDS copy for consistent memory management (line 1532)
  3. Frees original Arrow-allocated buffer (line 1533)
  4. Returns SDS buffer to caller (line 1573)

Error handling correctly cleans up resources on failure.
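
A condensed sketch of that ownership hand-off; variable and helper names are illustrative:

```c
#include <fluent-bit/flb_mem.h>
#include <fluent-bit/flb_sds.h>

/* The Parquet encoder returns a heap-allocated buffer; copying it into an
 * SDS buffer lets the rest of the upload path manage a single buffer type,
 * after which the encoder-allocated buffer is released. */
static flb_sds_t wrap_parquet_output(void *parquet_buffer, size_t parquet_size)
{
    flb_sds_t out;

    out = flb_sds_create_len((const char *) parquet_buffer, (int) parquet_size);
    flb_free(parquet_buffer);

    return out;    /* NULL on allocation failure */
}
```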


3788-3797: LGTM: Storing raw msgpack data enables efficient format conversion.

The change to store raw msgpack data (line 3792) instead of pre-converting to JSON is a good design decision that:

  • Enables better batching for columnar formats like Parquet
  • Avoids repeated conversions when data is buffered
  • Supports multiple output formats from the same buffer
  • Defers format conversion until upload time

The comment clearly explains this rationale.


3869-3960: LGTM: Dual upload paths correctly implement different ordering guarantees.

The implementation correctly provides two upload strategies:

preserve_data_ordering=true (lines 3872-3900):

  • Buffers chunk, locks file, adds to queue
  • Queue timer handles actual upload to ensure ordering
  • Necessary for strict ordering requirements

preserve_data_ordering=false (lines 3902-3959):

  • Buffers chunk, reads complete data, uploads immediately
  • Simpler and more efficient (no queue overhead)
  • Default behavior for most use cases

Both paths have proper error handling, cleanup, and informative logging.


4042-4049: LGTM: Configuration descriptions clearly explain buffer size parameters.

The updated descriptions for total_file_size and upload_chunk_size clearly explain:

  • total_file_size: The buffer threshold that triggers upload
  • upload_chunk_size: Only relevant for multipart uploads, ignored when use_put_object=true

This helps users understand the difference between these two parameters and their use cases.


4083-4097: LGTM: New configuration options are well-documented.

The new configuration entries provide clear guidance:

  • compression: Lists supported values and notes deprecated legacy values
  • format: Explains JSON vs Parquet with use case guidance for analytics
  • schema_str: Clearly marked as required for Parquet with a helpful example

The descriptions will help users configure Parquet support correctly.


4249-4254: LGTM: upload_timeout configuration is clear and well-documented.

The new upload_timeout parameter:

  • Has a reasonable default (30 minutes)
  • Clearly explains the timeout behavior
  • Applies to both PutObject and multipart upload methods
src/aws/flb_aws_compress.c (3)

31-37: LGTM - Wrapper adapts snappy signature correctly.

The wrapper function properly adapts the flb_snappy_compress signature to match the compression API's function pointer type. The casts from void * to char * are safe for byte buffers.


59-76: LGTM - Compression options correctly reflect true compression algorithms.

The array now correctly contains only true compression algorithms (gzip, snappy, zstd) and properly removes ARROW/PARQUET entries, aligning with the design goal of separating compression from file format conversion.


26-26: Remove this review comment—Snappy is a mandatory, unconditional dependency in Fluent Bit.

Snappy is bundled in the source tree at lib/snappy-fef67ac and built unconditionally via add_subdirectory() in CMakeLists.txt. Unlike optional components (Arrow/Parquet), Snappy does not require conditional guards like #ifdef FLB_HAVE_SNAPPY. The include at line 26 follows the established pattern throughout the codebase: all core files and plugins include <fluent-bit/flb_snappy.h> without guards, just as with ZSTD.

Likely an incorrect or invalid review comment.

include/fluent-bit/aws/flb_aws_compress.h (1)

24-31: LGTM - Compression algorithm values are well-defined.

The enum values for true compression algorithms (GZIP=1, SNAPPY=2, ZSTD=3) are clearly documented and sequential.

