Skip to content

update: add custom field p_format_verified #1303

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 26, 2025

Conversation

nikhilsinhaparseable
Copy link
Contributor

@nikhilsinhaparseable nikhilsinhaparseable commented Apr 25, 2025

if server detects known format, add p_format_verified=true to the event
if schema detection fails, add p_format_verified=false to the event

Summary by CodeRabbit

  • New Features

    • Added a verification status field to custom fields, indicating whether schema extraction was successful for each event.
  • Refactor

    • Streamlined the handling and propagation of custom fields during event ingestion for improved consistency.
    • Updated function signatures to use references for better performance and clarity.
  • Bug Fixes

    • Improved pattern matching in log format parsing to enhance accuracy of extracted fields.
    • Enhanced flexibility of log format patterns to support additional log variants.

Copy link

coderabbitai bot commented Apr 25, 2025

"""

Walkthrough

This change refactors the event ingestion and schema extraction workflow to consistently use a mutable custom fields map (p_custom_fields) for tracking schema verification status. Instead of returning errors when schema extraction fails, the code now inserts a verification key (FORMAT_VERIFY_KEY) with a value of "true" or "false" into the custom fields. The function signatures for extracting custom fields from HTTP headers and inline logs are updated to use references, ensuring custom fields are propagated and updated throughout the ingestion pipeline. A new constant for the verification key is introduced.

Changes

File(s) Change Summary
src/event/format/known_schema.rs Modified schema extraction to insert FORMAT_VERIFY_KEY with "true" or "false" into custom fields instead of returning errors. Updated extract_from_inline_log signature to accept a mutable custom fields map.
src/event/mod.rs Added new public constant FORMAT_VERIFY_KEY for schema verification status.
src/handlers/http/ingest.rs Updated ingestion functions to extract, propagate, and update mutable custom fields throughout the pipeline. Unified header extraction via reference.
src/handlers/http/modal/utils/ingest_utils.rs Changed get_custom_fields_from_header to accept a reference to HttpRequest. Updated tests accordingly.
resources/formats.json Modified regex pattern for the "alb_log" format, broadening matching criteria for user_agent, ssl_cipher, and ssl_protocol fields; added two new alternative regex patterns for "alb_log".

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTPHandler
    participant EventProcessor
    participant SchemaDefinition

    Client->>HTTPHandler: Send event with headers
    HTTPHandler->>HTTPHandler: Extract custom fields via get_custom_fields_from_header(&req)
    HTTPHandler->>EventProcessor: extract_from_inline_log(json, &mut p_custom_fields, ...)
    EventProcessor->>SchemaDefinition: check_or_extract(...)
    alt Pattern matches
        SchemaDefinition->>EventProcessor: Return extracted fields
        EventProcessor->>p_custom_fields: Insert FORMAT_VERIFY_KEY = "true"
    else Pattern does not match
        SchemaDefinition->>EventProcessor: Extraction fails
        EventProcessor->>p_custom_fields: Insert FORMAT_VERIFY_KEY = "false"
    end
    HTTPHandler->>HTTPHandler: flatten_and_push_logs(..., p_custom_fields)
Loading

Possibly related PRs

Suggested labels

for next release

Suggested reviewers

  • de-sh

Poem

A bunny hopped through fields of code,
With custom keys in every load.
No more errors, just a sign,
"Verified!" or "Not this time."
Fields are checked and marked with care,
Now every log is well aware! 🐇
"""


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f217b3d and 226bb5f.

📒 Files selected for processing (1)
  • resources/formats.json (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • resources/formats.json
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/event/format/known_schema.rs (1)

125-224: Suggest updating unit tests for verification status

The test cases should be updated to verify that the verification status is correctly added to both the JSON objects and the custom fields map.

You could add assertions to check for the presence and value of the p_format_verified field in your existing test cases, particularly in tests like test_apache_log_extraction, test_no_match, and test_extract_from_inline_log_object.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 49ea3bd and 0a3ae7a.

📒 Files selected for processing (4)
  • src/event/format/known_schema.rs (4 hunks)
  • src/event/mod.rs (1 hunks)
  • src/handlers/http/ingest.rs (7 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/handlers/http/ingest.rs (1)
src/handlers/http/modal/utils/ingest_utils.rs (1)
  • get_custom_fields_from_header (148-206)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage
🔇 Additional comments (16)
src/event/mod.rs (1)

41-41: Clean implementation of the verification key constant

The new constant FORMAT_VERIFY_KEY with value "p_format_verified" is well-named and properly aligned with other event-related key constants in this file. This is a good pattern for centralizing the constant definition to avoid string duplication.

src/handlers/http/modal/utils/ingest_utils.rs (2)

148-148: Good API improvement: Using reference instead of ownership

Modifying the function to accept a reference to HttpRequest rather than taking ownership is a better practice since the function doesn't need to consume the request.


220-220: Consistently updated test cases

All test calls have been properly updated to pass references instead of moving ownership of the request, maintaining consistency with the function signature change.

Also applies to: 233-233, 246-246, 258-258

src/handlers/http/ingest.rs (7)

84-85: Good approach to initialize custom fields early

Initializing the mutable p_custom_fields at this point allows the verification status to be tracked throughout the ingestion process. The mutability is necessary as the format verification will update this map.


87-92: Consistent parameter passing for schema verification

The updated parameter passing to extract_from_inline_log correctly passes a mutable reference to p_custom_fields, enabling the verification status to be updated during schema extraction.


206-206: Unified parameter passing in OTEL logs handler

The get_custom_fields_from_header call now consistently passes a reference to the request, matching the updated function signature.


264-264: Unified parameter passing in OTEL metrics handler

Same improvement as in the logs handler, maintaining consistency across all ingestion handlers.


323-323: Unified parameter passing in OTEL traces handler

Same improvement as in the other handlers, completing the consistent pattern across all ingestion routes.


371-371: Mutable custom fields in post_event function

The custom fields map is properly initialized as mutable to support the format verification updates.


377-382: Improved error handling approach

Instead of returning errors when schema extraction fails, the code now marks the verification status in the custom fields map. This allows the ingestion pipeline to continue processing events even when format extraction fails.

src/event/format/known_schema.rs (6)

27-27: Properly imported verification key constant

Using the constant from the event module maintains consistency and centralizes the definition.


125-130: Schema verification success tracking

When pattern matching succeeds, the code correctly adds the verification key with value "true" to mark successful format verification.


135-139: Schema verification failure tracking

When pattern matching fails, the code properly adds the verification key with value "false" to indicate format verification failure.


195-195: Updated function signature to support verification

The extract_from_inline_log method now accepts a mutable reference to the custom fields map, allowing it to update the map with verification status.


213-215: Format verification for array events

For events in an array where schema extraction fails, the verification status is appropriately recorded in the custom fields map.


222-224: Format verification for object events

Similar to array events, object events that fail schema extraction are also properly marked with the verification status.

coderabbitai[bot]
coderabbitai bot previously approved these changes Apr 25, 2025
if server detects known format, add `p_format_verified=true` to the event
if schema detection fails, add `p_format_verified=false` to the event
coderabbitai[bot]
coderabbitai bot previously approved these changes Apr 26, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
resources/formats.json (1)

113-149: Second alb_log alternative pattern is valid; consider simplifying alternations
This new, more permissive pattern broadens support for missing or variable-length fields and constrains track_id to the TID_[a-f0-9]+ format. The fields array matches the named groups exactly. One optional refactor: the alternation [^ ]+|[^ ]* can be simplified to [^ ]* without changing semantics, reducing regex verbosity.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1675057 and f217b3d.

📒 Files selected for processing (1)
  • resources/formats.json (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (1)
resources/formats.json (1)

75-112: First alb_log regex refinement looks correct
The updated pattern correctly captures client and target IP/port pairs, introduces non-greedy user_agent matching, and aligns perfectly with the fields list. The named groups and order are consistent, and the anchoring remains proper.

coderabbitai[bot]
coderabbitai bot previously approved these changes Apr 26, 2025
@nitisht nitisht merged commit a90ed13 into parseablehq:main Apr 26, 2025
14 checks passed
@nikhilsinhaparseable nikhilsinhaparseable deleted the schema-unverified branch April 27, 2025 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants