Calcite patterns command brain pattern method #3570

Conversation

songkant-aws
Contributor

@songkant-aws songkant-aws commented Apr 22, 2025

Description

This aims to resolve #3569

The BRAIN pattern method of the patterns command is implemented with a combination of a UDF and a UDAF.

Related Issues

Resolves #3569

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Songkan Tang <[email protected]>
@songkant-aws songkant-aws requested review from LantaoJin and penghuo May 26, 2025 09:33
Comment on lines 45 to 48
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be simple_pattern. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
Collaborator

if it is optional, add default value in doc.

Contributor Author

Done. Added default value

* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
* pattern_buffer_limit: optional. This is a special safeguard parameter for BRAIN algorithm to limit internal temporary buffer to hold processed logs.
Collaborator

remove "special"

Contributor Author

Done

Comment on lines 60 to 61
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
* pattern_buffer_limit: optional. This is a special safeguard parameter for BRAIN algorithm to limit internal temporary buffer to hold processed logs.
Collaborator

if it is optional, add default value in doc.

Contributor Author

Done. Added default value.


or

patterns [new_field=<new-field-name>] [pattern=<pattern>] <field> SIMPLE_PATTERN
patterns <field> [by byClause...] pattern_method=SIMPLE_PATTERN [pattern_mode=LABEL | AGGREGATION] [pattern_max_sample_count=integer] [new_field=<new-field-name>] [pattern=<pattern>]
Collaborator

nit: all options are under the patterns command, so we can simplify them, for instance pattern_method -> method and pattern_mode -> mode.

Contributor Author

Renamed those options


* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
Collaborator

brain or BRAIN, is it case-sensitive?

Member

I think it should be case-insensitive. @songkant-aws please double-check. I suggest using lowercase in the syntax doc.

Contributor Author

It's case-insensitive. The values are now all lowercase in the syntax doc.
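
One way to get the case-insensitive behaviour described above is to normalize the option value before mapping it to an enum. The following is only a sketch under assumptions: the class `PatternMethodSketch`, the enum `PatternMethod`, and the `parse` helper are hypothetical names for illustration and are not taken from this PR. Only the two method names, simple_pattern and brain, come from the docs under review.

```java
import java.util.Locale;

/** Hypothetical illustration; not the parser code from this PR. */
final class PatternMethodSketch {
  /** The two method names mentioned in the syntax doc. */
  enum PatternMethod { SIMPLE_PATTERN, BRAIN }

  /**
   * Maps user input such as "brain", "BRAIN" or "Simple_Pattern" to the enum.
   * Locale.ROOT keeps upper-casing stable regardless of the JVM default locale.
   */
  static PatternMethod parse(String raw) {
    return PatternMethod.valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(parse("brain"));
    System.out.println(parse("SIMPLE_PATTERN"));
  }
}
```

An unknown value makes `valueOf` throw an `IllegalArgumentException`, so a real parser would probably want to translate that into a user-facing syntax error.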


* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
Collaborator

By default, it's simple_pattern if the setting plugins.ppl.default.pattern.method is not specified.

The default value is configured by the setting plugins.ppl.default.pattern.method.

Contributor Author

Done

* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
Collaborator

ditto, The default value is configured by the setting?

Contributor Author

Done

+-------------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| patterns_field | pattern_count | sample_logs |
|-------------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <*IP*> - <*> [<*>/Sep/<*>:<*>:<*>:<*> <*>] <*> <*> HTTP/<*><*>" <*> <*> | 4 | [177.95.8.74 - upton5450 [28/Sep/2022:10:15:57 -0700] "HEAD /e-business/mindshare HTTP/1.0" 404 19927,127.45.152.6 - pouros8756 [28/Sep/2022:10:15:57 -0700] "GET /architectures/convergence/niches/mindshare HTTP/1.0" 100 28722,118.223.210.105 - - [28/Sep/2022:10:15:57 -0700] "PATCH /strategize/out-of-the-box HTTP/1.0" 401 27439,210.204.15.104 - - [28/Sep/2022:10:15:57 -0700] "POST /users HTTP/1.1" 301 9481] |
Collaborator

  1. In aggregation mode, does the pattern command collect sample values of IP addresses?
  2. What is the output syntax of the pattern_fields? Can it be used directly in search queries?

Collaborator

  1. In the IT, the detected pattern is PacketResponder failed <token1> blk_<token2>. What does IP mean? Is it a token?

Contributor Author

@songkant-aws songkant-aws Jun 9, 2025

  1. Yes, when Calcite is enabled, the sample logs are converted to sample tokens in different positions.
  2. pattern_field will be a string in a format like PacketResponder failed <token1> blk_<token2>, and tokens will be a map like {token1: [...], token2: [...]}. I'm not sure what you mean by using them directly in search queries.
  3. <*IP*> is one of the variable placeholders in BrainLogParser's output. Yes, it's a token in the V2 output format.

I have updated the syntax docs with more examples. When Calcite is enabled, the output is a pattern string with <token*> placeholders plus a map of the corresponding tokens. Users can use those two output columns in further queries.
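
To make the output shape described in this reply concrete, here is a minimal sketch that numbers the wildcards of a pattern and collects, per placeholder, the text each log contributes. It is not the PR's BrainLogParser or PatternUtils code: the class `TokenMapSketch`, both method names, and the sample log values are assumptions for illustration. It also assumes every log matches the pattern and that wildcards are separated by non-empty literal text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of the output shape; not the PR's implementation. */
final class TokenMapSketch {
  private static final String WILDCARD = "<*>";

  /** Rewrites each "<*>" as "<token1>", "<token2>", ... in order of appearance. */
  static String numberWildcards(String pattern) {
    StringBuilder out = new StringBuilder();
    int from = 0;
    int n = 0;
    int at;
    while ((at = pattern.indexOf(WILDCARD, from)) >= 0) {
      out.append(pattern, from, at).append("<token").append(++n).append('>');
      from = at + WILDCARD.length();
    }
    return out.append(pattern, from, pattern.length()).toString();
  }

  /** Collects, per placeholder, the text that each log has at that position. */
  static Map<String, List<String>> collectTokens(String pattern, List<String> logs) {
    // Literal text around the wildcards; limit -1 keeps a trailing empty literal.
    String[] literals = pattern.split(Pattern.quote(WILDCARD), -1);
    Map<String, List<String>> tokens = new LinkedHashMap<>();
    for (String log : logs) {
      int pos = literals[0].length(); // assumes the log starts with the first literal
      for (int i = 1; i < literals.length; i++) {
        String next = literals[i];
        boolean last = i == literals.length - 1;
        // The last wildcard runs up to the trailing literal; others up to the next literal.
        int end = last ? log.length() - next.length() : log.indexOf(next, pos);
        tokens.computeIfAbsent("token" + i, k -> new ArrayList<>()).add(log.substring(pos, end));
        pos = end + next.length();
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    String pattern = "PacketResponder failed <*> blk_<*>";
    List<String> logs = List.of(
        "PacketResponder failed 1 blk_100", "PacketResponder failed 2 blk_-200");
    System.out.println(numberWildcards(pattern));
    System.out.println(collectTokens(pattern, logs));
  }
}
```

The two return values correspond to the two output columns mentioned above: a pattern string with numbered placeholders and a map from placeholder name to sample values.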

Collaborator

"Not sure what does it mean by using them directly in search queries?"

For instance, as a user, if the detected pattern is <token1> - <token2> [<token3>/Sep/<token4>:<token5>:<token6>:<token7> <token8>] <token9> <token10> HTTP/<token11><token12>\" <token13> <token14>, I want to search for logs that match this pattern.
I think it cannot be used directly; the query would need to be revised, which is out of scope for this PR.
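
As an illustration of the kind of rewriting this would involve (explicitly out of scope for the PR), a detected pattern could be turned into a regular expression on the client side. This is a sketch under assumptions: `PatternSearchSketch`, `toRegex`, and `filter` are hypothetical names, and treating every placeholder as a lazy "any text" group is a simplification that can over-match compared with how the parser actually tokenizes logs.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch; the PR does not provide pattern-based search. */
final class PatternSearchSketch {
  private static final Pattern PLACEHOLDER = Pattern.compile("<token\\d+>");

  /** Quotes the literal parts of the pattern and turns each placeholder into a capture group. */
  static Pattern toRegex(String detectedPattern) {
    StringBuilder regex = new StringBuilder();
    Matcher m = PLACEHOLDER.matcher(detectedPattern);
    int from = 0;
    while (m.find()) {
      if (m.start() > from) {
        regex.append(Pattern.quote(detectedPattern.substring(from, m.start())));
      }
      regex.append("(.*?)"); // assumption: a token may be any text, including empty
      from = m.end();
    }
    if (from < detectedPattern.length()) {
      regex.append(Pattern.quote(detectedPattern.substring(from)));
    }
    return Pattern.compile(regex.toString());
  }

  /** Keeps only the logs that match the whole detected pattern. */
  static List<String> filter(String detectedPattern, List<String> logs) {
    Pattern p = toRegex(detectedPattern);
    return logs.stream().filter(log -> p.matcher(log).matches()).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> logs = List.of("PacketResponder failed 1 blk_42", "Receiving block blk_42");
    System.out.println(filter("PacketResponder failed <token1> blk_<token2>", logs));
  }
}
```

Quoting the literal segments matters for patterns like the access-log one above, which contain regex metacharacters such as `[` and `]`.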

Member

@LantaoJin LantaoJin left a comment

Please update the code base and address the latest comments.


Comment on lines 45 to 48
PATTERN_MODE: 'PATTERN_MODE';
PATTERN_METHOD: 'PATTERN_METHOD';
PATTERN_MAX_SAMPLE_COUNT: 'PATTERN_MAX_SAMPLE_COUNT';
PATTERN_BUFFER_LIMIT: 'PATTERN_BUFFER_LIMIT';
Member

How about simplifying the arguments by removing the pattern_ prefix?

Contributor Author

Removed pattern_ prefix

Comment on lines 28 to 31
DEFAULT_PATTERN_METHOD("plugins.ppl.default.pattern.method"),
DEFAULT_PATTERN_MODE("plugins.ppl.default.pattern.mode"),
DEFAULT_PATTERN_MAX_SAMPLE_COUNT("plugins.ppl.default.pattern.max.sample.count"),
DEFAULT_PATTERN_BUFFER_LIMIT("plugins.ppl.default.pattern.buffer.limit"),
Member

Can we remove the default part? It's meaningless IMO.
plugins.ppl.default.pattern.method -> plugins.ppl.pattern.method

Contributor Author

Removed default part

* max_sample_count: optional. Max sample logs returned per pattern in aggregation mode (default: 10). The max_sample_count is configured by the setting ``plugins.ppl.pattern.max.sample.count``.
* buffer_limit: optional. Safeguard parameter for ``brain`` algorithm to limit internal temporary buffer size (default: 100,000, min: 50,000). The buffer_limit is configured by the setting ``plugins.ppl.pattern.buffer.limit``.
* new_field: Alias of the output pattern field. (default: "patterns_field").
* algorithm parameters: optional. Algorithm-specific tuning:
Collaborator

* new_field: Alias of the output pattern field. (default: "patterns_field").
* algorithm parameters: optional. Algorithm-specific tuning:
- ``simple_pattern`` : Define regex via "pattern".
- ``brain`` : Adjust sensitivity with variable_count_threshold (int > 0) and frequency_threshold_percentage (double 0.0 - 1.0).
Collaborator

nit: explain what variable_count_threshold and frequency_threshold_percentage are.

import org.opensearch.sql.common.patterns.PatternUtils.ParseResult;

public class LogPatternAggFunction implements UserDefinedAggFunction<LogParserAccumulator> {
private int bufferLimit = 100000;
Collaborator

Is bufferLimit needed?
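
For context on what a buffer limit can guard against, here is a minimal sketch of an accumulator that folds buffered messages into a running result whenever the buffer fills, so memory use stays bounded. It is not the PR's LogPatternAggFunction or LogParserAccumulator: the class `BoundedBufferSketch`, the flush-on-limit behaviour, and the stand-in "group by first word" step are all assumptions for illustration. The sketch is also single-threaded and makes no thread-safety guarantees.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a buffer-limited accumulator; not the PR's implementation. */
final class BoundedBufferSketch {
  private final int bufferLimit;
  private final List<String> buffer = new ArrayList<>();
  private final Map<String, Long> counts = new HashMap<>();

  BoundedBufferSketch(int bufferLimit) {
    if (bufferLimit <= 0) {
      throw new IllegalArgumentException("bufferLimit must be positive");
    }
    this.bufferLimit = bufferLimit;
  }

  /** Buffers one message and folds the buffer into the result once the limit is hit. */
  void add(String logMessage) {
    buffer.add(logMessage);
    if (buffer.size() >= bufferLimit) {
      flush();
    }
  }

  /** Number of messages currently held in the temporary buffer. */
  int bufferedSize() {
    return buffer.size();
  }

  /** Folds any remaining messages and returns the aggregated counts. */
  Map<String, Long> result() {
    flush();
    return counts;
  }

  private void flush() {
    for (String message : buffer) {
      // Stand-in for real pattern extraction: group by the first word.
      String key = message.split(" ", 2)[0];
      counts.merge(key, 1L, Long::sum);
    }
    buffer.clear();
  }

  public static void main(String[] args) {
    BoundedBufferSketch acc = new BoundedBufferSketch(2);
    acc.add("GET /a");
    acc.add("GET /b");
    acc.add("POST /c");
    System.out.println(acc.result());
  }
}
```

The trade-off in a design like this is that a result built from partial batches may differ from one computed over all messages at once, which is one reason such a limit would be exposed as a tunable safeguard.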

}

public static class LogParserAccumulator implements Accumulator {
private final List<String> logMessages;
Collaborator

Is access to logMessages thread-safe?

@LantaoJin LantaoJin merged commit e6ab4fb into opensearch-project:main Jun 11, 2025
22 checks passed
penghuo pushed a commit that referenced this pull request Jun 16, 2025
* Revert simple_pattern window function change to recover pushdown ability
* Add SIMPLE_PATTERN patterns command support based on parse command
* Address minor comments
* Address comments part 2
* Make allowCast for pattern VARCHAR literal
* Fix spotless
* Minor ut failure fix
* Brain patterns command in Calcite with combined UDF and UDAF
* Revert debug flag
* Minor ut failure fix
* Minor ut failure fix part2
* Pick missing ast Window from main
* Support agg and label mode and new model for patterns command
* Remove unnecessary files and comments
* Use uncollect_patterns table function to flatten patterns list
* Fix partial UT
* Add 3570 yaml tests
* Fix plans in explain ITs
* Fix pushdown ITs failure
* Fix doctest examples for V2 engine results
* Minor fix after rebasing
* Uncomment build.gradle change
* Address minor comment
* Address patterns doc comments and fix conflicts
* Fix doctest
* Reuse expand command plan to replace hacky uncollect_patterns UDTF
* Minor fix after resolving merge conflicts
* Refactor duplicate building expand rel node logic
* Fix the issue of expand command plan executing main query twice

---------

Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: xinyual <[email protected]>
Labels
calcite calcite migration related
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support BRAIN method of Patterns command in Calcite
3 participants