Calcite patterns command brain pattern method #3570

Conversation

songkant-aws
Contributor

@songkant-aws songkant-aws commented Apr 22, 2025

Description

This aims to resolve #3569

The BRAIN pattern method of the patterns command is implemented with a combination of a UDF and a UDAF.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Songkan Tang <[email protected]>
@songkant-aws songkant-aws requested review from LantaoJin and penghuo May 26, 2025 09:33
Comment on lines 45 to 48
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be simple_pattern. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
Collaborator

If it is optional, add the default value to the doc.

Contributor Author

Done. Added default value

* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
* pattern_buffer_limit: optional. This is a special safeguard parameter for BRAIN algorithm to limit internal temporary buffer to hold processed logs.
Collaborator

remove "special"

Contributor Author

Done

Comment on lines 60 to 61
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
* pattern_buffer_limit: optional. This is a special safeguard parameter for BRAIN algorithm to limit internal temporary buffer to hold processed logs.
Collaborator

If it is optional, add the default value to the doc.

Contributor Author

Done. Added default value.


or

patterns [new_field=<new-field-name>] [pattern=<pattern>] <field> SIMPLE_PATTERN
patterns <field> [by byClause...] pattern_method=SIMPLE_PATTERN [pattern_mode=LABEL | AGGREGATION] [pattern_max_sample_count=integer] [new_field=<new-field-name>] [pattern=<pattern>]
Collaborator

nit: all options belong to the patterns command, so we can simplify them, for instance pattern_method -> method, pattern_mode -> mode.

Contributor Author

Renamed those options


* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
Collaborator

brain or BRAIN, is it case-sensitive?

Member

I think it should be case-insensitive. @songkant-aws please double check. I suggest using lower case in the syntax doc.

Contributor Author

It's case-insensitive. They are now all lower case in the syntax doc.
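
As a rough illustration of the case-insensitive behavior described above, the sketch below normalizes the option value before resolving it. `PatternMethodSketch` and its nested `Method` enum are hypothetical names chosen for this example; they are not taken from the PR, and the plugin may resolve the option differently.

```java
import java.util.Locale;

// Hypothetical sketch: not the plugin's actual parsing code.
class PatternMethodSketch {
  /** Illustrative enum mirroring the two method names mentioned in this thread. */
  enum Method { SIMPLE_PATTERN, BRAIN }

  /** Resolves a user-supplied method name regardless of letter case. */
  static Method parse(String value) {
    if (value == null) {
      throw new IllegalArgumentException("method must not be null");
    }
    // Locale.ROOT avoids locale-specific case mapping (for example the Turkish dotless i).
    return Method.valueOf(value.trim().toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(parse("brain"));
    System.out.println(parse("Simple_Pattern"));
  }
}
```

With this approach `brain`, `BRAIN`, and `Brain` all resolve to the same constant, while an unknown name is rejected with an `IllegalArgumentException` from `Enum.valueOf`.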


* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
Collaborator

By default, it's simple_pattern if the setting plugins.ppl.default.pattern.method is not specified.

The default value is configured by the setting plugins.ppl.default.pattern.method.

Contributor Author

Done

* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
Collaborator

ditto, The default value is configured by the setting?

Contributor Author

Done

+-------------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| patterns_field | pattern_count | sample_logs |
|-------------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <*IP*> - <*> [<*>/Sep/<*>:<*>:<*>:<*> <*>] <*> <*> HTTP/<*><*>" <*> <*> | 4 | [177.95.8.74 - upton5450 [28/Sep/2022:10:15:57 -0700] "HEAD /e-business/mindshare HTTP/1.0" 404 19927,127.45.152.6 - pouros8756 [28/Sep/2022:10:15:57 -0700] "GET /architectures/convergence/niches/mindshare HTTP/1.0" 100 28722,118.223.210.105 - - [28/Sep/2022:10:15:57 -0700] "PATCH /strategize/out-of-the-box HTTP/1.0" 401 27439,210.204.15.104 - - [28/Sep/2022:10:15:57 -0700] "POST /users HTTP/1.1" 301 9481] |
Collaborator

  1. In aggregation mode, does the pattern command collect sample values of IP addresses?
  2. What is the output syntax of the pattern_fields? Can it be used directly in search queries?

Collaborator

  1. In the IT, the detected pattern is PacketResponder failed <token1> blk_<token2>. What does IP mean? Is it a token?

Contributor Author

@songkant-aws songkant-aws Jun 9, 2025

  1. Yes, when Calcite is enabled, the sample logs are converted to sample tokens at different positions.
  2. pattern_field is a string in a format like PacketResponder failed <token1> blk_<token2>, and tokens is a map like {token1: [...], token2: [...]}. I'm not sure what using them directly in search queries means.
  3. <IP> is one of the variable placeholders in BrainLogParser's output. Yes, it's a token in the V2 output format.

I have updated the syntax docs with more examples. When Calcite is enabled, the output is a pattern string with <token*> placeholders plus a map of the corresponding tokens. Users can use those two output columns in further queries.
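
To make the shape of that output concrete, here is a minimal sketch that derives a `{token1: [...], token2: [...]}` map from a pattern string and a few log lines. It is an illustration only: `TokenMapSketch`, `collectTokens`, and the sample log lines are assumptions made for this example, and the regex-based matching is a stand-in rather than how BrainLogParser produces its tokens.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: illustrates the output shape, not the PR's implementation.
class TokenMapSketch {
  private static final Pattern PLACEHOLDER = Pattern.compile("<(token\\d+)>");

  /** Collects, per placeholder, the value found at that position in each matching log. */
  static Map<String, List<String>> collectTokens(String template, List<String> logs) {
    // Build a regex from the template: literal text is quoted,
    // each <tokenN> placeholder becomes a named group matching non-whitespace.
    StringBuilder regex = new StringBuilder();
    List<String> names = new ArrayList<>();
    Matcher placeholder = PLACEHOLDER.matcher(template);
    int last = 0;
    while (placeholder.find()) {
      regex.append(Pattern.quote(template.substring(last, placeholder.start())));
      regex.append("(?<").append(placeholder.group(1)).append(">\\S+?)");
      names.add(placeholder.group(1));
      last = placeholder.end();
    }
    regex.append(Pattern.quote(template.substring(last)));
    Pattern compiled = Pattern.compile(regex.toString());

    // LinkedHashMap keeps token1, token2, ... in template order.
    Map<String, List<String>> tokens = new LinkedHashMap<>();
    for (String name : names) {
      tokens.put(name, new ArrayList<>());
    }
    for (String log : logs) {
      Matcher m = compiled.matcher(log);
      if (m.matches()) { // logs that do not fit the template are skipped
        for (String name : names) {
          tokens.get(name).add(m.group(name));
        }
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    List<String> logs = List.of(
        "PacketResponder failed 1 blk_38865049064139660",
        "PacketResponder failed 2 blk_-6952295868487656571");
    System.out.println(
        collectTokens("PacketResponder failed <token1> blk_<token2>", logs));
  }
}
```

One limitation of this simplification: when two placeholders are adjacent with no literal text between them, the split between their values is ambiguous, so a real parser needs positional information that a plain regex does not have.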

Collaborator

Not sure what does it mean by using them directly in search queries?

For instance,
As a user, if the detected pattern is <token1> - <token2> [<token3>/Sep/<token4>:<token5>:<token6>:<token7> <token8>] <token9> <token10> HTTP/<token11><token12>\" <token13> <token14>, I want to search for logs that match this pattern.
I think it cannot be used directly; the query would need to be revised, which is out of scope for this PR.

Member

@LantaoJin LantaoJin left a comment

Please update the code base and address the latest comments.


Comment on lines 45 to 48
PATTERN_MODE: 'PATTERN_MODE';
PATTERN_METHOD: 'PATTERN_METHOD';
PATTERN_MAX_SAMPLE_COUNT: 'PATTERN_MAX_SAMPLE_COUNT';
PATTERN_BUFFER_LIMIT: 'PATTERN_BUFFER_LIMIT';
Member

How about simplifying the arguments by removing the pattern_ prefix?

Contributor Author

Removed pattern_ prefix

Comment on lines 28 to 31
DEFAULT_PATTERN_METHOD("plugins.ppl.default.pattern.method"),
DEFAULT_PATTERN_MODE("plugins.ppl.default.pattern.mode"),
DEFAULT_PATTERN_MAX_SAMPLE_COUNT("plugins.ppl.default.pattern.max.sample.count"),
DEFAULT_PATTERN_BUFFER_LIMIT("plugins.ppl.default.pattern.buffer.limit"),
Member

Can we remove the default part? It's meaningless IMO.
plugins.ppl.default.pattern.method -> plugins.ppl.pattern.method

Contributor Author

Removed default part

* max_sample_count: optional. Max sample logs returned per pattern in aggregation mode (default: 10). The max_sample_count is configured by the setting ``plugins.ppl.pattern.max.sample.count``.
* buffer_limit: optional. Safeguard parameter for ``brain`` algorithm to limit internal temporary buffer size (default: 100,000, min: 50,000). The buffer_limit is configured by the setting ``plugins.ppl.pattern.buffer.limit``.
* new_field: Alias of the output pattern field. (default: "patterns_field").
* algorithm parameters: optional. Algorithm-specific tuning:
Collaborator

* new_field: Alias of the output pattern field. (default: "patterns_field").
* algorithm parameters: optional. Algorithm-specific tuning:
- ``simple_pattern`` : Define regex via "pattern".
- ``brain`` : Adjust sensitivity with variable_count_threshold (int > 0) and frequency_threshold_percentage (double 0.0 - 1.0).
Collaborator

nit: explain what variable_count_threshold and frequency_threshold_percentage are.
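
The ranges quoted in the doc lines above (variable_count_threshold as an int > 0, frequency_threshold_percentage as a double between 0.0 and 1.0, buffer_limit defaulting to 100,000 with a minimum of 50,000) could be enforced along the lines of the sketch below. `BrainParamsSketch` is a hypothetical holder class written for this example; whether the 0.0 and 1.0 bounds are inclusive is an assumption, and the PR's own validation may differ.

```java
// Hypothetical sketch: parameter names follow the doc text, the class itself is invented.
class BrainParamsSketch {
  static final int DEFAULT_BUFFER_LIMIT = 100_000; // default stated in the doc
  static final int MIN_BUFFER_LIMIT = 50_000; // minimum stated in the doc

  final int variableCountThreshold;
  final double frequencyThresholdPercentage;
  final int bufferLimit;

  /** Pass null for bufferLimit to fall back to the documented default. */
  BrainParamsSketch(
      int variableCountThreshold, double frequencyThresholdPercentage, Integer bufferLimit) {
    if (variableCountThreshold <= 0) {
      throw new IllegalArgumentException("variable_count_threshold must be > 0");
    }
    // Assumption: both ends of the 0.0 - 1.0 range are allowed.
    if (Double.isNaN(frequencyThresholdPercentage)
        || frequencyThresholdPercentage < 0.0
        || frequencyThresholdPercentage > 1.0) {
      throw new IllegalArgumentException(
          "frequency_threshold_percentage must be between 0.0 and 1.0");
    }
    int limit = bufferLimit == null ? DEFAULT_BUFFER_LIMIT : bufferLimit;
    if (limit < MIN_BUFFER_LIMIT) {
      throw new IllegalArgumentException("buffer_limit must be >= " + MIN_BUFFER_LIMIT);
    }
    this.variableCountThreshold = variableCountThreshold;
    this.frequencyThresholdPercentage = frequencyThresholdPercentage;
    this.bufferLimit = limit;
  }

  public static void main(String[] args) {
    BrainParamsSketch params = new BrainParamsSketch(5, 0.3, null);
    System.out.println(params.bufferLimit);
  }
}
```

Rejecting out-of-range values at construction time keeps bad input away from the algorithm itself; the sample values 5 and 0.3 in `main` are arbitrary.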

import org.opensearch.sql.common.patterns.PatternUtils.ParseResult;

public class LogPatternAggFunction implements UserDefinedAggFunction<LogParserAccumulator> {
private int bufferLimit = 100000;
Collaborator

Is bufferLimit needed?

}

public static class LogParserAccumulator implements Accumulator {
private final List<String> logMessages;
Collaborator

Is access to logMessages thread-safe?
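
This page does not record an answer to the two questions above. If the accumulator can be reached from more than one thread, one conservative option is to guard the buffer with the object's monitor and fold it into a compact result whenever it reaches the limit. The sketch below shows that idea only: `BufferedAccumulatorSketch` is a hypothetical class, and counting identical messages is a placeholder for the pattern-merging step, not the PR's logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not the PR's LogParserAccumulator.
class BufferedAccumulatorSketch {
  private final int bufferLimit;
  private final List<String> buffer = new ArrayList<>();
  private final Map<String, Long> counts = new HashMap<>();

  BufferedAccumulatorSketch(int bufferLimit) {
    if (bufferLimit <= 0) {
      throw new IllegalArgumentException("bufferLimit must be positive");
    }
    this.bufferLimit = bufferLimit;
  }

  /** Adds one message; synchronized so concurrent callers cannot corrupt the buffer. */
  synchronized void add(String message) {
    buffer.add(message);
    if (buffer.size() >= bufferLimit) {
      flush(); // bounds memory: the raw buffer never grows past bufferLimit
    }
  }

  /** Number of messages currently held in the raw buffer. */
  synchronized int bufferedSize() {
    return buffer.size();
  }

  /** Folds any remaining buffered messages and returns a snapshot of the counts. */
  synchronized Map<String, Long> result() {
    flush();
    return new HashMap<>(counts);
  }

  // Callers must hold this object's monitor.
  private void flush() {
    for (String message : buffer) {
      counts.merge(message, 1L, Long::sum); // placeholder for pattern merging
    }
    buffer.clear();
  }

  public static void main(String[] args) {
    BufferedAccumulatorSketch acc = new BufferedAccumulatorSketch(2);
    acc.add("a");
    acc.add("a");
    acc.add("b");
    System.out.println(acc.result());
  }
}
```

If the query engine guarantees that each accumulator is only touched by a single thread, the `synchronized` keywords are unnecessary overhead and a plain list would do.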

@LantaoJin LantaoJin merged commit e6ab4fb into opensearch-project:main Jun 11, 2025
22 checks passed
Labels
calcite calcite migration releated
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support BRAIN method of Patterns command in Calcite
3 participants