Calcite patterns command brain pattern method #3570
Signed-off-by: Songkan Tang <[email protected]>
docs/user/ppl/cmd/patterns.rst
Outdated
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be simple_pattern. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
If it is optional, add the default value to the doc.
Done. Added default value
docs/user/ppl/cmd/patterns.rst
Outdated
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
* pattern_buffer_limit: optional. This is a special safeguard parameter for BRAIN algorithm to limit internal temporary buffer to hold processed logs.
remove "special"
Done
docs/user/ppl/cmd/patterns.rst
Outdated
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
* pattern_buffer_limit: optional. This is a special safeguard parameter for BRAIN algorithm to limit internal temporary buffer to hold processed logs.
If it is optional, add the default value to the doc.
Done. Added default value.
docs/user/ppl/cmd/patterns.rst
Outdated

or

patterns [new_field=<new-field-name>] [pattern=<pattern>] <field> SIMPLE_PATTERN
patterns <field> [by byClause...] pattern_method=SIMPLE_PATTERN [pattern_mode=LABEL | AGGREGATION] [pattern_max_sample_count=integer] [new_field=<new-field-name>] [pattern=<pattern>]
nit: all options are under the patterns command, so we can simplify them, for instance pattern_method -> method, pattern_mode -> mode.
Renamed those options
docs/user/ppl/cmd/patterns.rst
Outdated

* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
brain or BRAIN, is it case-sensitive?
I think it should be case-insensitive. @songkant-aws please double check. I suggest using lower case in the syntax doc.
It's case-insensitive. They are now all in lower case in the syntax doc.
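To make the case-insensitivity point concrete, here is a minimal sketch of how a method name could be normalized before lookup. The names `PatternMethodDemo`, `PatternMethod`, and `parse` are hypothetical and are not taken from this PR; only the two method values (`simple_pattern`, `brain`) come from the docs under review.

```java
import java.util.Locale;

// Hypothetical names for illustration only; the PR's own types may differ.
class PatternMethodDemo {

  enum PatternMethod {
    SIMPLE_PATTERN,
    BRAIN;

    // Upper-cases with Locale.ROOT so "brain", "Brain" and "BRAIN" resolve to the
    // same constant regardless of the JVM's default locale.
    static PatternMethod parse(String raw) {
      try {
        return valueOf(raw.trim().toUpperCase(Locale.ROOT));
      } catch (IllegalArgumentException e) {
        throw new IllegalArgumentException("Unsupported pattern method: " + raw, e);
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(PatternMethod.parse("brain"));
    System.out.println(PatternMethod.parse("Simple_Pattern"));
  }
}
```

Using `Locale.ROOT` avoids locale-specific case mappings (for example the Turkish dotless i) changing how an option value is interpreted.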
docs/user/ppl/cmd/patterns.rst
Outdated

* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
By default, it's simple_pattern if the setting plugins.ppl.default.pattern.method is not specified.
The default value is configured by the setting plugins.ppl.default.pattern.method.
Done
docs/user/ppl/cmd/patterns.rst
Outdated
* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
Ditto: is the default value configured by a setting?
Done
+-------------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ||
| patterns_field | pattern_count | sample_logs | | ||
|-------------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| <*IP*> - <*> [<*>/Sep/<*>:<*>:<*>:<*> <*>] <*> <*> HTTP/<*><*>" <*> <*> | 4 | [177.95.8.74 - upton5450 [28/Sep/2022:10:15:57 -0700] "HEAD /e-business/mindshare HTTP/1.0" 404 19927,127.45.152.6 - pouros8756 [28/Sep/2022:10:15:57 -0700] "GET /architectures/convergence/niches/mindshare HTTP/1.0" 100 28722,118.223.210.105 - - [28/Sep/2022:10:15:57 -0700] "PATCH /strategize/out-of-the-box HTTP/1.0" 401 27439,210.204.15.104 - - [28/Sep/2022:10:15:57 -0700] "POST /users HTTP/1.1" 301 9481] | |
- In aggregation mode, does the patterns command collect sample values of IP addresses?
- What is the output syntax of patterns_field? Can it be used directly in search queries?
- In the IT, the detected pattern is PacketResponder failed <token1> blk_<token2>. What does IP mean? Is it a token?
- Yes, when Calcite is enabled, the sample logs will be converted to sample tokens at different positions.
- patterns_field will be a string in a format like PacketResponder failed <token1> blk_<token2>. tokens will be a map like {token1: [...], token2: [...]}. Not sure what does it mean by using them directly in search queries?
- <IP> is one of the variable placeholders in BrainLogParser's output. Yes, it's a token in the V2 output format.
I have updated the syntax docs with more examples. When Calcite is enabled, the output syntax is a pattern string with <token*> placeholders plus a map of the corresponding tokens. Users can leverage those two output columns for further queries.
"Not sure what does it mean by using them directly in search queries?"
For instance, as a user, if the detected pattern is <token1> - <token2> [<token3>/Sep/<token4>:<token5>:<token6>:<token7> <token8>] <token9> <token10> HTTP/<token11><token12>\" <token13> <token14>, I want to search for logs that match this pattern.
I think it cannot be used directly; the query would need to be revised. That is out of scope for this PR.
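As an illustration of what "searching for logs that match a pattern" could involve on the client side, the following sketch turns a `<tokenN>` pattern string into a regular expression. It is not part of this PR: `TokenPatternMatcher` and `toRegex` are hypothetical names, and the sketch assumes every placeholder stands for a run of non-whitespace characters, which the thread does not confirm.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR.
class TokenPatternMatcher {
  // Matches placeholders such as <token1>, <token12>.
  private static final Pattern PLACEHOLDER = Pattern.compile("<token\\d+>");

  // Quotes the literal text between placeholders and replaces each placeholder
  // with a capturing group. Assumption: a token contains no whitespace.
  static Pattern toRegex(String tokenPattern) {
    StringBuilder regex = new StringBuilder("^");
    Matcher m = PLACEHOLDER.matcher(tokenPattern);
    int last = 0;
    while (m.find()) {
      regex.append(Pattern.quote(tokenPattern.substring(last, m.start())));
      regex.append("(\\S+?)");
      last = m.end();
    }
    regex.append(Pattern.quote(tokenPattern.substring(last)));
    return Pattern.compile(regex.append("$").toString());
  }

  public static void main(String[] args) {
    Pattern p = toRegex("PacketResponder failed <token1> blk_<token2>");
    System.out.println(p.matcher("PacketResponder failed 1 blk_123").matches());
    System.out.println(p.matcher("PacketResponder ok 1 blk_123").matches());
  }
}
```

A regex like this only helps for filtering logs already fetched; running it as a search query against the index is the part the reviewer notes is out of scope.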
Please update the code base and address the latest comments.
PATTERN_MODE: 'PATTERN_MODE';
PATTERN_METHOD: 'PATTERN_METHOD';
PATTERN_MAX_SAMPLE_COUNT: 'PATTERN_MAX_SAMPLE_COUNT';
PATTERN_BUFFER_LIMIT: 'PATTERN_BUFFER_LIMIT';
How about simplifying the arguments by removing the pattern_ prefix?
Removed the pattern_ prefix.
DEFAULT_PATTERN_METHOD("plugins.ppl.default.pattern.method"),
DEFAULT_PATTERN_MODE("plugins.ppl.default.pattern.mode"),
DEFAULT_PATTERN_MAX_SAMPLE_COUNT("plugins.ppl.default.pattern.max.sample.count"),
DEFAULT_PATTERN_BUFFER_LIMIT("plugins.ppl.default.pattern.buffer.limit"),
Can we remove the default part? It's meaningless IMO.
plugins.ppl.default.pattern.method -> plugins.ppl.pattern.method
Removed the default part.
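For readers following the setting rename, here is a rough sketch of how an option's effective value might be resolved. The key `plugins.ppl.pattern.method` and the `simple_pattern` fallback come from this thread; the precedence order (query argument first, then the setting, then the fallback), the class name `PatternOptionResolver`, and the use of a plain `Map` in place of the plugin's settings object are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical resolver; a plain Map stands in for the plugin's settings.
class PatternOptionResolver {
  // Setting key after the "default" segment was removed in this thread.
  static final String METHOD_SETTING = "plugins.ppl.pattern.method";
  // Fallback named in the docs under review.
  static final String BUILT_IN_DEFAULT = "simple_pattern";

  // Assumed precedence: explicit query argument, then the setting, then the fallback.
  static String resolveMethod(Optional<String> queryArgument, Map<String, String> settings) {
    return queryArgument.orElseGet(
        () -> settings.getOrDefault(METHOD_SETTING, BUILT_IN_DEFAULT));
  }

  public static void main(String[] args) {
    System.out.println(resolveMethod(Optional.of("brain"), Map.of()));
    System.out.println(resolveMethod(Optional.empty(), Map.of(METHOD_SETTING, "brain")));
    System.out.println(resolveMethod(Optional.empty(), Map.of()));
  }
}
```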
* max_sample_count: optional. Max sample logs returned per pattern in aggregation mode (default: 10). The max_sample_count is configured by the setting ``plugins.ppl.pattern.max.sample.count``.
* buffer_limit: optional. Safeguard parameter for ``brain`` algorithm to limit internal temporary buffer size (default: 100,000, min: 50,000). The buffer_limit is configured by the setting ``plugins.ppl.pattern.buffer.limit``.
* new_field: Alias of the output pattern field. (default: "patterns_field").
* algorithm parameters: optional. Algorithm-specific tuning:
nit: this line renders in italics. Is that expected? https://github.com/opensearch-project/sql/blob/8f468a5b92b4b3a2ca5fb2798f78f329c88b4581/docs/user/ppl/cmd/patterns.rst#syntax
* new_field: Alias of the output pattern field. (default: "patterns_field").
* algorithm parameters: optional. Algorithm-specific tuning:
- ``simple_pattern`` : Define regex via "pattern".
- ``brain`` : Adjust sensitivity with variable_count_threshold (int > 0) and frequency_threshold_percentage (double 0.0 - 1.0).
nit: explain what variable_count_threshold and frequency_threshold_percentage are.
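The documented ranges for the `brain` parameters can at least be written down as checks. The sketch below only encodes the bounds quoted in the diff (variable_count_threshold > 0, frequency_threshold_percentage between 0.0 and 1.0, buffer_limit at least 50,000). Treating both ends of the 0.0 to 1.0 range as inclusive is an assumption, `BrainParameterCheck` is a hypothetical name, and the sample values 5 and 0.3 are arbitrary; it does not explain what the parameters mean.

```java
// Hypothetical validator; encodes only the ranges quoted in the doc diff.
class BrainParameterCheck {
  static final int MIN_BUFFER_LIMIT = 50_000;

  static void validate(
      int variableCountThreshold, double frequencyThresholdPercentage, int bufferLimit) {
    if (variableCountThreshold <= 0) {
      throw new IllegalArgumentException("variable_count_threshold must be > 0");
    }
    // Written as a negated range check so that NaN is rejected as well.
    if (!(frequencyThresholdPercentage >= 0.0 && frequencyThresholdPercentage <= 1.0)) {
      throw new IllegalArgumentException(
          "frequency_threshold_percentage must be within 0.0 - 1.0");
    }
    if (bufferLimit < MIN_BUFFER_LIMIT) {
      throw new IllegalArgumentException("buffer_limit must be >= " + MIN_BUFFER_LIMIT);
    }
  }

  public static void main(String[] args) {
    validate(5, 0.3, 100_000); // arbitrary sample values, accepted
    System.out.println("parameters accepted");
  }
}
```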
import org.opensearch.sql.common.patterns.PatternUtils.ParseResult;

public class LogPatternAggFunction implements UserDefinedAggFunction<LogParserAccumulator> {
private int bufferLimit = 100000;
Is bufferLimit needed?
}

public static class LogParserAccumulator implements Accumulator {
private final List<String> logMessages;
Is access to logMessages thread-safe?
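The thread does not say whether an accumulator instance is ever shared across threads. If it were, one conservative option is to guard the list with the object's monitor, as in the sketch below. `SynchronizedLogBuffer`, `add`, and `drain` are hypothetical names rather than the PR's code, and the idea of draining once a buffer limit is reached is only an assumption about how such a limit might be used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffer; shows monitor-based guarding of a shared list.
class SynchronizedLogBuffer {
  private final List<String> logMessages = new ArrayList<>();
  private final int bufferLimit;

  SynchronizedLogBuffer(int bufferLimit) {
    this.bufferLimit = bufferLimit;
  }

  // Appends a message and reports whether the buffer has reached its limit.
  synchronized boolean add(String message) {
    logMessages.add(message);
    return logMessages.size() >= bufferLimit;
  }

  // Returns a copy of the buffered messages and clears the buffer in one atomic step.
  synchronized List<String> drain() {
    List<String> snapshot = new ArrayList<>(logMessages);
    logMessages.clear();
    return snapshot;
  }

  public static void main(String[] args) {
    SynchronizedLogBuffer buffer = new SynchronizedLogBuffer(2);
    buffer.add("first log line");
    if (buffer.add("second log line")) {
      System.out.println(buffer.drain().size());
    }
  }
}
```

If each accumulator is confined to a single thread, none of this locking is needed and a plain list is sufficient.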
Description
This aims to resolve #3569. The BRAIN pattern method of the Patterns command is implemented by a combined UDF and UDAF.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
--signoff.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.