Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved patterns command with new algorithm #3263

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

songkant-aws
Copy link

@songkant-aws songkant-aws commented Jan 24, 2025

Description

This PR introduces enhancement on original patterns command. It keeps the patterns command naming and rebuild it with specific window functions to parse log messages with different pattern method(log parser algorithms). See this RFC: #3251

The sample query input and output will be like:

  • New brain algorithm log parser:
image
  • Original regex based patterns command will have minor change:
image

Caveat:
For now, this is a draft PR for initial review so that we can collect major feedbacks to be addressed later.

Related Issues

Resolves #3251

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@YANG-DB
Copy link
Member

YANG-DB commented Jan 24, 2025

Thanks for the initiative!!

Copy link
Collaborator

@penghuo penghuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general,

new Function(
ctx.pattern_method != null
? ctx.pattern_method.getText().toLowerCase(Locale.ROOT)
: BuiltinFunctionName.BRAIN.name(), // By default, use new algorithm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we introduce a setting to control the default pattern algorithms?

Copy link
Author

@songkant-aws songkant-aws Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see. Like a java property to specify default algorithm name? I can add that.

((NamedArgumentExpression) expression).getValue().valueOf().stringValue())
.findFirst()
.orElse("");
return new StreamPatternRowWindowFrame(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need StreamPatternRowWindowFrame? could we just use CurrentRowWindowFrame instead? patternExpression could be member of StreamPatternWindowFunction?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will remove the need of StreamPatternRowWindowFrame in the next revision.

List<Expression> sortFields =
windowDefinition.getSortList().stream().map(Pair::getRight).collect(Collectors.toList());

ExprValue last = peers.get(peers.size() - 1);
return resolve(sortFields, last).equals(resolve(sortFields, next));
}

private boolean isSamePartition(ExprValue next) {
protected boolean isSamePartition(ExprValue next) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert unnecessary change.

return value.stringValue();
})
.toList();
this.preprocessedMessages.addAll(logParser.preprocessAllLogs(logMessages));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BufferPatternRowsWindowFrame should have it own spec, doet it means over all rows? @dai-chen WindowFrame definition and How to use WIndowFrame should be seperate, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For simplicity, I set the spec to empty partition by and empty sort by during AST tree parsing unresolved WindowFunction, which treats the window frame range is all rows on coordinator node. Because I haven't seen requirements on sorting and partitioning on other columns.

I think we can add partition by and sort by syntax if we see values in case users want to specify them. Thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think users are free to use pattern functions added with any window frame definition.

@penghuo penghuo added the enhancement New feature or request label Jan 25, 2025
return value.stringValue();
})
.toList();
this.preprocessedMessages.addAll(logParser.preprocessAllLogs(logMessages));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think users are free to use pattern functions added with any window frame definition.

Comment on lines +41 to +42
repository.register(brain());
repository.register(simplePattern());
Copy link
Collaborator

@dai-chen dai-chen Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking are these new algorithm really window function? If users specify order by or partition by, does it still generate meaningful result?

UnresolvedExpression sourceField,
String alias,
java.util.Map<String, Literal> arguments) {
return new Pattern(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wonder if pattern is mostly syntax sugar for pattern window function, is new logical operator still required?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the integration with LogicalWindow, I see they are quite similar. Yeah, I think Pattern operator is probably not needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find the Window unresolvedPlan. So replace Pattern with Window instead.

import org.opensearch.sql.expression.window.frame.WindowFrame;

@EqualsAndHashCode(callSuper = true)
public class BufferPatternWindowFunction extends FunctionExpression
Copy link
Collaborator

@dai-chen dai-chen Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order by expression in window definition decides who're peers right? What's the order by expression for buffer (offline) pattern function? I didn't find it and may miss it in design doc:

Project[message#1, patterns_field#2]
+- Window[brain(message#1), windowsSpec(partitionBy=null, frame=PeerRowsWindowFrame)]
   +- OpenSearchIndexScan

Copy link
Collaborator

@dai-chen dai-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A high-level question about the long-term approach: Currently, the pattern command always generates a new field, which offers flexibility as users can perform aggregation (stats) or other operations on it afterward. However, for large datasets, I assume most users would prefer functionality similar to CloudWatch Logs Query Syntax - Pattern.

Do you see any potential issues with the current implementation, which is based on SQL window functions combined with aggregate functions, as we scale this feature in the future?

@songkant-aws
Copy link
Author

songkant-aws commented Jan 28, 2025

Do you see any potential issues with the current implementation, which is based on SQL window functions combined with aggregate functions, as we scale this feature in the future?

@dai-chen It's a good question. Actually, it's one of my previous thought to use a new specific operator to directly return grouped logs with sample count or grouped samples.

It's more like a compound aggregation operator with a default group by. But current aggregation abstraction only supports row by row iteration. The log pattern algorithms cover both streaming and buffering computing paradigm. Then I figure out this is more close to window operator's abstraction. To achieve what Cloudwatch has done, I think we could extend pattern functions from AggregateWindowFunction that has a special aggregator state structure.

As to partition by and sort by specification, my initial plan is to deliver a simple version without them because that's probably enough.
partition by is probably more useful than sort by. Imagine users may want to see different log patterns for INFO, WARN, ERROR partitions.
I haven't seen a use case for sort by yet. Maybe they need to see latest logs sort by date?

@songkant-aws
Copy link
Author

songkant-aws commented Jan 28, 2025

To scale the pattern function, I think other issues are pending investigation.

  1. How to push down operator to OpenSearch data node? Is Painless script one of option?
  2. OpenSearch doesn't have shuffle mechanism for now, how to perform large dataset aggregation like operation with simple map reduce? If data volume is quite large, spilling additional data to disk is probably required during execution.
  3. To log pattern, streaming algorithms are probably more suitable to large scale dataset. We may need to introduce more by then.

@songkant-aws
Copy link
Author

AvgTime benchmark:

Benchmark                                          (testDataType)  Mode  Cnt  Score    Error  Units
ComparisonOperatorBenchmark.testEqualOperator                 int  avgt    3  0.002 ±  0.001  ms/op
ComparisonOperatorBenchmark.testEqualOperator              string  avgt    3  0.005 ±  0.001  ms/op
ComparisonOperatorBenchmark.testEqualOperator                date  avgt    3  0.001 ±  0.001  ms/op
ComparisonOperatorBenchmark.testGreaterOperator               int  avgt    3  0.002 ±  0.001  ms/op
ComparisonOperatorBenchmark.testGreaterOperator            string  avgt    3  0.005 ±  0.002  ms/op
ComparisonOperatorBenchmark.testGreaterOperator              date  avgt    3  0.001 ±  0.001  ms/op
ComparisonOperatorBenchmark.testLessOperator                  int  avgt    3  0.002 ±  0.001  ms/op
ComparisonOperatorBenchmark.testLessOperator               string  avgt    3  0.006 ±  0.001  ms/op
ComparisonOperatorBenchmark.testLessOperator                 date  avgt    3  0.002 ±  0.001  ms/op
PatternsWindowFunctionBenchmark.testBrain                     N/A  avgt    3  0.131 ±  0.002  ms/op
PatternsWindowFunctionBenchmark.testSimplePattern             N/A  avgt    3  0.001 ±  0.001  ms/op

@songkant-aws
Copy link
Author

This PR still needs some action items.

  1. Pending adding setting to switch default log pattern algorithm
  2. Pending discussion whether we want to deliver partition by and order by given its effect. (Log patterns are probably special window function that is not sensitive to partition by and order by)

@songkant-aws songkant-aws changed the title [DRAFT] Improved patterns command with new algorithm Improved patterns command with new algorithm Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] Improve patterns command with more advanced log pattern algorithms
4 participants