Skip to content

Conversation

youjie23
Copy link

  • ​​Add​​ alarm recovery detection with a recovery-observation-period(default 0).
  • Store​​ the alarm recovery record with the same UUID as the related alarm record.
  • Notify​​ hooks using a recovery-text-template. The default template is '[Recovered]' + text-template, and the notification includes the recoveryTime.

Submodule PR:

@wu-sheng wu-sheng added backend OAP backend related. feature New feature labels Oct 11, 2025
@youjie23
Copy link
Author

youjie23 commented Oct 11, 2025

Apologies for the oversight. While merging the latest master code, the @BanyanDB.Group annotation in the AlarmRecoveryRecordclass was accidentally missed, which caused the e2e test failure @wankai123 @wu-sheng
​​I will fix it immediately and re-run the tests.​

@wu-sheng
Copy link
Member

Take your time.

Comment on lines 96 to 97
Long recoveryTime = getAlarmRecoveryTime(alarmRecord.getUuid(), duration);
AlarmMessage alarmMessage = buildAlarmMessage(alarmRecord, recoveryTime);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concerns about the way you are doing this. Querying status from a list usually results a bad performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should at least get the alarm list first. Then use the UUID list to retrieve the recovery list.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the helpful feedback. I've pushed new commits to address the points you raised. Please take another look when you have a moment, and let me know if anything else needs adjustment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I'm not entirely sure if this is an issue on my end. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Is there anything I need to do on my side to allow them to run to completion?

@wu-sheng
Copy link
Member

Please fix CI.

@youjie23
Copy link
Author

Please fix CI.

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Could you please spare a moment to guide me on what I need to do to get them to run to completion?

@wu-sheng
Copy link
Member

Are you setting the recovery quickly enough? They are running for over one hour, and be cancelled due to preset timeout

@youjie23
Copy link
Author

youjie23 commented Oct 15, 2025

Are you setting the recovery quickly enough? They are running for over one hour, and be cancelled due to preset timeout

It seems unrelated to the test cases. I observed that some test cases had been verified successfully before the 18-minute mark, but the test did not continue execute. like [E2E test (Alarm ES, test/e2e-v2/cases/alarm/es/e2e.yaml)] (https://github.com/apache/skywalking/actions/runs/18516094658/job/52781047577#logs) which just cost 10minute to detect recovery.
And it’s not just the alarm case that gets stuck. Other verified cases also did not continue to execute. like E2E test (Log FluentBit ES 8.8.1, test/e2e-v2/cases/log/fluent-bit/e2e.yaml, ES_VERSION=8.8.1)

@wu-sheng
Copy link
Member

Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend OAP backend related. feature New feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Enhance the alarm kernel with recovered status notification capability for alarm rules.

2 participants