
feat(anomaly detection): Maintain historical timeseries #1670

Merged
merged 7 commits into main from anomaly-detection/maintain-historical on Jan 3, 2025

Conversation

@aayush-se (Member) commented Dec 21, 2024

  • Maintains historical data in a new table DbDynamicAlertTimeSeriesHistory, which stores the information required for us to recreate and test any issues (a model sketch follows this list)
    • alert_id : int
    • timestamp : int
    • value : float
    • anomaly_type : str
    • saved_at : datetime
  • A cron job runs once a week to clean up data in this table that is over 90 days old
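A minimal sketch of what the new model could look like, assuming the repo's SQLAlchemy declarative setup (the base class and nullability here are assumptions; column types follow the PR description and the migration excerpts below):

from sqlalchemy import BigInteger, Column, DateTime, Float, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()  # assumption: the repo has its own declarative Base


class DbDynamicAlertTimeSeriesHistory(Base):
    __tablename__ = "dynamic_alert_time_series_history"

    id = Column(BigInteger, primary_key=True, autoincrement=True)
    alert_id = Column(BigInteger, nullable=False)   # which dynamic alert the point belongs to
    timestamp = Column(DateTime, nullable=False)    # when the time series point occurred
    value = Column(Float, nullable=False)           # metric value of the point
    anomaly_type = Column(String, nullable=False)   # detected anomaly classification
    saved_at = Column(DateTime, nullable=False)     # when the row was written to history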

@aayush-se aayush-se requested a review from a team as a code owner December 21, 2024 00:22
@aayush-se aayush-se requested a review from ram-senth December 30, 2024 23:55
"dynamic_alert_time_series_history",
sa.Column("id", sa.Integer(), autoincrement=True, nullable=False),
sa.Column("alert_id", sa.BigInteger(), nullable=False),
sa.Column("timestamp", sa.DateTime(), nullable=False),
Member
Since we are cleaning up history based on timestamp, do we need an index on this field?

Member Author

The id column is there to avoid duplicate rows, since the other columns, including the timestamp ones, can be identical if rows are added at the same time.

Member

I was asking about the timestamp column. Since the WHERE clause of the SQL will have timestamp, we need an index on it (not a primary index, as id will still be the primary key).
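For reference, one way such an index could be added in the Alembic migration (a sketch only, with an illustrative index name; not necessarily how the PR resolved it):

# non-unique index on timestamp so the cleanup query's WHERE clause can avoid a full table scan
op.create_index(
    "ix_dynamic_alert_time_series_history_timestamp",  # illustrative name
    "dynamic_alert_time_series_history",
    ["timestamp"],
    unique=False,
)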

# ### commands auto generated by Alembic - please adjust! ###
op.create_table(
    "dynamic_alert_time_series_history",
    sa.Column("id", sa.Integer(), autoincrement=True, nullable=False),
Member

Curious, how come alert_id is BigInteger but this id is Integer? We will have more rows in this table than in the alert table, right?

Member Author

Good catch -- fixed
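Presumably the fix widens the id column to match; something along these lines (a sketch of the likely change, not the exact diff):

sa.Column("id", sa.BigInteger(), autoincrement=True, nullable=False),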

with Session() as session:
    deleted_count = (
        session.query(DbDynamicAlertTimeSeriesHistory)
        .filter(DbDynamicAlertTimeSeriesHistory.saved_at < date_threshold)
Member

This condition is applied across all alerts, so it can be a costly operation. Also, should we be filtering on timestamp instead of saved_at?

Member Author

The reason I am filtering on saved_at instead of timestamp is that I thought we want the timeseries points to be available for 90 days after they are deleted, not after they are created. But if that is not the case, I can change it to timestamp instead.

Member Author

Regarding the costly operation, is there another way to do this, since this table doesn't have any specific relationships to other tables aside from being a storage table?

Member

I was thinking of having 90 days of time series data regardless of when it was saved to Seer. As to the deletion operation, can you check if SQLAlchemy has any batch deletion feature that we can leverage?

Member Author

Filtering by timestamp and using the Delete construct to remove rows more efficiently:
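A minimal sketch of what that could look like, assuming SQLAlchemy 1.4+ style with a Core delete() executed through the ORM session (Session, the model name, and the 90-day threshold follow the surrounding discussion; this is not necessarily the PR's final code):

from datetime import datetime, timedelta

from sqlalchemy import delete

# 90-day retention window, per the PR description
date_threshold = datetime.utcnow() - timedelta(days=90)

with Session() as session:
    stmt = delete(DbDynamicAlertTimeSeriesHistory).where(
        DbDynamicAlertTimeSeriesHistory.timestamp < date_threshold
    )
    result = session.execute(stmt)  # emits a single DELETE ... WHERE, no per-row ORM loading
    session.commit()
    deleted_count = result.rowcount  # number of rows removed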

@aayush-se aayush-se merged commit c730bd7 into main Jan 3, 2025
11 checks passed
@aayush-se aayush-se deleted the anomaly-detection/maintain-historical branch January 3, 2025 23:39