[FEATURE REQUEST] Unsupervised Evaluation Metrics #131

Open
anton164 opened this issue Nov 1, 2022 · 4 comments

Comments


anton164 commented Nov 1, 2022

Is your feature request related to a problem? Please describe.
The current evaluation metrics in evaluate/anomaly.py assume that a ground truth is available. However, in many time series anomaly detection problems there is no ground truth.

It would be great if the Merlion evaluation base classes were more general and supported this use case. As of now, we effectively have to implement our own evaluation methods.

Describe the solution you'd like
I think ideally methods/classes such as TSADEvaluator.evaluate, TSADScoreAccumulator and accumulate_tsad_score should not assume that there is a ground truth - other interfaces in the Merlion package typically take test labels as an optional argument. Similarly, the evaluation classes should be able to compute unsupervised descriptive statistics if a ground truth is not passed.
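
To make the request concrete, here is a rough sketch of the kind of interface I have in mind. The helper name evaluate_anomaly_scores and the returned dict are hypothetical, not existing Merlion APIs; TimeSeries, TSADScoreAccumulator, and accumulate_tsad_score are the real pieces, used here purely for illustration.

```python
from typing import Optional, Union

from merlion.evaluate.anomaly import TSADScoreAccumulator, accumulate_tsad_score
from merlion.utils import TimeSeries


def evaluate_anomaly_scores(
    predict: TimeSeries, ground_truth: Optional[TimeSeries] = None
) -> Union[TSADScoreAccumulator, dict]:
    """Hypothetical helper: ground truth labels are optional rather than required."""
    if ground_truth is not None:
        # Labels available: defer to the existing supervised path.
        return accumulate_tsad_score(ground_truth=ground_truth, predict=predict)
    # No labels: fall back to unsupervised descriptive statistics.
    scores = predict.to_pd().iloc[:, 0]
    return {
        "n_points_flagged": int((scores != 0).sum()),
        "score_mean": float(scores.mean()),
        "score_std": float(scores.std()),
    }
```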

aadyotb (Contributor) commented Nov 1, 2022

@anton164 Thanks for the comment. You highlight a fundamental challenge with anomaly detection -- often, ground truth labels are unavailable. But in my experience, the most common metrics people use to evaluate anomaly detection algorithms are the ones supported in Merlion, all of which require ground truth labels. If you have (1) specific unsupervised metrics in mind, and (2) a compelling use case for them, you are welcome to open a pull request adding them to the repo, and I can review it. But for the time being, I'm not sure how useful these unsupervised metrics would really be.

anton164 (Author) commented Nov 1, 2022

Thanks for your prompt reply, @aadyotb. I might do that to demonstrate what I mean. Which classes would you recommend I extend for that demonstration? From a design perspective, TSADEvaluator, which handles the historical analysis, is "coupled" to ground-truth label evaluation, so maybe I'll implement another version of it that isn't.

From an unsupervised perspective, it would be useful to have a simple way to compute the following metrics (see the sketch after this list):

  • number of anomalous points detected
  • number of continuous anomalies/alarms detected
  • detections per week/day/hour/minute
  • distribution statistics for anomaly scores

As you point out, GT labels are often unavailable, so it's surprising to me that Merlion, which promises to be a complete framework for TS anomaly detection, does not have any guidance here. Happy to try to incorporate some ideas :)

One flow I would like to support is self-supervision using Merlion (rough sketch after the steps):

  1. Run unsupervised detection using a suite of simple models
  2. Compare metrics & inspect detected anomalies to identify the unsupervised detector that is the best starting point
  3. Treat the "best" predictions in step 2 as a fuzzy ground truth and tune an advanced model
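
For concreteness, a rough sketch of steps 1 and 2, using two of Merlion's simpler detectors as examples (the model choices and the run_suite helper are mine, not a proposed API; train and get_anomaly_label are the standard detector methods):

```python
from merlion.models.anomaly.isolation_forest import IsolationForest, IsolationForestConfig
from merlion.models.anomaly.windstats import WindStats, WindStatsConfig
from merlion.utils import TimeSeries


def run_suite(train: TimeSeries, test: TimeSeries) -> dict:
    """Step 1: run a suite of simple unsupervised detectors and collect their alarms."""
    candidates = {
        "isolation_forest": IsolationForest(IsolationForestConfig()),
        "windstats": WindStats(WindStatsConfig()),
    }
    alarms = {}
    for name, model in candidates.items():
        model.train(train)  # unsupervised training
        alarms[name] = model.get_anomaly_label(test)
    # Step 2: compare unsupervised summaries (e.g. unsupervised_summary above) and manually
    # inspect the flagged windows to pick a starting point. Step 3 would then treat the
    # chosen model's alarms as a fuzzy ground truth when tuning a more advanced detector.
    return alarms
```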

aadyotb (Contributor) commented Nov 1, 2022

Thanks for clarifying. From an implementation perspective, I'd suggest leaving TSADEvaluator unchanged. It would probably be much simpler to just extend TSADScoreAccumulator. For the specific metrics you mention, you might even be able to get away with something like accumulate_tsad_score(ground_truth=scores, predict=scores) and just examine the true positive/negative statistics accumulated.
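
A tiny sketch of that trick, assuming `alarms` is the thresholded output of get_anomaly_label (the helper name is mine, not a Merlion API):

```python
from merlion.evaluate.anomaly import TSADScoreAccumulator, accumulate_tsad_score
from merlion.utils import TimeSeries


def count_detections(alarms: TimeSeries) -> TSADScoreAccumulator:
    # Feed the detector's own alarms in as "ground truth": with predict == ground_truth,
    # every detection is a true positive, so the accumulated TP statistics reduce to
    # counts of detected points/alarms.
    return accumulate_tsad_score(ground_truth=alarms, predict=alarms)
```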

For distribution statistics, one potentially interesting direction would be to characterize how much the test scores deviate from a standard normal distribution, since calibration reshapes the distribution of training scores to look like a standard normal (note that this is more sophisticated than mean/variance normalization). So if the test scores don't seem like they've been drawn from a standard normal, this could be an indicator of distribution shift over time.
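
As a sketch of that check (using scipy rather than anything in Merlion), one could run a goodness-of-fit test of the calibrated test scores against N(0, 1) and treat a rejection as a drift signal:

```python
import numpy as np
from scipy import stats


def looks_shifted(calibrated_test_scores: np.ndarray, alpha: float = 0.01) -> bool:
    """True if the calibrated test scores are unlikely to have been drawn from N(0, 1)."""
    _statistic, p_value = stats.kstest(calibrated_test_scores, "norm")
    return p_value < alpha
```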

I'm much more hesitant to support the self-supervised labeling approach. In practice, time series anomalies vary widely (raw spikes/dips, changes in trend, deviations from standard seasonal patterns, ...). When dealing with multivariate time series, things get even more complex. Simple models often either fail to detect these more complex anomalies, or have low precision when doing so. And in many cases, users care about detecting one type of anomaly but not another. Beyond getting actual labels (and even that can be controversial), I unfortunately don't have a great answer for this problem, and I haven't seen one in the literature either.

anton164 (Author) commented Nov 2, 2022

Thanks for sharing your thoughts, @aadyotb! I will give it a try and report back once I have a demo in Merlion.
