Add support for model based outlier detection #160

Open
mitchelloharawild opened this issue May 28, 2019 · 7 comments

@mitchelloharawild
Member

The signature that I'm imagining for this function is:
outliers(model, data, level, ...)

It returns a tibble containing the rows of data that model classifies as outliers at the given level of confidence.

A default method is also defined which uses quantiles of residuals.
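As a rough illustration of what a residual-quantile fallback could look like, here is a minimal base R sketch. The helper name `quantile_outliers()` and its arguments are hypothetical and are not part of fabletools; residuals are passed in as a plain numeric vector rather than extracted from a model.

```r
# Hypothetical helper, not part of fabletools: keep the rows whose residual
# falls outside the central `level` quantile band of all residuals.
quantile_outliers <- function(data, resid, level = 0.95) {
  stopifnot(length(resid) == nrow(data), level > 0, level < 1)
  tail_prob <- (1 - level) / 2
  bounds <- quantile(resid, probs = c(tail_prob, 1 - tail_prob), na.rm = TRUE)
  is_out <- !is.na(resid) & (resid < bounds[1] | resid > bounds[2])
  data[is_out, , drop = FALSE]
}

# Toy data: a flat series with one spike; residuals from a constant (median) fit.
toy <- data.frame(index = 1:20, value = c(rep(10, 9), 50, rep(10, 10)))
toy_resid <- toy$value - median(toy$value)
flagged <- quantile_outliers(toy, toy_resid, level = 0.9)
```

One caveat with a pure quantile rule: on continuous residuals it flags roughly a `1 - level` share of the observations no matter how well the model fits, so it ranks observations more than it tests them.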

@mitchelloharawild mitchelloharawild added this to the v0.2.0 milestone May 29, 2019
@davidtedfordholt
Contributor

Another possibility for the output of something like identify_outliers() would be a tsibble with a new column called .outlier (or a vector stored elsewhere in the object). It could be as simple as a logical, but it could also hold a statistic measuring the degree to which each observation is an outlier. This would allow for easy plotting of outliers with autoplot() and the like.
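A minimal base R sketch of the augmented output, using a plain data frame in place of a tsibble. The helper `flag_outliers()`, the `.score` column, and the choice of a robust z-score (distance from the median in units of the MAD) are all assumptions made for illustration; nothing in the thread fixes the statistic.

```r
# Hypothetical helper: return the full data with two extra columns
# instead of a subset of rows.
flag_outliers <- function(data, resid, threshold = 3) {
  stopifnot(length(resid) == nrow(data))
  # Robust z-score: distance from the median residual in units of the MAD.
  # Assumes the MAD is non-zero, i.e. the residuals are not mostly identical.
  score <- abs(resid - median(resid, na.rm = TRUE)) / mad(resid, na.rm = TRUE)
  data$.score <- score
  data$.outlier <- !is.na(score) & score > threshold
  data
}

toy <- data.frame(index = 1:10,
                  value = c(10, 12, 11, 13, 12, 40, 11, 12, 10, 13))
augmented <- flag_outliers(toy, resid = toy$value - median(toy$value))

# The subset form is still one filter away.
only_outliers <- augmented[augmented$.outlier, ]
```

Keeping every row means the same object can feed a plot, a replacement step, or a filter that recovers the subset-only output.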

This would also allow for a function that would smooth_outliers(), which could be anything from an implementation of forecast::tsclean() up through implementing all the methods from imputeTS::na_interpolation(). If this created, for example, a new column called .imputed, then you could autoplot() the series, showing the original and altered series, for easy comparison. You could also easily compare methods of identifying outliers, etc.

tsbl %>%
  identify_outliers(ARIMA(value), .98) %>%
  smooth_outliers("StructTS") %>%
  model(THETA(value))

@Fuco1

Fuco1 commented Oct 6, 2020

I second @davidtedfordholt: having all the data "in-line" is very nice to work with, even if it is (maybe?) less memory efficient. The format OP proposes can then easily be derived using a filter.

We could of course also just extend the tsibble afterwards using a left join or something, but that feels a bit more "annoying". I like to pipe data through steps and "enrich" it on the way, and returning a subset would make this difficult or impossible.

@davidtedfordholt
Contributor

It's also simple enough to handle the case where smooth_outliers() is called on a tsibble that has no .outlier variable: it could call a function that identifies outliers and imputes replacement values in a single step, to address use cases where efficiency is paramount.

@mitchelloharawild
Member Author

I still like the idea of outliers() being a function which returns a tibble of the row numbers of outlying observations (or perhaps better, a tsibble of the outliers themselves).

Another higher level function like smooth_outliers() (or possibly replace_outliers()?) can then build upon outliers(), model(), and interpolate().

Much like how outliers will be determined with a model-based approach, the way in which they're replaced should also be done via a model specification. I would prefer StructTS(y~...) rather than "StructTS", where StructTS(y~...) is a model specification much like ARIMA(y~...).

@davidtedfordholt
Contributor

It seems beyond the scope of this to consider outlier time series within a larger population of time series. Are we interested in handling both point and subsequence outliers?

Trying to get the idea solidly in my head: outliers() would follow model(), take a threshold specification, and then return a subset of the input tsbl containing the rows with abs(.resid) > threshold?

I think a part of my struggle with the output being either the row numbers or the outliers by themselves, rather than an augmented tsbl, is that we would then need to feed both the output of outliers() and the original tsbl into something to replace them. I realize that I'm thinking more in terms of EDA than production.

Here's what I'm thinking. Once we've looked at the data and determined that we need to examine outliers, we plot them. If outliers() outputs a subset tsbl, we have to feed the original object back in.

tsbl %>%
    model(ARIMA(value ~ trend())) %>%
    outliers(~ std_dev(5.4)) %>%    # we know the response from the model
    autoplot(tsbl)

If we want to see the band represented by the threshold, we end up needing to feed autoplot() the mbl, the output of outliers() and the specification details of the call to outliers().

mbl <- model(tsbl, ARIMA(value ~ trend()))
tsbl_outlyr <-  outliers(mbl, ~ std_dev(5.4))        # I saved 1 character and made it look xtrēm
autoplot(tsbl_outlyr, mbl, ~ std_dev(5.4))

If we want to look at a couple of different methods or different thresholds, we're saving objects left and right, and autoplot() is lost to us.

If, on the other hand, we output an augmented version of the original tsbl, we can autoplot(). You can create an additional key for the .detect_method if there are multiple methods, which allows you to facet them to compare. You can also look at what the series look(s) like after you run interpolate(), (which can be made to treat as NA any value marked as an outlier, or just write a wrapper called replace_outliers() or whatever, which does so). You could interpolate using multiple models, based on multiple detection methods, and they would all be available for comparison.
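To make the "treat flagged values as NA, then interpolate" idea concrete, here is a small base R sketch. The helper `replace_flagged()` is hypothetical, and linear interpolation via `approx()` stands in for whatever model-based replacement interpolate() would actually perform.

```r
# Hypothetical helper: set flagged values to NA, then fill them by linear
# interpolation, keeping the original and imputed series side by side.
replace_flagged <- function(value, flagged) {
  stopifnot(length(value) == length(flagged))
  masked <- value
  masked[flagged] <- NA
  keep <- !is.na(masked)
  # approx() interpolates every position from the retained points;
  # rule = 2 carries the nearest retained value out to the series ends.
  filled <- approx(x = which(keep), y = masked[keep],
                   xout = seq_along(masked), rule = 2)$y
  data.frame(value = value, .outlier = flagged, .imputed = filled)
}

series <- c(10, 12, 11, 13, 12, 40, 11, 12, 10, 13)
result <- replace_flagged(series, flagged = seq_along(series) == 6)
```

Because the result keeps `value`, `.outlier`, and `.imputed` together, the original and altered series can be plotted or compared directly, which is the main argument for the augmented form.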

I can't come up with a place where it seems more useful to have a subset of the original tsbl than it would be to have an augmented one. I feel like I'm missing something. That said, I think I might also be trying to protect the excellent name outliers() from being used for a function that I believe has limited use (subsetting), rather than it being the centerpiece. That could mean it identifies outliers for piped use (outputting an augmented tsbl), or as a function that can take both a threshold and a replacement formula (calling interpolate()), returning a tsbl of the same dimensions as the original, but with new values for outliers.

@mitchelloharawild
Member Author

FYI the outliers() generic was added to {fabletools} in tidyverts/fabletools@e6631da

There is an outliers method for the feasts::X_13ARIMA_SEATS() model, and I hope other models will be supported. There will likely be a residual based outlier detection fallback method, similar to what has been described in this thread.

@brunocarlin
Contributor

I think following the recipes structure may be beneficial. I have created a package, tidy.outliers, that implements outlier detection as a step.
