Outlier removal - why using 85 percentile #109

Open
imagejan opened this issue Jun 21, 2018 · 3 comments

Comments

@imagejan

The node documentation for Outlier Removal says:

Boxplot: outliers are defined to be greater than "Q85+Factor*IQR" or smaller than "Q25 - Factor*IQR",
where Q are the quantiles and IQR is the inter quantile range. Be careful the default 3 goes with
the default method (Mean +- SD). A standard value for this method would be 1.5.

which is in line with the source code:

double lowerQuantile = stats.getPercentile(25);
double upperQuantile = stats.getPercentile(85);

But why is 85 used for the upper limit (instead of 75, the upper quartile), while 25 (the lower quartile) is used for the lower limit?
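To make the effect of this asymmetry concrete, here is a minimal, self-contained sketch that compares the fences obtained with Q75 and with Q85. It rests on several assumptions: the class `BoxplotFences` and its methods are hypothetical and not part of the node, the percentile is a simple linear-interpolation estimate that may differ from what `stats.getPercentile` returns, and the "IQR" is taken to be the upper quantile minus the lower quantile.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not the node's implementation.
public class BoxplotFences {

    // Linear-interpolation percentile (assumption: may not match
    // the estimator behind stats.getPercentile in the node).
    public static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double pos = p * (sorted.length - 1) / 100.0;
        int lo = (int) Math.floor(pos);
        int hi = Math.min(lo + 1, sorted.length - 1);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns {lowerFence, upperFence}; the range is assumed to be
    // upper quantile minus lower quantile.
    public static double[] fences(double[] values, double lowerP,
                                  double upperP, double factor) {
        double lower = percentile(values, lowerP);
        double upper = percentile(values, upperP);
        double range = upper - lower;
        return new double[] { lower - factor * range, upper + factor * range };
    }

    public static void main(String[] args) {
        // Made-up sample: 1..20 plus one large value, 33.
        double[] data = new double[21];
        for (int i = 0; i < 20; i++) {
            data[i] = i + 1;
        }
        data[20] = 33;

        double[] standard = fences(data, 25, 75, 1.5);
        double[] withQ85 = fences(data, 25, 85, 1.5);

        System.out.println("Q25/Q75 fences: " + Arrays.toString(standard));
        System.out.println("Q25/Q85 fences: " + Arrays.toString(withQ85));
        System.out.println("33 flagged with Q75: " + (33 > standard[1]));
        System.out.println("33 flagged with Q85: " + (33 > withQ85[1]));
    }
}
```

On this made-up sample, the Q25/Q75 fences are -9 and 31, so 33 is flagged as an outlier, while the Q25/Q85 fences are -12 and 36, so 33 is kept. Under these assumptions, using 85 widens both fences and makes the test less sensitive on either side.
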

Apparently, someone else had this question as well 😄 :

uBound = stats.getPercentile(75); //TODO: ask Felix why he used 85 instead of 75!

@Meyenhofer any comments on this?

@niederle
Collaborator

True, it looks like a typo. It seems I had already started to implement a new version of this node's NodeModel (some years ago...) and was wondering about it too.

@imagejan
Author

For others stumbling upon this: you can easily get a standard boxplot outlier removal (i.e. 1.5 times the interquartile range) using an R Snippet node with the following code (without grouping, though...):

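# "myColumn" is a placeholder for the name of the input column
x <- knime.in$"myColumn"
# keep only the values that boxplot.stats() does not report as outliers
result <- x[!x %in% boxplot.stats(x)$out]
knime.out <- data.frame(result)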

See also:
https://stackoverflow.com/a/4937343/1919049

@fmeyenhofer
Collaborator

Given that the box should include 50% of all the samples, 85 instead of 75 must have been a typo.
