Skip to content

Latest commit

 

History

History
64 lines (53 loc) · 2.09 KB

outliers-z.md

File metadata and controls

64 lines (53 loc) · 2.09 KB

Identify Outliers

View interactive notebook here.

Description

Outlier detection is a key step to any analysis. It allows you to quickly spot errors in data collection, or which points you may need to remove before you do any statistical modeling. There are a number of ways to find outliers, this snippet focuses on the Z score method.

The Data

For this example, we'll be looking at some data that shows the total number of daily streams for the Top 200 Tracks on Spotify every day since Jan 1 2017:

Data:

streams day
234109876 25/10/2020
... ....

Z-Score

Similar to Method 2, the z-score method tells us the probability of a point occurring in a normal distribution.

image

Image from simplypsychology.org To use z-scores to identify outliers, we can set an alpha (significance level) that will determine how 'likely' each point will be in order for it to be classified as an outlier. This method also assumes the data is normally distributed.

SELECT 
   find_z.*,
   case 
      when abs(z_score) >=abs(<Z_THRESHOLD>) then 'outlier' else 'not outlier' 
   end label
FROM (
   SELECT
      <OTHER_COLUMNS>,
      <OUTLIER_COLUMN>,
      (<OUTLIER_COLUMN> - avg(<OUTLIER_COLUMN>) over ())
/ (stddev(<OUTLIER_COLUMN>) over ()) z_score
   FROM <TABLE> AS data) find_z
ORDER BY abs(z_score) DESC

where:

  • <OTHER_COLUMNS>: Columns selected other than the outlier column
  • <OUTLIER_COLUMN>: Column name for outlier
  • <Z_THRESHOLD>: z-score threshold to identify outliers (suggested: 1.96)

Example

select find_z.*,
case when abs(z_score) >=abs(1.96) then 'outlier' else 'not outlier' end label
from (select 
  day, 
  streams, 
  (streams - avg(streams) over ())
     / (stddev(streams) over ()) z_score
from data) find_z
order by abs(z_score) desc
day streams z_score label
07/01/2017 177567024 -1.971 outlier
02/12/2020 294978502 1.957 not outlier