Explore this snippet with some demo data here.
Cumulative distribution functions (CDF) are a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater. One method for calculating a CDF is as follows:
select
-- Use a row_number window function to get the position of this row
(row_number() over (order by <quantity> asc)) / (select count(*) from <table>) cdf,
<quantity>,
from <table>
where
quantity
- the column containing the metric of interesttable
- the table name
Using total Spotify streams as an example data source, let's identify:
table
- this is calledraw
quantity
- this is the columnstreams
then the query becomes:
select
(row_number() over (order by streams asc)) / (select count(*) from raw) frac,
streams,
from raw
frac | streams |
---|---|
0 | 374359 |
0 | 375308 |
... | ... |
1 | 7778950 |