Distinct values from a cube #24

longhotsummer · 2016-10-28T10:45:20Z

We have a postgresql table with about 28 million facts with a financial_year column. Users can use the babbage API to essentially query the distinct financial_year values, which is about 10 unique values.

Postgresql seems to be very naive when doing SELECT DISTINCT financial_year FROM table because it runs a table scan even though financial_year has an index, which takes 60+ seconds. This seems to be a known problem with postgresql.

How have others solved this problem? Do we split out the financial_year data (and all the other dimensions of a fact) into a separate table?

The text was updated successfully, but these errors were encountered:

akariv · 2016-10-29T06:59:28Z

@longhotsummer yes, I hit that brick wall myself a couple of months ago.

I did solve it partially, by changing the DISTINCT queries to GROUP BY ones - turns out they're much quicker (still inexcusably slow, though).
The changes are in this pull request: #22

I didn't merge it yet cause tests were failing for some reason and I didn't have time to understand why (but I'm still using that branch for the time being).

If you want to CR it, and maybe try and understand why those two tests are failing, it could be great. Either way you can take this branch and see if it works for you as well.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distinct values from a cube #24

Distinct values from a cube #24

longhotsummer commented Oct 28, 2016 •

edited

Loading

akariv commented Oct 29, 2016

Distinct values from a cube #24

Distinct values from a cube #24

Comments

longhotsummer commented Oct 28, 2016 • edited Loading

akariv commented Oct 29, 2016

longhotsummer commented Oct 28, 2016 •

edited

Loading