More general mechanics for row grouping (and filtering)? #853

dmos62 · 2021-11-30T18:26:50Z

dmos62
Nov 30, 2021
Collaborator

We noticed that there are situations where we'd like to filter not on rows, but on information derived from groupings of rows (group attributes). Examples of such group-attribute-based filters are: "is this row a duplicate", "is this row unique", "sum of column A in this group is larger than X". Also, we'd like to compose such filters with the simple filters, e.g. or(empty(col1), unique(col2)).
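A rough sketch of what composing a simple filter with a group-based one could look like. All names here (`empty`, `unique`, `or_`, the sample rows) are hypothetical and are not Mathesar's API; the only point is that a group-based predicate needs access to the whole row set, while a simple one only needs the current row.

```python
from collections import Counter

# Hypothetical sample rows; column names are illustrative only.
rows = [
    {"col1": None, "col2": "a"},
    {"col1": "x", "col2": "a"},
    {"col1": "y", "col2": "b"},
]

def empty(column):
    """Simple filter: looks only at the current row."""
    return lambda row, all_rows: row[column] is None

def unique(column):
    """Group-based filter: needs the whole row set to count occurrences."""
    def predicate(row, all_rows):
        # Recounting per row is wasteful; it keeps the sketch short.
        counts = Counter(r[column] for r in all_rows)
        return counts[row[column]] == 1
    return predicate

def or_(*predicates):
    """Compose predicates of either kind behind one signature."""
    return lambda row, all_rows: any(p(row, all_rows) for p in predicates)

combined = or_(empty("col1"), unique("col2"))
matches = [i for i, row in enumerate(rows) if combined(row, rows)]
```

Here `matches` holds the positions of the rows that are either empty in col1 or unique in col2.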

Basically, group-based filters deconstruct to the following algorithm:

1. Partition rows into a set of groups (group-set) according to some grouping-function (think of this as a new temporary column containing group-id);
   - Have a way to declare that grouping-function (e.g. group per unique combination of some column set, group every 5 rows, group per email domain);
2. Derive some group attribute using a group-attribute-function (e.g. a new temporary column containing group size);
   - After this point, a group-attribute (like group size) is accessible like any other column value;
   - Have a way to declare a group-attribute-function;
3. Interop with the above mechanism to do interesting things with those group attributes, such as:
   - Filter and/or sort on those group sizes (e.g. filter on row being duplicate or unique);
   - Visually group rows on the frontend;
     - A table can have multiple group-sets, and you can switch to using one or the other;
   - Use group-attributes from multiple different group-sets.
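
A minimal sketch of the three steps above in plain Python, assuming rows are dictionaries. The helper names (`group_by_columns`, `with_group_id`, `with_group_size`) and the sample data are made up for illustration; in Mathesar this would more plausibly be expressed in SQL.

```python
from collections import Counter

# Hypothetical sample rows; column names are illustrative only.
rows = [
    {"id": 1, "email": "a@x.org"},
    {"id": 2, "email": "b@y.org"},
    {"id": 3, "email": "a@x.org"},
]

def group_by_columns(columns):
    """Grouping-function: one group per unique combination of `columns`."""
    return lambda row: tuple(row[c] for c in columns)

def with_group_id(rows, grouping_function, name="group_id"):
    """Step 1: add a temporary column holding each row's group-id."""
    return [{**row, name: grouping_function(row)} for row in rows]

def with_group_size(rows, id_name="group_id", name="group_size"):
    """Step 2: add a temporary column holding the size of each row's group."""
    sizes = Counter(row[id_name] for row in rows)
    return [{**row, name: sizes[row[id_name]]} for row in rows]

# Step 3: the group attribute is now an ordinary column, so plain filters apply.
annotated = with_group_size(with_group_id(rows, group_by_columns(["email"])))
duplicates = [row["id"] for row in annotated if row["group_size"] > 1]
uniques = [row["id"] for row in annotated if row["group_size"] == 1]
```

Both the "duplicate" and the "unique" filter end up as ordinary comparisons on the derived group_size column.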

What this gets us is a unified way to both do row grouping that @mathemancer is working on, and do group-attribute-based filtering that I'm working on, without duplicated logic. Plus, it's pretty powerful in terms of expressiveness.

I can see two things that require more definition:

- How to implement those temporary columns that can hold group-attributes and serve as input to filtering and such? I see these columns being hidden and/or visibly distinguished by default;
- How to define those table functions (like grouping-functions and group-attribute-functions)?

Relevant discussions:

mathemancer · 2021-12-01T03:39:03Z

mathemancer
Dec 1, 2021
Maintainer

One more detail to consider: This is essentially a simple window function query. However, there are more complicated window function queries where the group associated with a row for the purposes of applying a function to that row (e.g., a filter) changes based on the row itself. For example, you could have a filter: "return only rows where the sale_amt value of the current row is greater than that of both the previous row and the following row, ordered by sale_datetime". This would return only rows which are local maxima for the sale_amt column, ordered by date and time. In this case, you can't assign group IDs, since the groups overlap: each consists of three rows (the current, the previous and the next), but for the next assessment, the former next is now the current and the former current is now the previous.

My point isn't that we shouldn't pursue this as a first generalization of the concept of "filtering", but that our portrayal in the API shouldn't rely on group ID concepts (the implementation probably should, though, in the cases you noted).
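
The local-maxima filter described above can be sketched with the LAG and LEAD window functions. This example uses Python's built-in sqlite3 module (SQLite 3.25 or newer) as a stand-in for Postgres, and the `sales` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_datetime TEXT, sale_amt INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2021-12-01", 10),
        ("2021-12-02", 30),
        ("2021-12-03", 20),
        ("2021-12-04", 50),
        ("2021-12-05", 40),
    ],
)

local_maxima = conn.execute(
    """
    WITH neighbours AS (
        SELECT
            sale_datetime,
            sale_amt,
            -- Each row sees its own overlapping three-row window.
            LAG(sale_amt) OVER (ORDER BY sale_datetime) AS prev_amt,
            LEAD(sale_amt) OVER (ORDER BY sale_datetime) AS next_amt
        FROM sales
    )
    SELECT sale_datetime, sale_amt
    FROM neighbours
    WHERE sale_amt > prev_amt AND sale_amt > next_amt
    ORDER BY sale_datetime
    """
).fetchall()
```

The first and last rows drop out because their missing neighbour is NULL and the comparison is therefore not true. Note that no group-id column appears anywhere in the query.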

1 reply

dmos62 Dec 1, 2021
Collaborator Author

What you're describing (a true sliding window) is a single grouping (what I called group-set before) giving multiple group-ids to a single row. So where before I was talking about a group-id column that holds a single group-id, a sliding window would need the group-id column to hold sets of ids (three group-ids in your example).
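
A small sketch of that idea, assuming each window is identified by the index of the row it is centred on. The data and names are hypothetical; the point is only that the group-id "column" holds a set per row rather than a single value.

```python
# Hypothetical sale amounts, already ordered by sale_datetime.
sale_amts = [10, 30, 20, 50, 40]

# Each window (group) is centred on one row and identified by that row's index.
# A row belongs to its own window and to the windows of its neighbours,
# so its group-id cell holds a set of ids.
group_ids = [
    {j for j in (i - 1, i, i + 1) if 0 <= j < len(sale_amts)}
    for i in range(len(sale_amts))
]

def window_members(group_id):
    """All row indices whose group-id set contains `group_id`."""
    return [i for i, ids in enumerate(group_ids) if group_id in ids]

# A row is a local maximum if it beats every other member of its own full window.
local_maxima = [
    i
    for i in range(len(sale_amts))
    if len(window_members(i)) == 3
    and all(sale_amts[i] > sale_amts[j] for j in window_members(i) if j != i)
]
```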

kgodey · 2021-12-03T23:19:38Z

kgodey
Dec 3, 2021
Maintainer

@dmos62: I see what you're describing and it makes sense but I'm not entirely sure what you're proposing to implement so I'm having a hard time providing any useful feedback. It would be helpful if you could write up something concrete that you'd like to take action on and what outstanding questions you'd like feedback on.

A general thought: We don't need very general filtering on groupings of rows yet – we are just trying to solve how to filter for duplicate rows. I'd prefer a solution that prioritizes the immediate problem but is extendable to future problems (even if it will involve some refactoring later).

0 replies

dmos62 · 2021-12-06T09:49:02Z

dmos62
Dec 6, 2021
Collaborator Author

@kgodey you're right, this proposal is abstract at this stage. From the filtering side of things, which I'm most familiar with, things seem to be trivial: it's about having access to those temporary columns when exposing filtering options to the frontend, when accepting them, and when actually filtering. It's on the grouping side of things that group-attributes need to be calculated per-row and made accessible to the rest of Mathesar (or just filtering in this case), and I'm fuzzy about how that could be done in concert with the current grouping system. @mathemancer, could you provide some input?

0 replies

kgodey · 2021-12-06T19:09:15Z

kgodey
Dec 6, 2021
Maintainer

I think we should hold off on generalizing this until we have more use cases designed/planned for implementation. Right now, we only have the duplicates filter and it is difficult to anticipate future needs and architecture based on one use case.

0 replies

mathemancer · 2021-12-07T05:35:07Z

mathemancer
Dec 7, 2021
Maintainer

@dmos62 If you check my range_grouping branch, I did some refactoring of the duplicate-only filter logic (see the db.records.operations.select._get_duplicate_only_cte function). While it's not all the way to the goal (since I just wanted the grouping to work with the filter), it's moving in the direction I'd suggest. The point is that this function produces SQL that operates on a relation (the input is called "table", but any relation would do) and results in a relation. This is what makes CTEs composable. You can just chain them together however you like. So, with that in mind, you could make a quick wrapper to create a CTE from any query (e.g., a query produced by the get_query function in the same module) just by appending .cte() to the select object (query; we should probably be more careful about that, it's kind of all over the place in the current codebase). With that wrapper in hand, you could, for example, pour the table being queried into the get_duplicate_only_cte, and pour the result of that into get_query for definition of further filtering, grouping, and ordering by. Alternatively, if the user wants, you could do it in the other direction. Use get_query to set up a subset of records, and then filter them for duplicates only (and then group the result of that, if you want). We'd need to unwrap some nested calls to make that tidy, but it's definitely doable.
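
A rough illustration of the composability point, not the code on the range_grouping branch: it builds plain SQL strings and runs them through Python's sqlite3 module instead of using SQLAlchemy's `.cte()`, and the helpers `duplicates_only`, `where` and `chain` are hypothetical. Each helper takes the name of a relation and yields a new named relation, so the steps can be chained in either order.

```python
import sqlite3

def duplicates_only(source, column, name):
    """CTE keeping the rows of `source` whose `column` value occurs more than once."""
    return name, (
        f"SELECT * FROM {source} WHERE {column} IN "
        f"(SELECT {column} FROM {source} GROUP BY {column} HAVING COUNT(*) > 1)"
    )

def where(source, condition, name):
    """CTE keeping the rows of `source` that satisfy `condition`."""
    return name, f"SELECT * FROM {source} WHERE {condition}"

def chain(*ctes):
    """Combine CTEs into one query that selects from the last one."""
    # Identifiers are interpolated directly; this sketch assumes trusted input.
    definitions = ", ".join(f"{name} AS ({sql})" for name, sql in ctes)
    return f"WITH {definitions} SELECT * FROM {ctes[-1][0]}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, email TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "a@x.org", 1), (2, "a@x.org", 0), (3, "b@y.org", 1),
     (4, "c@z.org", 1), (5, "c@z.org", 1)],
)

# Duplicates first, then the ordinary filter.
dups_then_filter = chain(
    duplicates_only("people", "email", "dups"),
    where("dups", "active = 1", "active_dups"),
)
# The other direction: ordinary filter first, then duplicates within the subset.
filter_then_dups = chain(
    where("people", "active = 1", "active"),
    duplicates_only("active", "email", "dups"),
)

ids_a = sorted(row[0] for row in conn.execute(dups_then_filter))
ids_b = sorted(row[0] for row in conn.execute(filter_then_dups))
```

The two orderings return different rows (row 1 is a duplicate in the full table but not among active rows), which is why letting the user choose the direction matters.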

Does that help? Or make things more complicated? Or did I miss the point of the question?

1 reply

dmos62 Dec 7, 2021
Collaborator Author

Yeah, that answers the question. Thanks for the input. I'm seeing a directed graph of CTEs that brings all database queries into a single structure.

dmos62 · 2021-12-08T12:04:21Z

dmos62
Dec 8, 2021
Collaborator Author

I'll list some semi-conceptual problems I'm seeing and try and propose a concrete solution after:

- if we filter a table and then derive a view from it, should the view "inherit" the filter? Filtering a table is the conceptual equivalent of a view already (a query performed on other relations), so once we start thinking about views as well, there's some sense of redundancy and conflict between the presentation/API query (the result of the sort_by, filters, group_by parameters passed to the records endpoint) and the Postgres view query;
- if we want to query some table based on its row groups (which is how this train of discussions started), we essentially need to define new columns on that table holding the relevant group attributes; those new columns holding group-attributes would be derived by querying preexisting data, which is the conceptual equivalent of a view;
- if we currently want to group data, filter it and do some complex query on it like duplicates_only, we can't (not in a coherent, expressive way), because these operations are distinct query pipelines and the best we can do is choose the order in which they are performed; ideally we want a homogeneous way to make powerful and expressive queries.

In summary, I came to the realisation that if we used views as the primary way to do sorting, filtering and grouping, that would solve the above problems in an elegant and user-empowering way. I say primary, because we could still use the current records endpoint query param method (what I called a presentation/API query above) for basic sorting and filtering (not grouping though).

What would the workflow look like?

Suppose you want to filter some data to only see duplicate rows (based on column set A). You create a view whose definition contains the following directives:

1. the grouping operation that groups on distinct values in column set A:
   - it gives each row a column containing the group-id unique to the group it belongs to (see the earlier comments in this discussion, #853 (comment), for how we could support sliding window grouping);
   - notice that the grouping operation doesn't do any sorting, filtering or anything other than adding new columns derived from existing data;
2. the counting of how many rows have the same group-id (defined in previous step) as a given row:
   - it gives each row another column containing its group's size;
3. the basic filter to get only the rows whose group size column value (defined in previous step) is more than 1:
   - it gets us only rows that are duplicates across column set A.
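
One way the three directives above could look as a single view definition. This is a sketch under assumptions: it runs on Python's built-in sqlite3 module (SQLite 3.25 or newer) rather than Postgres, the `contacts` table is invented, column set A is taken to be (first_name, last_name), and DENSE_RANK is just one possible way to produce a group-id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [(1, "Ada", "Lovelace"), (2, "Alan", "Turing"),
     (3, "Ada", "Lovelace"), (4, "Grace", "Hopper")],
)

conn.execute(
    """
    CREATE VIEW contacts_duplicates AS
    WITH grouped AS (
        -- Step 1: add a group-id column; no sorting or filtering of rows.
        SELECT *, DENSE_RANK() OVER (ORDER BY first_name, last_name) AS group_id
        FROM contacts
    ),
    sized AS (
        -- Step 2: add a group-size column derived from the group-id.
        SELECT *, COUNT(*) OVER (PARTITION BY group_id) AS group_size
        FROM grouped
    )
    -- Step 3: a basic filter on the derived column.
    SELECT * FROM sized WHERE group_size > 1
    """
)

rows = conn.execute(
    "SELECT id, group_id, group_size FROM contacts_duplicates ORDER BY id"
).fetchall()
```

The derived group_id and group_size columns remain visible in the view, so further filters or sorts can be stacked on them.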

We could package a string of operations like this in some way that a novice user could apply it or look it up easily when creating or altering a view, but it would then unpackage into his view definition without hiding the underlying mechanics. I'm imagining a searchable collection of queries, a bit like a curated stackoverflow.com search.

Advantages:

- it's powerful: the above workflow example translates to more complicated use cases;
  - how powerful it is relies on how expressive our view definition mechanism is: that means the limit of our query capabilities is equal to that of SQL;
- it's expressive: doesn't have problems with conflict from arbitrarily composing filtering, grouping, etc.;
- it's architecturally sound: instead of reinventing data processing pipelines, we're leveraging Postgres/SQL features;
- it's conceptually simple, good for newcomers: if you know SQL you know this and vice-versa.

Disadvantages or neutral notes:

- requires a refactor on backend, frontend and UX: have grouping, filtering, sorting leverage view definitions;
  - reserve current filtering/sorting mechanism as a primitive, quick, highly transitory way to browse/preview data;
  - note, these code areas have been experiencing a lot of activity recently, so this refactor might be faster right now than later;
- relies on having a good way to define a view;
  - if UX for defining views is burdensome, doing interesting queries will necessarily be burdensome too;
- won't be able to have wishy-washy discussions (started and fueled by me) about duplicates_only conflicting with everything.

Challenges

Technically, this doesn't seem challenging. Relevant UX and the concepts surrounding view definitions are the areas where we're still fuzzy, I think:

- How do we make view definitions simple and powerful (or at least simple for starters)?
- How to give views a workflow so smooth that a user is not hesitant to use them everywhere?

@kgodey @mathemancer @silentninja @ghislaineguerin @seancolsen @pavish could you provide feedback?

Context:

Some discussion around views and their properties, like filters, can be found here: Support for creating a view based on an existing table or view #782

Edit: creation of views seems to be relatively undefined in the design specs https://wiki.mathesar.org/design/specs/create-edit-delete-views

6 replies

dmos62 Dec 8, 2021
Collaborator Author

@mathemancer I agree about this not being a short-term priority.

> leveraging Postgres/SQL features

By saying that this proposal would leverage Postgres features, I meant that we wouldn't have to play with separate pipelines for doing grouping, filtering, etc., since it would expose the grouping stages as primitives (columns) of the view/table.

> Also, CTEs are PostgreSQL features

On the topic of views vs CTEs, I think of them like functions and anonymous functions respectively. With views you can reference them when making other views. I guess you could do that with CTEs too, but that would be awkward, since a CTE is only referencable within the same query. Also a view can be materialized, if you want the added performance. For most purposes they're pretty much the same in my mind, so I didn't mention CTEs.

> We're working very hard to avoid the need for users to know SQL.

That's an interesting point. My thinking is that it would be good to give users a good experience of SQL. Minimal learning curve and all that. The user will have to learn and get used to Mathesar abstractions. What I'm thinking is that it would be good if Mathesar abstractions mapped well to SQL abstractions. I didn't mean that a user should be familiar with SQL to use Mathesar, rather that if he were to become familiar with Mathesar, that would translate to being familiar with main concepts in SQL as well. "The painless way of learning and using SQL."

> I do like the idea of exposing the group_metadata column to users under certain conditions

This proposal came about from me thinking how to do that. My first idea was similar to what you're suggesting, I think: adding temporary columns to a table and then filtering or sorting on that. But that's awkward. These derived columns have to be "temporary", since you don't want to add auxiliary columns to your ground-truth, normalized tables. But then that concept of "temporary" columns is problematic on its own. It's conceptually replacing the table you're working on with a view, so why not make it explicit and just create a view and do what you need in its definition. That's my train of thought.

This is also how I use spreadsheets by the way. I have a normalized sheet that is maximally information dense and write-locked, and I use procedural references and derivations to populate auxiliary sheets with analysis, explorations, aggregations, etc. in a declarative way. It's not a single pipe either: some sheets will be derived from multiple others, hence the "checkpoints"/sheets. Generally there's a set of source sheets, then a middle-layer of auxiliary sheets, and then a set of final sheets that contain the accessible presentation of calculations performed in auxiliary sheets on source sheets.

mathemancer Dec 9, 2021
Maintainer

I think we're maybe envisioning the same thing (or something similar). I think that if you want to persist a given arrangement (including some generated columns, whether group_metadata or not), a view is the appropriate tool. I just don't think we should always create the view.

Instead, I think we should make it possible to view the results of a query (which will be a relation in the "relational database" sense) in tabular form, including the metadata column if desired. You should then be able to extend and modify that query using the UI. Finally, if you want to come back to that query result later, you should persist the relation you're currently seeing in the UI as a view.

> What I'm thinking is that it would be good if Mathesar abstractions mapped well to SQL abstractions.

Completely agree with this.

> On the topic of views vs CTEs, I think of them like functions and anonymous functions respectively. With views you can reference them when making other views. I guess you could do that with CTEs too, but that would be awkward, since a CTE is only referencable within the same query.

I'm thinking of them similarly. There would be some underlying CTEs that the user wouldn't know/care about. The result of a query would be displayed to the user. I think they should be able to build up a query from there without persisting a view, though. The idea would be that, due to the nature of CTEs, they'd be able to build up a more complex query using a given query result as a CTE. I agree that if they discover they're using a given query often, they should persist that query as a view. (E.g., if all their work starts with joining the artist and album tables every day, they should persist the results of that join with a view).

> These derived columns have to be "temporary", since you don't want to add auxiliary columns to your ground-truth, normalized tables.

I'm thinking this more along the lines of building a query by building off of a given query result. There are (at least) two ways to do that: use the query (result) as a CTE, or persist it as a view, and then query the view. I think the persisting should be the point of the views, not the query-building. From that perspective, you'd want the user to be able to choose one way or the other based on whether they wanted to come back to a given result in a future session, or use it in multiple other queries, etc.

dmos62 Dec 9, 2021
Collaborator Author

> I think the persisting should be the point of the views, not the query-building.

You've made a good point; I agree.

I think we're on the same page. My main concern at the moment is using more SQL primitives. For example, making group-attributes, like group-id or group-size, into columns (column being the most primitive row attribute) so that we can use plain filtering and sorting on them.

mathemancer Dec 10, 2021
Maintainer

For me as a user, I'd definitely want those derived metadata columns exposed, since I'd be able to construct more complex queries by piecing together queries using those columns, giving me ways to accomplish end results not currently offered by Mathesar. Somewhere we have a discussion about an "advanced mode". I wonder if exposing these metadata or derived columns is the sort of thing that we should show to advanced users, but not beginners? I.e., the advanced user will get more use out of them, and also be less confused by their appearance.

dmos62 Dec 10, 2021
Collaborator Author

Maybe have those derived columns be visually distinct, hidable and hidden by default. An advanced user can change the default behavior if he likes.

Then again, it might be awkward to define a filter on a column that is not visible, or to create a grouping and not see the columns it results in. Visual distinction and comfortable hidability mechanics might be enough.

kgodey · 2021-12-10T22:11:43Z

kgodey
Dec 10, 2021
Maintainer

I'm responding to #853 (comment), doing it in a new comment since the previous thread is already long enough that all replies aren't loaded without a further click.

Some thoughts in no particular order:

- I like the idea of users being able to create and use derived metadata columns in Views. Reading through this discussion has been helpful to me as I think about how to model data sources for Views (see: Column types for Views #838).
- I don't like these ideas:
  - having Views be the only way to do grouping.
  - exposing derived columns to the user in Tables, this is going to be confusing. I'm open to having an "advanced mode" in the future, but I don't think we should prioritize working on it for the first release.
- Our process for figuring out what features to support has usually been to work backwards from the visual designs. The visual design that led to the duplicates_only filter requirement was to make life easier for users when setting NOT NULL constraints. The workflow here is that the user tries to set a NOT NULL constraint but it doesn't work because they have some columns with duplicates. We want to make it easier for them to fix these rows so we show them what those rows are. In this use case, the user is not served by generalizing this mechanic or showing derived columns as far as I can see. Since this is our use case, we also don't need this "filter" to work with our other filters, so if that's what is causing the issue, I'd rather change the UX so that we can't apply other filters on top of the duplicates filter.

5 replies

dmos62 Dec 13, 2021
Collaborator Author

> derived metadata columns in Views

> exposing derived columns to the user in Tables

We might be talking about the same thing, but just to clarify and sum up: the metadata/derived columns only make sense as a query on the data (as in data in the word metadata), so they'll probably always be part of a CTE/view. We were saying that a table page in the UI might not necessarily show the verbatim table, but a query on it (a CTE). When you filter it, for example, that's what happens. That query might introduce new derived columns. The user could then save that query as a Postgres view. I could see a UX where that's not confusing.

mathemancer Dec 15, 2021
Maintainer

Based on @kgodey's comment, it occurred to me that we're kind of creating a distinction in our UI between "filtering" SELECT statements and "generative" SELECT statements, but that distinction doesn't exist in SQL. I think we should make sure we're aware of that fact, and try to proceed in a way that makes it as simple as possible to piece together all the possible SELECT statements in the future.

From this discussion, it also seems we now have 3 ways to look at tables needed in Mathesar:

1. see the table in its "raw" form, maybe filtered or sorted
2. see results of a more general SELECT from a table
3. see a persisted view saved from the results of (1) or (2).

kgodey Dec 16, 2021
Maintainer

I was seeing 1 and 2 as the same thing, hence my thinking that "duplicates only" was a filter. From the user's perspective, I don't think it matters that some "filters" involve information derived from groupings of rows and others don't.

mathemancer Dec 16, 2021
Maintainer

For 2, I was thinking more of the derived columns you mentioned. I.e., you don't want to expose those to all users out of the gate (I agree on that), but if we want to be able to expose derived columns (or the like) at any point, we need ways to combine the concepts of creating / transforming data with the concepts of removing / filtering data when requesting "records" (or more generally data derived from records).

I also agree that the distinction for the user between duplicates_only and even numbers only is likely unimportant, except for confusion around why these two filters don't combine well.

dmos62 Dec 16, 2021
Collaborator Author

> I also agree that the distinction for the user between duplicates_only and even numbers only is likely unimportant, except for confusion around why these two filters don't combine well.

I think the fact that they don't combine well is important. To me that says that we need to decompose them to their constituent parts, because those do compose.

I think we're at least 90% on the same page with @mathemancer. Generative, transformative and filtering concepts should be combined, I totally agree.

Concerning 1 and 2, I'd like to go a step further and not distinguish more complex SELECT queries from basic filtering SELECTs. I think that UI shortcuts for doing basic things are good, but even at a high level there shouldn't be a conceptual difference between the different SELECTs.

kgodey · 2021-12-16T02:19:05Z

kgodey
Dec 16, 2021
Maintainer

I'm not sure where to go from here in this conversation. @dmos62, could you clarify what action you'd like to see coming out of this discussion?

11 replies

dmos62 Dec 17, 2021
Collaborator Author

By the way, primitive metacolumns would simplify things on the frontend since you wouldn't have group-attribute information in a separate data structure.

mathemancer Dec 20, 2021
Maintainer

The reason for wrapping it in a single metadata column is that you can just avoid showing that entire column on the front end (and indeed avoid returning it as a data column at all) more easily. You don't need to find all the metadata columns, just the one. Another benefit is that it avoids cluttering the column namespace. As for the ad-hoc extension, this comes from the fact that an internal process could add to that metadata column without needing to worry about cleaning up after itself.

When we're ready to start actually showing and using this metadata in the front end (e.g., letting the user define filters on metadata attributes), I'd definitely want to move some of this up to primitive columns to create a more structured definition. But, I like having unstructured for internal processes.

It occurs to me that it might make sense to just have a __mathesar_metadata column, and put all metadata in that (multiple groups, other stuff, etc.) instead of the current __mathesar_group_metadata column.

dmos62 Dec 20, 2021
Collaborator Author

Using primitive columns for all data that might be filtered/sorted/grouped/aggregated on does require rethinking clean up and hiding. I'm leaning towards minimal automation and more information for the user. If she knows which column comes from where and can easily hide whatever she wants, both those things are a non-problem. It's true that internal/external data structures have to be squashed into only external, but I think that has a strong upside.

mathemancer Jan 3, 2022
Maintainer

I think the question is whether an auto-generated column that the user didn't specifically ask for is "owned" by the user. For me, it's kind of system info, not user info. OTOH, if I'm the user, I want access to the system columns (though maybe not by default) so I can use them in ways you've described and others.

Another point in favor of using proper columns for each of these metadata in the case where we do want to allow the user to interact with them and, e.g., filter by them: It would enable the proper filters based on the type of the column. In contrast, if everything is wrapped in JSON, we'll have to do extra work to make that metadata available to the normal filtering system.
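
To illustrate the extra work mentioned above, here is a small sketch comparing a generic column filter applied to proper columns versus a JSON-wrapped metadata column. The record layout, the `greater_than` filter and the `unwrap` helper are assumptions made for this example; only the `__mathesar_metadata` column name comes from the discussion.

```python
import json

# Hypothetical records: one layout wraps group metadata in a JSON string,
# the other exposes it as typed columns.
wrapped = [
    {"id": 1, "__mathesar_metadata": json.dumps({"group_id": 7, "group_size": 2})},
    {"id": 2, "__mathesar_metadata": json.dumps({"group_id": 8, "group_size": 1})},
]
flat = [
    {"id": 1, "group_id": 7, "group_size": 2},
    {"id": 2, "group_id": 8, "group_size": 1},
]

def greater_than(column, value):
    """A generic column filter, standing in for the normal filtering system."""
    return lambda row: row[column] > value

# Proper columns: the generic filter applies directly.
flat_ids = [r["id"] for r in filter(greater_than("group_size", 1), flat)]

def unwrap(row):
    """Extra step needed for the JSON layout: decode and lift keys into columns."""
    return {**row, **json.loads(row["__mathesar_metadata"])}

# JSON-wrapped: the same filter only works after unwrapping every row.
wrapped_ids = [
    r["id"] for r in filter(greater_than("group_size", 1), map(unwrap, wrapped))
]
```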

dmos62 Jan 3, 2022
Collaborator Author

I think of these "meta" columns as generated by the user. I.e. user asks for them. We can provide recipes for common tasks, so as not to overwhelm beginners, but ultimately the user generates those columns because they are useful when processing the rest of the columns.
