
feat(plugins): Time Comparison in Table Chart #27355

Conversation

@Antonio-RiveroMartnez (Member) commented Mar 1, 2024

SUMMARY

We are introducing the same instant time comparison controls that the BigNumber with Time Comparison plugin uses into the Table Chart. This will allow users to quickly compare numeric data in tables against fixed time ranges or a custom time range.

As with the previous work on the BigNumber with Time Comparison chart, these changes are behind the experimental feature flag and have no effect on existing charts if the flag is OFF or the control is not activated in the panel.

Unlike BigNumber with Time Comparison, we won't build multiple queries in the frontend and send them to the API. Instead, we're expanding the original queries with a new field that indicates whether the query should produce a JOINed query between the two time ranges.

Any new viz type that adopts these controls must use the single-query approach; only BigNumber with Time Comparison will keep using the multiple-queries approach until a follow-up PR updates it.

The main benefit of a single query over multiple queries is that we can create the aforementioned JOIN statements to produce correct results; with JOINs we can, for example, support comparison queries with server pagination, which is not possible when running queries in isolation.
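For illustration, a hedged sketch of what such an expanded query payload could look like; the field name instant_time_comparison_range comes up later in this thread, while the overall shape and values here are simplified assumptions:

```python
# Illustrative sketch only; shape and values are simplified assumptions.
query_object = {
    "columns": ["deal_size", "postal_code", "order_date"],
    "metrics": ["SUM(sales)"],
    "filters": [
        {"col": "order_date", "op": "TEMPORAL_RANGE", "val": "Last week"},
    ],
    "row_limit": 1000,
    "row_offset": 0,  # server pagination keeps working because the JOIN
                      # happens inside the same SQL statement
    # New field: tells the backend to build a single self-JOINed query
    # against the shifted range instead of running isolated queries.
    "instant_time_comparison_range": "INHERITED",
}
```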

This PR will be followed by other non-blocking adjustments that are still in the works:

  1. Make use of a control that will display the time ranges used in both queries
  2. Make use of the DATEDIFF function to build the comparison time range
  3. Enhancements to the UI with expand/collapse features and others

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

New metrics:
Screenshot 2024-03-01 at 14 11 51

New Metrics with Server Pagination:
Screenshot 2024-03-01 at 14 13 01

Regular Table if FF is OFF or the control is not being used:
Screenshot 2024-03-01 at 14 14 25

TESTING INSTRUCTIONS

  1. Make sure the feature flag CHART_PLUGINS_EXPERIMENTAL is ON
  2. Create a table chart and use the new controls to display comparison metrics

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@zhaoyongjie (Member) commented Mar 1, 2024

@Antonio-RiveroMartnez please hold on this change. I'll push a datediff function in datetime_parser this weekend; we should take care of the table chart refactoring.

@Antonio-RiveroMartnez (Member, Author) commented Mar 1, 2024

Hey @zhaoyongjie yes, as mentioned in the description of the PR, there are some PRs that could either be follow-ups or be updated before this one. Let me know when your PR is ready and I'll adjust this one if it's still open, or in a later PR. Thanks so much for the help with that 🎉

PS: We're not refactoring the table chart, we're adding features behind the experimental feature flag

codecov bot commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 58.00000% with 84 lines in your changes missing coverage. Please review.

Project coverage is 69.52%. Comparing base (69d870c) to head (78d3190).
Report is 4 commits behind head on master.

Files Patch % Lines
...d/plugins/plugin-chart-table/src/transformProps.ts 47.82% 26 Missing and 10 partials ⚠️
...tend/plugins/plugin-chart-table/src/TableChart.tsx 27.02% 24 Missing and 3 partials ⚠️
...ntend/plugins/plugin-chart-table/src/buildQuery.ts 33.33% 6 Missing and 6 partials ⚠️
...nd/plugins/plugin-chart-table/src/controlPanel.tsx 16.66% 4 Missing and 1 partial ⚠️
...ins/plugin-chart-table/src/DataTable/DataTable.tsx 0.00% 0 Missing and 2 partials ⚠️
...ugin-chart-table/src/DataTable/hooks/useSticky.tsx 0.00% 1 Missing ⚠️
superset/connectors/sqla/models.py 98.30% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #27355      +/-   ##
==========================================
- Coverage   69.71%   69.52%   -0.19%     
==========================================
  Files        1909     1909              
  Lines       74604    74813     +209     
  Branches     8323     8372      +49     
==========================================
+ Hits        52008    52014       +6     
- Misses      20544    20726     +182     
- Partials     2052     2073      +21     
Flag Coverage Δ
hive 53.70% <26.15%> (-0.07%) ⬇️
javascript 57.14% <38.51%> (-0.07%) ⬇️
mysql ?
postgres ?
presto 53.65% <26.15%> (-0.07%) ⬇️
python 82.84% <98.46%> (-0.30%) ⬇️
sqlite 77.52% <26.15%> (-0.09%) ⬇️
unit 56.89% <98.46%> (+0.37%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@Antonio-RiveroMartnez force-pushed the table_instant_time branch 2 times, most recently from 53200c5 to d97220a, on March 1, 2024 at 13:41
@zhaoyongjie (Member) commented:

> @Antonio-RiveroMartnez please hold on this change. I'll push a datediff function in datetime_parser; we should take care of the table chart refactoring.

> Hey @zhaoyongjie yes, as mentioned in the description of the PR, there are some PRs that could either be follow-ups or be updated before this one. Let me know when your PR is ready and I'll adjust this one if it's still open, or in a later PR. Thanks so much for the help with that 🎉

> PS: We're not refactoring the table chart, we're adding features behind the experimental feature flag

Thanks for the mention, and I'd welcome discussing the details of this PR. Since we have tens of thousands of table charts and external users calling the API, we have to keep the query_object clearly defined and reusable.

As in our previous discussion, the time comparison might have a more flexible approach.

@Antonio-RiveroMartnez (Member, Author) commented:

/testenv up FEATURE_CHART_PLUGINS_EXPERIMENTAL=true

github-actions bot (Contributor) commented Mar 1, 2024

@Antonio-RiveroMartnez Ephemeral environment spinning up at http://54.186.218.22:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

@michael-s-molina added and removed the review-after-release label (indicates that a release is in progress and that this PR might be reviewed later) Mar 1, 2024
@Antonio-RiveroMartnez (Member, Author) commented:

/testenv up FEATURE_CHART_PLUGINS_EXPERIMENTAL=true

github-actions bot (Contributor) commented Mar 4, 2024

@Antonio-RiveroMartnez Ephemeral environment spinning up at http://18.236.153.199:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

@zhaoyongjie (Member) commented:

I opened a PR to build the time comparison time range in the frontend; please merge that PR first, and then let's review this PR.


# Excerpt from the backend diff under review:
new_filters = temporal_filters + non_temporal_filters
query_obj_clone["filter"] = new_filters
if instant_time_comparison_range != InstantTimeComparison.CUSTOM:
@Antonio-RiveroMartnez (Member, Author) commented:

Bringing the convo from #27386 here. Things like this, @zhaoyongjie, are what I mean when I say the backend needs to build the time ranges. Right now, if you delete that code in your PR, we would still need to process something similar here, which would put it back after your deletion. That's the biggest reason why building the ranges only in the frontend, as your PR does, doesn't seem like a good fit for all cases. As for why/what/when a JOIN is needed, you can read this PR's description and we can follow up on it. Thanks

@zhaoyongjie (Member) commented:

@Antonio-RiveroMartnez Thanks for the mention. I think this line might handle this case: https://github.com/apache/superset/pull/27386/files#diff-19624c67b284caff0365272493407106402fc540e32ec2851927d61fc70a8d9fR33, but I don't fully understand the code in the current function.

Basically, we should build/transform the QueryObject in the frontend and send the corresponding QueryObject to the backend. For example, we don't need to care about the value of InstantTimeComparison in the backend, because the goal is to replace (InstantTimeComparison.CUSTOM) or change (the other values of InstantTimeComparison) some filter values based on the previous QueryObject.

@Antonio-RiveroMartnez (Member, Author) commented:

Opening this convo here too in case we want to talk more with a broader scope, regarding why the backend needs to process time_ranges and not only the frontend.

- Handle column names with spaces
- Add column separators to better identify comparison groups
- Align text of comparison metrics to the left
- Add comments to help navigate the new JOIN query operation for time_comparison
- When using time comparison, columns cannot be rearranged
@zhaoyongjie (Member) commented:

@Antonio-RiveroMartnez @eschutho @yousoph

Thank you for addressing this issue. I've tested the PR and revisited your code. Upon further examination, I can confirm that, having incorporated your join logic, the time range can also be generated in the frontend, though it needs more refactoring of the join logic. However, it's apparent that the filter logic and the join logic are not directly correlated. I've initiated a discussion regarding the time filter in the PR. Let's focus our attention on the join topic, the technical design, and any technical debt present in this PR.

> Unlike BigNumber with Time Comparison, we won't build multiple queries in the frontend and send them to the API. Instead, we're expanding the original queries with a new field that indicates whether the query should produce a JOINed query between the two time ranges.

As you mentioned, we'll utilize the JOIN clause to generate a query for comparison. Unfortunately, not all databases support the JOIN clause, which is why there's an allows_joins property in BaseEngineSpec. This also explains why we use a Pandas JOIN for Rolling in Advanced Analytics.

The query generated by this PR:

WITH query_a_results AS
  (SELECT deal_size AS deal_size,
          postal_code AS postal_code,
          order_date AS order_date,
          sum(sales) AS "SUM(sales)"
   FROM main.cleaned_sales_data
   WHERE order_date >= '2024-02-28 00:00:00'
     AND order_date < '2024-03-06 00:00:00'
   GROUP BY deal_size,
            postal_code,
            order_date
   ORDER BY "SUM(sales)" DESC
   LIMIT 1000
   OFFSET 0)
SELECT query_a_results.deal_size AS deal_size,
       query_a_results.postal_code AS postal_code,
       query_a_results.order_date AS order_date,
       query_a_results."SUM(sales)" AS "SUM(sales)",
       anon_1."SUM(sales)" AS "prev_SUM(sales)"
FROM
  (SELECT deal_size AS deal_size,
          postal_code AS postal_code,
          order_date AS order_date,
          sum(sales) AS "SUM(sales)"
   FROM main.cleaned_sales_data
   WHERE order_date >= '2024-02-21 00:00:00'
     AND order_date < '2024-02-28 00:00:00'
   GROUP BY deal_size,
            postal_code,
            order_date
   ORDER BY "SUM(sales)" DESC) AS anon_1
JOIN query_a_results ON anon_1.deal_size = query_a_results.deal_size
AND anon_1.postal_code = query_a_results.postal_code
AND anon_1.order_date = query_a_results.order_date;

Upon reviewing the JOIN logic in your code, the current implementation is an inner join. An inner join entails obtaining the intersection of two sets. I believe we might need to discuss what the comparison term signifies here. Regardless, if an option is selected (such as "enable time comparison") and the previous result set (the main query) is reduced by some rows, it could lead to a mistake. The other concern would be performance issues with the join query in some OLAP systems, e.g. ClickHouse, but we won't discuss that here.

Next, I'd like to discuss the technical design and technical debt in this PR. We have two kinds of objects: form_data, representing chart definitions, and query_object, representing queries. You can imagine the query_object as a structured query for a database. You've introduced two new fields in the query_object: instant_time_comparison_range and instant_time_comparison_info. These are specifically used in time comparison, and their functionality overlaps entirely with the time filter in the filters field if we generate the time range in the frontend. So, these fields will become technical debt in the Superset codebase.

However, we should consider a more generic question in this PR: how to generate a joined query (or joined DataFrame) from the query_object, rather than creating a hack solution for a specific and experimental function in an important visualization.

Finally, as a member of this project, I believe this PR lacks well-defined technical design and introduces new technical debt. Therefore, I will reject merging this PR.

@Antonio-RiveroMartnez (Member, Author) commented Mar 7, 2024

Hey @zhaoyongjie, thanks for such a detailed reply 🎉

First of all, let me say that I value your input and totally respect it; we are both part of the project, so we both want what's best for it. As you stated, this is an experimental and very specific use case we are trying to solve, and because of that we're trying to actually use it so we can iterate and change/adjust as much as needed. That's why we thought of leveraging the safety net of a feature flag to avoid impacting such an important visualization. Again, we are experimenting, and we will bring all the proper perks a viz update needs, like stories in Storybook, etc.

I see the point about the extra properties in the QueryObject seeming hack-ish, and the fact that not all databases support JOINs, and I'm totally open to refactoring/changing what's needed for a more robust/flexible approach. In order to do that, and so we can get this moving forward, let me ask you: how would you solve these three key points?

  1. How would you identify when a QueryObject needs an inner join versus a regular query, without using any custom property in the QueryObject itself? We could try sending individual "original" and "shifted" queries in the request, but that would cause problems when combining/associating queries with each other

  2. How would you produce the time shift if only the frontend is in charge of doing so?

  3. How would you generate the joins when needed?

If we work together on defining that, we could probably come to terms on this PR and on the one waiting for review that moves all the shift logic to the frontend only.

Thanks as always.

@eschutho (Member) commented Mar 7, 2024

@zhaoyongjie thank you for your input. For a little more context: at Preset, we are trying to add time comparisons to four different visualizations: bar, line, table, and big number. There was a notable difference in how this feature would work compared to the time filters in the frontend, and we found an important issue with comparing values when the queries were generated on the frontend, which is why we moved them to the backend. We appreciate all your help here, and our intention, like yours, is to approach this thoughtfully and with proper technical design. As you can see, @Antonio-RiveroMartnez has explained the use case for fetching the data in one single database call. If you could help us come to an agreement on how to approach this, we'd be very grateful. Thanks!

@michael-s-molina added and removed the review-after-release label Mar 7, 2024
- Fix table rendering when switching from Agg mode to Raw mode and vice versa
@zhaoyongjie (Member) commented Mar 7, 2024

@eschutho and @Antonio-RiveroMartnez, thank you for your replies. I'm happy to discuss this PR.

> How would you identify when a QueryObject needs an inner join versus a regular query, without using any custom property in the QueryObject itself? We could try sending individual "original" and "shifted" queries in the request, but that would cause problems when combining/associating queries with each other

We should reconsider this question as "how to describe a query_object with multiple actual queries", rather than creating a relation between different query_objects. Therefore, adding a new array field in the query_object is a straightforward way to achieve this. A query_object with multiple queries should look like this:

{
    metrics: [some-metrics],
    columns: [some-columns],
    ....
    ....
    queries: [{
        metrics: [],
        columns: [],
        .....
    }]
}

> How would you produce the time shift if only the frontend is in charge of doing so?

We shouldn't consider a filter as a shifted filter or an original filter; a time filter is simply a time filter. That PR already implements how to obtain the shifted filter values.

> How would you generate the joins when needed?

As mentioned before, we can't rely on database joins, but we can use Pandas DataFrame-side joins. Currently, there are a couple of post-processing steps in the query_object indicating how to transform the DataFrame after obtaining the database result. You would need to implement the join operator, after which the joined result set can be obtained. The final query_object should look like this:

{
    metrics: [metric1],
    columns: [col1, col2],
    filters: [
        {col: "date", op: "TEMPORAL_RANGE", val: "Last Day"},
        ...
        ...
    ],
    ....
    ....
    queries: [{
        metrics: [metric1],
        columns: [col1, col2],
        filters: [
            //  generate a shifted time filter in the frontend, e.g. ComparisonTimeRangeType.Month
            {col: "date", op: "TEMPORAL_RANGE", val: "DATEADD(DATEADD(DATETIME('today'), -1, year), -1, MONTH) : DATEADD(DATETIME('today'), -1, MONTH)"},
            ...
            ...
        ],
    }],
    post_processing: [{
        operation: "join",
        options: {
            how: "left",
            on: [],
            .....
        }
    }],
}

I believe that time comparison is just one specific use case for this pattern. After implementing multiple queries in the query_object and a join operator, the query_object should support numerous other use cases.
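For illustration, a minimal sketch of what such a post-processing join operator could look like, assuming it follows the DataFrame-in/DataFrame-out style of the existing post-processing steps; the name and signature here are hypothetical:

```python
import pandas as pd

def join(
    df: pd.DataFrame,
    other: pd.DataFrame,
    how: str = "left",
    on: list | None = None,
    suffix: str = "__prev",
) -> pd.DataFrame:
    # Join the main query's DataFrame with a secondary (e.g. time-shifted)
    # query's DataFrame; shared metric columns get the suffix.
    return df.merge(other, how=how, on=on, suffixes=("", suffix))
```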

@Antonio-RiveroMartnez (Member, Author) commented Mar 7, 2024

Awesome! Yes, I like the array approach, and using post-processing instead of a custom property to tell apart when you need to join or not.

Now, what I'm not seeing is this: post-processing implies that the two queries are executed independently and we would be joining the result sets, correct? I see why this is more bulletproof (not all databases support JOIN), but at the same time we would need the join to happen at the "pre"-processing stage of the queries to accomplish a real comparison, correct? Otherwise you would be joining incomplete data frames at some point, as I explained with server pagination and a low row limit.

@zhaoyongjie (Member) commented Mar 7, 2024

> Now, what I'm not seeing is this: post-processing implies that the two queries are executed independently and we would be joining the result sets, correct? I see why this is more bulletproof (not all databases support JOIN), but at the same time we would need the join to happen at the "pre"-processing stage of the queries to accomplish a real comparison, correct?

In this design, I think we shouldn't care about how many "windows" we compare; the outer join operator should apply recursively over the main query and the queries[] field. The data flow should be (see the sketch after the join chain below):

  1. get the main query DataFrame
  2. get the queries[] DataFrames
  3. apply the main query's post_processing, which looks like:

a. join[main dataset, queries[0]]
b. join[join[main dataset, queries[0]], queries[1]]
c.....
d....
n. join[join[main dataset, queries[0]],.......... queries[n]]
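A minimal sketch of that recursive application as a fold over the queries[] results; the names here are illustrative assumptions:

```python
from functools import reduce
import pandas as pd

# Assumed inputs: main_df is the main query's DataFrame, query_dfs holds the
# DataFrames for queries[0..n], and group_cols are the shared grouping columns.
def join_all(main_df: pd.DataFrame, query_dfs: list, group_cols: list) -> pd.DataFrame:
    return reduce(
        lambda acc, shifted: acc.merge(shifted, how="outer", on=group_cols),
        query_dfs,
        main_df,
    )
```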

> Otherwise you would be joining incomplete data frames at some point, as I explained with server pagination and a low row limit.

Actually, I don't fully understand the relation between the LIMIT clause and comparison. In my mindset, I don't care about the limit or offset keywords in a specific SQL query; a query result is simply a set of rows. If you want to compare the top K rows in one dataset, the other dataset should also be limited to its top K rows. If they are semantically inconsistent, they cannot be compared; this is why Superset always generates SQL with a default ORDER BY clause. In other words, only two fields should differ between the main query_object and the compared query_object (queries): post_processing and filters.

@Antonio-RiveroMartnez (Member, Author) commented Mar 7, 2024

Great, but let's say query A is limited to K rows and sorted, and the inner queries are also sorted and limited to K rows; you cannot guarantee that a row in the DataFrame for query A will be present in the DataFrame of query B, correct? That makes the comparison inconsistent/invalid, because the row "could" have been present "if the limit were not K but, say, K+N".

Here's a quick example of the use case where post-processing and DataFrames don't seem to work, and hence my questions/doubts on how to approach this:

Let's say I want to run a query to get the top 2 sellers on my company, along with how much did they sell and sorted in desc order by the amount sold. All this using the time ranges of the time_comparison already mentioned (Inherited, etc).

QueryA would return: 1. Jane Doe 1 MM, 2. Rick Martin 500k
QueryB would return: 1. Jane Doe 750k, 2. Mario K 300k

Let's say Rick Martin was my 3rd best seller in QueryB, so he's not part of the DataFrame for QueryB. When creating the comparison metrics, if we apply post-processing joins, Rick Martin would be compared as 500k vs 0 (because he's not in the DataFrame), correct?

The way I see it, that's an invalid comparison, because we should have taken the metrics for Jane Doe and Rick Martin, since they were the ones returned in QueryA, no?

So, that's why post-processing serves well when the "comparison" means "give me the data from time frames A and B and let's compare them as they are", but not when the comparison needs to "take data from time frame A and compare that exact data vs time frame B"? And for that pre-processing would serve better?

@betodealmeida (Member) commented Mar 7, 2024

> As mentioned before, we can't rely on database joins, but we can use Pandas DataFrame-side joins.

@zhaoyongjie I disagree here. IMHO, the most important thing for Superset is to show the correct data. Anything else is acceptable — it can be slow, it can be buggy — but showing incorrect values is something we should never allow.

When you're looking at a metric with a time comparison, it's important to look at the database at a snapshot in time. If you run two separate queries and join in Pandas, there's a chance that the data is updated between the queries, and the results will be wrong. The only way to guarantee consistency is to run the query with a self join (which should be a LEFT OUTER JOIN instead of an INNER JOIN, as you pointed out).

This is why I suggested to Antonio that we have a way of telling the backend that the query is a time comparison, so the backend can do the self join. (For databases that don't support joins we can take the same approach we take for top groups: run 2 queries instead of 1. But this should be the exception, not the norm.)

Edit: and as @Antonio-RiveroMartnez mentioned, there's another reason why we need to do the join in the database: so we can compare values when we have limits and pagination.
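A hedged sketch of that dispatch, assuming the allows_joins flag on the engine spec mentioned earlier in this thread; the execution callables are hypothetical stand-ins for the real query paths:

```python
import pandas as pd
from typing import Callable, Tuple

def run_time_comparison(
    query_object: dict,
    allows_joins: bool,
    run_joined: Callable[[dict], pd.DataFrame],
    run_isolated: Callable[[dict], Tuple[pd.DataFrame, pd.DataFrame]],
) -> pd.DataFrame:
    if allows_joins:
        # One self-JOINed statement: both time ranges read the same
        # database snapshot.
        return run_joined(query_object)
    # Fallback for engines without JOIN support: two isolated queries,
    # joined in pandas (the data may change between the two reads).
    main_df, shifted_df = run_isolated(query_object)
    return main_df.merge(
        shifted_df,
        how="left",
        on=query_object["columns"],
        suffixes=("", "__prev"),
    )
```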

@michael-s-molina (Member) commented Mar 7, 2024

Hi all. I was just going through the review-after-release PRs, and this one and #27386 caught my attention given the long discussion threads and the implications of the changes. It seems that the features we're trying to achieve are not contained in an isolated plugin but require fundamental changes to how we generate queries to support the desired comparisons. If that's the case, I believe these discussions should be happening in a SIP so we all understand the proposed changes and can discuss the implications. Right now, it's hard to understand the overall goal of the project. We started with a specific plugin and are now expanding the feature to other parts of the application. This does not look like the type of change that can be achieved with a feature flag, because even if the new plugin is isolated, it requires fundamental changes to our query API/concepts, as demonstrated by the many comments in this thread.

@Antonio-RiveroMartnez @kgabryje @zhaoyongjie @eschutho @betodealmeida @john-bodley @villebro @rusackas

@zhaoyongjie (Member) commented Mar 7, 2024

> the most important thing for Superset is to show the correct data. Anything else is acceptable — it can be slow, it can be buggy — but showing incorrect values is something we should never allow.

@betodealmeida, this viewpoint is completely invalid. The correctness of data is not determined by whether SQL or Pandas is used, but rather by which pattern you use for creating a query and result set.

BTW, the underlying layers of a database still translate the SQL into in-memory computation, which sounds a lot like a pandas calculation.

This design pattern (always providing a path for databases that don't support the JOIN clause) is already present in the current design: the allows_joins property in BaseEngineSpec, getting top series, and Rolling in Advanced Analytics.

@betodealmeida (Member) commented:

> @betodealmeida, this viewpoint is completely invalid. The correctness of data is not determined by whether SQL or Pandas is used, but rather by which pattern you use for creating a query and result set.
>
> BTW, the underlying layers of a database still translate the SQL into in-memory computation, which sounds a lot like a pandas calculation.

You're missing my point.

The self join runs as a single transaction. The data being compared is exactly the same in both sides of the join, because it runs in a snapshot of the database.

If you run two queries on a table that's being updated with high frequency, the data might change between the first and the second queries. The two results of the comparison come from different snapshots of the data, and the comparison might be invalid because of that.

> This design pattern is already present in the current design: the allows_joins property in BaseEngineSpec, getting top series, and Rolling in Advanced Analytics.

I'm familiar with allows_joins, and that's exactly what I'm proposing here: do the self join if the database supports it, and if not, fall back to a Pandas join.

@zhaoyongjie (Member) commented Mar 7, 2024

> I'm familiar with allows_joins, and that's exactly what I'm proposing here: do the self join if the database supports it, and if not, fall back to a Pandas join.

@betodealmeida Thanks for the explanation. Currently, we don't have a transaction manager, so if we want to join in Pandas, a transaction context (at least keeping the multiple queries in the same transaction) should be implemented as well.

@betodealmeida (Member) commented:

> @betodealmeida Thanks for the explanation. Currently, we don't have a transaction manager, so if we want to join in Pandas, a transaction context (at least keeping the multiple queries in the same transaction) should be implemented as well.

Right, if we had a transaction manager, and we could run the two queries in the same transaction, and the isolation level of the transaction was either "repeatable read" or "serializable", then we'd be able to guarantee data correctness.
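For context, a minimal sketch of running both queries inside one appropriately isolated transaction with SQLAlchemy; the connection URL, table name, and date ranges are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder URL and table; REPEATABLE READ (or SERIALIZABLE) makes both
# reads see the same snapshot of the data.
engine = create_engine("postgresql://localhost/examples")
sql = text("SELECT * FROM sales WHERE order_date >= :start AND order_date < :end")

with engine.connect().execution_options(isolation_level="REPEATABLE READ") as conn:
    with conn.begin():  # both queries run inside the same transaction
        main_df = pd.read_sql(sql, conn, params={"start": "2024-02-28", "end": "2024-03-06"})
        shifted_df = pd.read_sql(sql, conn, params={"start": "2024-02-21", "end": "2024-02-28"})
```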

@zhaoyongjie (Member) commented:

> Right, if we had a transaction manager, and we could run the two queries in the same transaction, and the isolation level of the transaction was either "repeatable read" or "serializable", then we'd be able to guarantee data correctness.

The same issue should be fixed in the Rolling and Top series codebases.

@rusackas (Member) commented Mar 7, 2024

@zhaoyongjie I think @michael-s-molina's suggestion above to open a SIP for the approach proposed in this PR is reasonable. Do you think that you would also be willing to open a SIP advocating for your approach of moving the logic to the frontend in #27386? It seems that both approaches should be given equal consideration and voting opportunity per the Apache process.

I do agree with @betodealmeida that data correctness is the most important virtue in Superset, so that comes first. If both approaches can be guaranteed correct/valid, then it comes down to which has the fewest edge cases (e.g. complexities around pagination in a table chart) and general philosophy (what logic should be the frontend's responsibility vs. the backend's in a visualization component).

It might be a little unusual to have two SIPs that are mutually exclusive, but it might be a good chance to reset and let each approach make a case for itself before deciding by vote.

@zhaoyongjie (Member) commented Mar 7, 2024

> QueryA would return: 1. Jane Doe 1 MM, 2. Rick Martin 500k
> QueryB would return: 1. Jane Doe 750k, 2. Mario K 300k
>
> Let's say Rick Martin was my 3rd best seller in QueryB, so he's not part of the DataFrame for QueryB. When creating the comparison metrics, if we apply post-processing joins, Rick Martin would be compared as 500k vs 0 (because he's not in the DataFrame), correct?
>
> The way I see it, that's an invalid comparison, because we should have taken the metrics for Jane Doe and Rick Martin, since they were the ones returned in QueryA, no?
>
> So, that's why post-processing serves well when the "comparison" means "give me the data from time frames A and B and let's compare them as they are", but not when the comparison needs to "take data from time frame A and compare that exact data vs time frame B"? And for that pre-processing would serve better?

@Antonio-RiveroMartnez

Query A
Jane Doe         1 MM
Rick Martin      500k

Query B
Jane Doe         750k
Mario K            300k

If we apply A INNER JOIN B ON name, the result should be:

Result
Jane Doe         1 MM    750K

If we apply A LEFT JOIN B ON name, the result should be:

Jane Doe         1 MM      750k
Rick Martin      500k      NULL
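A runnable pandas check of those two join results, using the data from the example above:

```python
import pandas as pd

a = pd.DataFrame({"name": ["Jane Doe", "Rick Martin"], "sales": [1_000_000, 500_000]})
b = pd.DataFrame({"name": ["Jane Doe", "Mario K"], "sales": [750_000, 300_000]})

inner = a.merge(b, on="name", how="inner", suffixes=("", "__prev"))
# -> Jane Doe  1000000  750000

left = a.merge(b, on="name", how="left", suffixes=("", "__prev"))
# -> Jane Doe     1000000  750000.0
# -> Rick Martin   500000       NaN
```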

github-actions bot (Contributor) commented:

Ephemeral environment shutdown and build artifacts deleted.

Labels: packages, plugins, review-after-release, size/XXL

7 participants