feat(plugins): Time Comparison in Table Chart #27355
Conversation
@Antonio-RiveroMartnez please hold on this change, I'll push a datediff function in datetime_parser this weekend; we should take care of the table chart refactoring.
Force-pushed from 032e2a4 to de485a8
Hey @zhaoyongjie yes, as mentioned in the description of the PR, there are some PRs that could either be a follow-up or be updated before this one. Let me know when your PR is ready and I'll adjust this one if it's still open, or in a later PR. Thanks so much for the help with that 🎉 PS: We're not refactoring the table chart, we're adding features behind the experimental feature flag.
Codecov Report. Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #27355 +/- ##
==========================================
- Coverage 69.71% 69.52% -0.19%
==========================================
Files 1909 1909
Lines 74604 74813 +209
Branches 8323 8372 +49
==========================================
+ Hits 52008 52014 +6
- Misses 20544 20726 +182
- Partials 2052 2073 +21
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Force-pushed from 53200c5 to d97220a
Thanks for the mention and the warm welcome to discuss the details of this PR. Since we have tens of thousands of table charts and external users calling the API, we have to keep the query_object definition clear and reusable. As in our previous discussion, the time comparison might allow a more flexible approach.
Force-pushed from 6f610a8 to 775d05e
/testenv up FEATURE_CHART_PLUGINS_EXPERIMENTAL=true
@Antonio-RiveroMartnez Ephemeral environment spinning up at http://54.186.218.22:8080. Credentials are
Force-pushed from 67e9dda to a717e30
/testenv up FEATURE_CHART_PLUGINS_EXPERIMENTAL=true
@Antonio-RiveroMartnez Ephemeral environment spinning up at http://18.236.153.199:8080. Credentials are
I opened a PR to build the time comparison time range in the frontend; please merge that PR first, and then let's review this PR.
new_filters = temporal_filters + non_temporal_filters
query_obj_clone["filter"] = new_filters
if instant_time_comparison_range != InstantTimeComparison.CUSTOM:
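The snippet under review rebuilds the filter list from temporal and non-temporal parts. A minimal self-contained sketch of that pattern follows; `FilterSpec`, `is_temporal`, and `rebuild_filters` are illustrative names for this example, not Superset's actual API:

```python
from dataclasses import dataclass

@dataclass
class FilterSpec:
    col: str
    op: str
    val: object
    is_temporal: bool = False

def rebuild_filters(filters, shifted_time_range):
    """Replace temporal filters with the shifted range, keep the rest unchanged."""
    temporal_filters = [
        FilterSpec(f.col, "TEMPORAL_RANGE", shifted_time_range, True)
        for f in filters if f.is_temporal
    ]
    non_temporal_filters = [f for f in filters if not f.is_temporal]
    # Mirrors `new_filters = temporal_filters + non_temporal_filters` above
    return temporal_filters + non_temporal_filters

filters = [
    FilterSpec("order_date", "TEMPORAL_RANGE", "Last week", True),
    FilterSpec("deal_size", "==", "Large"),
]
new_filters = rebuild_filters(filters, "previous week")
```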
Bringing the convo from #27386 here. Things like this, @zhaoyongjie, are what I mean when I say the backend needs to build the time ranges. Right now, if you delete that code in your PR, we still need to process something similar here, which would put it back after your deletion. And that's the biggest use case for why building only in the frontend, as your PR is doing, doesn't seem like a good fit for all cases. About why/what/when the JOIN is needed etc., you can read this PR's description and we can follow up on it. Thanks
@Antonio-RiveroMartnez Thanks for the mention. I think this line might handle this case, https://github.com/apache/superset/pull/27386/files#diff-19624c67b284caff0365272493407106402fc540e32ec2851927d61fc70a8d9fR33, but I don't fully understand all the code in the current function. Basically, we should build/transform the QueryObject in the frontend and send the corresponding QueryObject to the backend; e.g., we don't need to care about the value of InstantTimeComparison in the backend, because your goal is to replace (InstantTimeComparison.CUSTOM) or change (other values of InstantTimeComparison) some filter values based on the previous QueryObject.
Opening this convo here too in case we want to talk more with a broader scope, regarding why the backend needs to process time_ranges and not only the frontend.
- Handle column names with spaces
- Add column separators to better identify comparison groups
- Align text of comparison metrics to the left
- Add comments to help navigate the new JOIN query operation for time_comparison
- When using time comparison, columns cannot be rearranged
Force-pushed from 60491b9 to 39bb074
@Antonio-RiveroMartnez @eschutho @yousoph Thank you for addressing this issue. I've tested the PR and revisited your code. Upon further examination, I can confirm that while I've incorporated your joined logic, the
As you mentioned, we'll utilize the JOIN clause to generate a query for comparison. Unfortunately, not all databases support the JOIN clause, which is why there's an allows_joins property in BaseEngineSpec. The query generated by this PR:
WITH query_a_results AS
(SELECT deal_size AS deal_size,
postal_code AS postal_code,
order_date AS order_date,
sum(sales) AS "SUM(sales)"
FROM main.cleaned_sales_data
WHERE order_date >= '2024-02-28 00:00:00'
AND order_date < '2024-03-06 00:00:00'
GROUP BY deal_size,
postal_code,
order_date
ORDER BY "SUM(sales)" DESC
LIMIT 1000
OFFSET 0)
SELECT query_a_results.deal_size AS deal_size,
query_a_results.postal_code AS postal_code,
query_a_results.order_date AS order_date,
query_a_results."SUM(sales)" AS "SUM(sales)",
anon_1."SUM(sales)" AS "prev_SUM(sales)"
FROM
(SELECT deal_size AS deal_size,
postal_code AS postal_code,
order_date AS order_date,
sum(sales) AS "SUM(sales)"
FROM main.cleaned_sales_data
WHERE order_date >= '2024-02-21 00:00:00'
AND order_date < '2024-02-28 00:00:00'
GROUP BY deal_size,
postal_code,
order_date
ORDER BY "SUM(sales)" DESC) AS anon_1
JOIN query_a_results ON anon_1.deal_size = query_a_results.deal_size
AND anon_1.postal_code = query_a_results.postal_code
AND anon_1.order_date = query_a_results.order_date;

Upon reviewing the JOIN logic in your code, the current implementation is an inner join. An inner join entails obtaining the intersection of two sets. I believe we might need to discuss what the comparison term signifies here. Regardless, if an option is selected, such as "enable time comparison", and the previous result set (main query) is reduced by some rows, it could lead to a mistake. The other concern would be performance issues in some OLAP systems for the join query, e.g. ClickHouse, but we won't discuss this here.

Next, I'd like to discuss the technical design and technical debt in this PR. We have two kinds of objects: form_data, representing chart definitions, and query_object, representing queries. You can imagine the query_object as a structured query for a database. You've introduced two new fields in the query_object:

However, we should consider a more generic question about this PR: how to generate a joined query (or joined DataFrame) in the query_object, rather than creating a hack solution for a specific and experimental function in an important visualization.

Finally, as a member of this project, I believe this PR lacks a well-defined technical design and introduces new technical debt. Therefore, I will reject merging this PR.
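To see the row-loss concern in isolation, here is a scaled-down, runnable version of that query shape against an in-memory SQLite database. The table name echoes the PR's example, but all values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cleaned_sales_data (deal_size TEXT, order_date TEXT, sales REAL)"
)
conn.executemany(
    "INSERT INTO cleaned_sales_data VALUES (?, ?, ?)",
    [
        ("Large", "2024-03-01", 100.0),  # current week
        ("Large", "2024-02-22", 80.0),   # previous week
        ("Small", "2024-03-02", 10.0),   # current week only -> dropped by INNER JOIN
    ],
)
rows = conn.execute("""
    WITH query_a_results AS (
        SELECT deal_size, SUM(sales) AS sum_sales
        FROM cleaned_sales_data
        WHERE order_date >= '2024-02-28' AND order_date < '2024-03-06'
        GROUP BY deal_size
    )
    SELECT q.deal_size, q.sum_sales, p.sum_sales AS prev_sum_sales
    FROM (
        SELECT deal_size, SUM(sales) AS sum_sales
        FROM cleaned_sales_data
        WHERE order_date >= '2024-02-21' AND order_date < '2024-02-28'
        GROUP BY deal_size
    ) AS p
    JOIN query_a_results AS q ON p.deal_size = q.deal_size
""").fetchall()
print(rows)  # [('Large', 100.0, 80.0)]
```

Note that the "Small" group, present only in the current period, vanishes from the result, which is exactly the reduction of the main result set discussed above.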
Hey @zhaoyongjie thanks for such a detailed reply 🎉 First of all, let me tell you that I do value your input and totally respect it; we are both part of the project, so we both want what's best for it. As you stated, this is an experimental and very specific use case we are trying to solve here, and because of that we're trying to actually make use of it so we can properly iterate and change/adjust as much as needed. That's why we thought of leveraging the safety net of a Feature Flag to avoid impacting such an important visualization, since, again, we are experimenting, and we will bring all the proper perks a viz update needs, like stories in Storybook, etc. I see the point of the extra properties in
Probably if we work together on defining that, we could come to terms on this PR and the one waiting for review that moves all the shift logic to the frontend only. Thanks as always.
@zhaoyongjie thank you for your input. For a little more context: at Preset, we are trying to add comparisons by time to four different visualizations: bar, line, table, and big number. There was a notable difference in how this feature would work compared to the time filters in the frontend, and we found an important issue with comparing values when the queries were generated on the frontend, which is why we moved them to the backend. We appreciate all your help here, and our intention, like yours, is to approach this thoughtfully and with proper technical design. As you can see, @Antonio-RiveroMartnez has explained the use case for fetching the data in one single database call. If you could help us come to an agreement on how to approach this, we'd be very grateful. Thanks!
- Fix table rendering when switching from Agg mode to Raw mode and vice versa
@eschutho and @Antonio-RiveroMartnez, thank you for your replies. I'm happy to discuss this PR.
We should reconsider this question as "how to describe a
We shouldn't consider a filter as a shifted filter or an original filter; a time filter is simply a time filter. The PR has implemented how to obtain the shifted filter values.
As mentioned before, we can't use database joins, but we should use Pandas DataFrame-side joins. Currently, there are a couple of post-processing steps in the query_object, indicating how to transform the DataFrame after obtaining the database result. You need to implement the
I believe that time comparison is just one specific use case for this pattern. After implementing multiple queries in the query_object and a join operator, the query_object should support numerous other use cases.
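A rough sketch of what such a DataFrame-side join operator could look like. `join_time_offset` and the `prev_` column prefix are assumptions for illustration, not the actual Superset post-processing API:

```python
import pandas as pd

def join_time_offset(main_df, shifted_df, join_keys, metric_cols, how="left"):
    """Merge the shifted-period metrics onto the main result,
    prefixing the shifted metric columns with 'prev_'."""
    shifted = shifted_df.rename(columns={m: f"prev_{m}" for m in metric_cols})
    return main_df.merge(shifted, on=join_keys, how=how)

# Two independently fetched result sets (made-up values)
main = pd.DataFrame({"deal_size": ["Large", "Small"], "SUM(sales)": [100.0, 10.0]})
prev = pd.DataFrame({"deal_size": ["Large"], "SUM(sales)": [80.0]})

result = join_time_offset(main, prev, ["deal_keys" and None or "deal_size"], ["SUM(sales)"]) if False else join_time_offset(main, prev, ["deal_size"], ["SUM(sales)"])
```

With `how="left"`, rows that exist only in the main period survive with a missing (`NaN`) previous value instead of being dropped, which is the behavior-level choice debated later in this thread.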
Awesome! Yes, I like the array approach, and not sending the custom property to tell apart when you need to join or not using the post-processing. Now, what I'm not seeing is: post-processing implies that the two queries are executed independently and we would be joining the result sets, correct? I see why this is more bulletproof (not all databases support JOIN), but at the same time we would need the join to happen at "pre" processing of the queries to accomplish a real comparison, correct? Otherwise you would be joining incomplete data frames at some point, like I was explaining with the
In this design, I think we shouldn't care about how many "windows" we compare; the outer
a. join[main dataset, queries[0]]
Actually, I don't fully understand the relation between
Great, but let's say query A is limited to K rows and sorted, and the inner queries are also sorted and limited to K rows; you cannot guarantee that a row in the DataFrame for query A will be present in the DataFrame of query B, correct? That makes the comparison inconsistent/invalid, because the row "could" be present "if the limit was not K but, say, K+N". A quick example of the use case where post-processing and DataFrames don't seem to work, and hence my question/doubts on how to approach it: let's say I want to run a query to get the top 2 sellers in my company, along with how much they sold, sorted in descending order by amount sold, all using the time ranges of the
Let's say that Rick Martin was my 3rd best seller in QueryB, so that's why he's not part of the DataFrame for QueryB. When creating the comparison metrics, if we apply post-processing joins, the way I see it, that's an invalid comparison, because we should have taken the metrics for So, that's why post-processing serves well when the "comparison" means
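To make the concern concrete, here is a small pandas sketch with made-up numbers: each period's query is sorted and limited to 2 rows independently, so the shifted frame misses Rick Martin even though he did sell in the previous period:

```python
import pandas as pd

# Made-up sales per seller for the two periods
sales_now = pd.DataFrame(
    {"seller": ["Ann", "Rick Martin", "Bo"], "sold": [900, 850, 100]}
)
sales_prev = pd.DataFrame(
    {"seller": ["Ann", "Bo", "Rick Martin"], "sold": [950, 940, 930]}
)

# Each query applies its own ORDER BY ... LIMIT 2
top2_now = sales_now.nlargest(2, "sold")    # Ann, Rick Martin
top2_prev = sales_prev.nlargest(2, "sold")  # Ann, Bo -- Rick Martin was 3rd

prev = top2_prev.rename(columns={"sold": "prev_sold"})
comparison = top2_now.merge(prev, on="seller", how="left")
# Rick Martin's prev_sold is NaN even though he sold 930 in the previous
# period, because the limit was applied to each query in isolation.
```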
@zhaoyongjie I disagree here. IMHO, the most important thing for Superset is to show correct data. Anything else is acceptable: it can be slow, it can be buggy, but showing incorrect values is something we should never allow. When you're looking at a metric with a time comparison, it's important to look at the database at a snapshot in time. If you run two separate queries and join in Pandas, there's a chance that the data is updated between the queries, and the results will be wrong. The only way to guarantee consistency is to run the query with a self join (which should be a
This is why I suggested to Antonio to have a way of telling the backend that the query is a time comparison, so the backend can do the self join. (For databases that don't support joins we can take the same approach we do for top groups: we run 2 queries instead of 1. But this should be the exception, not the norm.) Edit: and as @Antonio-RiveroMartnez mentioned, there's another reason why we need to do the join on the database: so we can compare values when we have limits and pagination.
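A toy Python simulation of that race, with a plain list standing in for a table receiving concurrent writes (not Superset code):

```python
# Rows: (deal_size, order_date, sales)
rows = [("Large", "2024-03-01", 100.0), ("Large", "2024-02-22", 80.0)]

def sum_sales(data, start, end):
    """Sum sales whose order_date falls in [start, end)."""
    return sum(v for _, d, v in data if start <= d < end)

# Query 1: current period
current = sum_sales(rows, "2024-02-28", "2024-03-06")   # 100.0

# A concurrent backfill lands between the two independent queries...
rows.append(("Large", "2024-02-23", 50.0))

# Query 2: previous period now sees the new row
previous = sum_sales(rows, "2024-02-21", "2024-02-28")  # 130.0, not the 80.0
# that a single self-joined query (one snapshot) would have compared against
```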
Hi all. I was just going through the
@Antonio-RiveroMartnez @kgabryje @zhaoyongjie @eschutho @betodealmeida @john-bodley @villebro @rusackas
@betodealmeida, this viewpoint is completely invalid. The correctness of data is not determined by whether SQL or Pandas is used, but rather by which pattern you use for creating a query and a result set. BTW, the underlying layers of a database still translate the SQL into in-memory computing, which sounds like a pandas calculation. This design pattern (providing a path for databases that don't support the JOIN clause) is already present in the current design: the allows_joins property in BaseEngineSpec and get top series, and Rolling in AA.
You're missing my point. The self join runs as a single transaction. The data being compared is exactly the same in both sides of the join, because it runs in a snapshot of the database. If you run two queries on a table that's being updated with high frequency, the data might change between the first and the second queries. The two results of the comparison come from different snapshots of the data, and the comparison might be invalid because of that.
I'm familiar with
@betodealmeida Thanks for the explanation. Currently, we don't have a transaction manager, so if we want to join in Pandas, the transaction context (at least keeping the multiple queries in the same transaction) should be implemented as well.
Right, if we had a transaction manager, and we could run the two queries in the same transaction, and the isolation level of the transaction was either "repeatable read" or "serializable", then we'd be able to guarantee data correctness. |
The same issue should be fixed in the Rolling and Top series codebases.
@zhaoyongjie I think that in @michael-s-molina's comment above, his suggestion to open a SIP for the approach proposed in this PR is reasonable. Do you think you would also be willing to open a SIP advocating for your approach of moving the logic to the frontend in #27386? It seems that both approaches should be given equal consideration and a voting opportunity under the Apache process. I do agree with @betodealmeida that data correctness is the most important virtue in Superset, so that comes first. If both approaches can for sure always be correct/valid, then it comes down to which will have the fewest edge cases (i.e. complexities around pagination in a table chart) and general philosophy (what logic should be the frontend's responsibility vs. the backend's in a visualization component). It might be a little unusual to have two SIPs that are mutually exclusive, but it might be a good chance to reset and let each approach make a case for itself before deciding by vote.
If we apply A INNER JOIN B ON name, the result should be
If we apply A LEFT JOIN B ON name, the result should be
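A minimal pandas illustration of the two join types on `name`, using made-up frames A and B:

```python
import pandas as pd

A = pd.DataFrame({"name": ["alice", "bob"], "sales": [10, 20]})
B = pd.DataFrame({"name": ["alice"], "prev_sales": [8]})

# INNER JOIN: intersection only -- 'bob' disappears from the comparison
inner = A.merge(B, on="name", how="inner")

# LEFT JOIN: every row of A survives; 'bob' gets NaN for prev_sales
left = A.merge(B, on="name", how="left")
```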
Ephemeral environment shutdown and build artifacts deleted. |
SUMMARY
We are introducing the same instant time comparison controls that the BigNumber with Time Comparison plugin uses into the Table Chart. This will allow users to quickly compare numeric data in tables against fixed time ranges or a custom time range.
As with the previous work on the BigNumber with Time Comparison chart, these changes are behind the experimental Feature Flag and will have no effect on current charts if the FF is OFF or if the control is not activated in the panel.
Unlike the BigNumber with Time Comparison, we won't have multiple queries built in the frontend and sent to the API; instead, we're expanding the original query to send a new field that indicates whether or not the query should produce a JOINed query between the ranges.
Any new viz type that wants to use these new controls must use the single-query approach; only BigNumber with Time Comparison should use the multiple-queries approach until a follow-up PR updates it.
The main benefit of using a single query rather than multiple is that we can create the aforementioned JOIN statements to produce correct query results; with the JOINs we can address comparison queries with Server Pagination, for example, which is not possible when running queries in isolation.
This PR will be followed by other non-blocking adjustments that are still in the works:
- A DATEDIFF function to build the comparison time range
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
New metrics:
New Metrics with Server Pagination:
Regular Table if FF is OFF or the control is not being used:
TESTING INSTRUCTIONS
Make sure the CHART_PLUGINS_EXPERIMENTAL feature flag is ON.
ADDITIONAL INFORMATION