-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply sample limit to UDFailedRowsExpressionQuery #1986
base: main
Are you sure you want to change the base?
Apply sample limit to UDFailedRowsExpressionQuery #1986
Conversation
This changes the way samples are collected from UserDefinedFailedRowsExpressionQueries. In order to apply the samples limit given in the check configuration, or apply the default samples limit, the failed rows expression needs to fire off an additional SampleQuery. The SampleQuery executes a copy of the query, appending a LIMIT clause. [sodadata#1985](sodadata#1985)
Quality Gate passedKudos, no new issues were introduced! 0 New issues |
@denisdallinga I'm working on a PR to address this issue. Maybe we can team up to get more momentum on this issue? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution! It looks like this works, but I would suggest a few changes to make it consistent with how other check work:
- use
.fetchone()
method of the query class to get the result - set the check metric value
- if there are results and check has samples limit >0 then create a new sample query object with limit
You can check duplicate check query that does exactly that for inspiration, it should be pretty much 100% copy-paste except for the sql itself. https://github.com/sodadata/soda-core/blob/main/soda/core/soda/execution/query/duplicates_query.py#L95-L107
One note - please keep in mind that different data sources use different syntax for limit - TOP/FETCH FIRST, and it's even different syntax (you dont just append it to the end of the query) - so the way to go here is to:
- refactor
get_failed_rows_sql
to take the sql string itself from the data_source class - override that sql generation in each datasource that does not use LIMIT - the ones that need that override have the
LIMIT_KEYWORD
attribute overridden
Lastly - this should be covered by tests as its error-prone with all the variances in how limit is done across data sources, there already is a test for how the limit keyword is applied here, it just needs to be adjusted to include user defined check with expression.
I hope this is not too much!
Github Issue: #1985
This changes the way samples are collected from
UserDefinedFailedRowsExpressionQueries
. In order to apply the samples limit given in the check configuration, or apply the default samples limit, the failed rows expression needs to fire off an additional SampleQuery. The SampleQuery executes a copy of the query, appending a LIMIT clause.