Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parenthesis to check filter #2033

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

abargar
Copy link

@abargar abargar commented Mar 11, 2024

What this does

Fixes the following bug: #2021

The bug

OR conditions in a check-level filter are not properly encapsulated, which can cause one of the OR predicates to override the other conditions in a SQL query. See the attached issue for examples: #2021

It affects:

  • sampling SQL queries for checks
  • the DuplicateQuery used for both sampling and metric calculation in duplicate_count checks.

The fix

I wrap parentheses around the check-level filter, e.g. a OR b becomes (a OR b).

I focused on the following checks based on the documentation:
Screenshot 2024-03-11 at 12 06 27

  • [-] reference check
    This check does not permit check-specific filters, so it is not affected by the bug.
  • missing_count check
  • missing_percent check
  • invalid_count check
  • invalid_percent check
  • duplicate_count check
  • duplicate_percent check

Testing

Test - Initial Behavior

check failure count sample count sample ids
missing_count 1 7 ['ID1', 'ID2', 'ID3', 'ID4', 'ID5', 'ID6', 'ID7']
missing_percent 1 7 ['ID1', 'ID2', 'ID3', 'ID4', 'ID5', 'ID6', 'ID7']
invalid_count 2 6 ['ID3', 'ID4', 'ID7', 'ID8', 'ID9', None]
invalid_percent 2 6 ['ID3', 'ID4', 'ID7', 'ID8', 'ID9', None]
duplicate_count 2 3 ['ID1', 'ID2', None]
duplicate_percent 2 3 ['ID1', 'ID2', None]

Test - Final Behavior

check failure count sample count sample ids
missing_count 1 1 ['ID7']
missing_percent 1 1 ['ID7']
invalid_count 2 2 ['ID3', 'ID4']
invalid_percent 2 2 ['ID3', 'ID4']
duplicate_count 3 3 ['ID1', 'ID2', None]
duplicate_percent 3 3 ['ID1', 'ID2', None]

@CLAassistant
Copy link

CLAassistant commented Mar 11, 2024

CLA assistant check
All committers have signed the CLA.

@abargar
Copy link
Author

abargar commented Mar 12, 2024

Note that the quotations-around-column-name formatting caused inconsistent behavior with the SparkDF DataSource. This was easily fixable in this case, but might be good to note in the following doc:
Screenshot 2024-03-12 at 15 44 44

The test formatting was fixed in the commit dda387f, so this is ready for a review and rerunning the data source tests.

@abargar abargar marked this pull request as ready for review March 12, 2024 14:48
@abargar
Copy link
Author

abargar commented Mar 28, 2024

Hey! Just checking in. Anything I need to do to help see this PR through? :)

@m1n0
Copy link
Contributor

m1n0 commented Apr 23, 2024

hi @abargar , isn't the same issue present with the metric filter as well here?

Copy link

Quality Gate Passed Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants