[Min/Max] Apply filtered row behavior at the row level evaluation #543

rdsharma26 · 2024-03-05T21:55:59Z

Description of changes:

This PR applies the filtered row behavior during row-level evaluation rather than at the analyzer level. Doing so lets us avoid using MinValue/MaxValue as placeholder values for filtered rows.
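To illustrate why a placeholder value is fragile, here is a minimal plain-Scala sketch. It does not reproduce deequ's code: `PlaceholderSketch`, `Row`, `withPlaceholder`, and `withFilterFlag` are hypothetical names, and `Double.MaxValue` stands in for the MinValue/MaxValue placeholder mentioned above.

```scala
object PlaceholderSketch {
  // Hypothetical row: `value` is the column under test,
  // `inScope` is the result of the where-filter for that row.
  final case class Row(value: Double, inScope: Boolean)

  // Analyzer-level style (simplified): a filtered row is overwritten with a
  // sentinel so that it can never win the minimum.
  def withPlaceholder(rows: Seq[Row]): Seq[Double] =
    rows.map(r => if (r.inScope) r.value else Double.MaxValue)

  // Row-level style (simplified): the filter decision travels with the value,
  // so no sentinel is needed. None marks a filtered row.
  def withFilterFlag(rows: Seq[Row]): Seq[Option[Double]] =
    rows.map(r => if (r.inScope) Some(r.value) else None)
}
```

With the placeholder, an in-scope row whose real value happens to equal the sentinel looks the same as a filtered row. Keeping the filter decision next to the value removes that ambiguity.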

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


eycho-am · 2024-03-06T15:28:23Z

src/main/scala/com/amazon/deequ/constraints/Constraint.scala

+          case FilteredRowOutcome.TRUE => true
+          case FilteredRowOutcome.NULL => null
+        }
+        case None => null


We had discussed that the default analyzerOptions behavior should be FilteredRowOutcome.TRUE. Should we modify line 933 to be true? (By default, filtered rows are true instead of null.)

I was testing out another scenario that may be problematic.
Given the following dataframe:

+------+----+----+----+----+-----+-----+
|item  |att1|att2|val1|val2|rule4|rule5|
+------+----+----+----+----+-----+-----+
|1     |a   |f   |1   |1   |true |true |
|22    |b   |d   |2   |NULL|true |true |
|333   |a   |NULL|3   |3   |true |true |
|4444  |a   |f   |4   |4   |true |true |
|55555 |b   |NULL|5   |NULL|true |true |
|666666|a   |f   |6   |6   |true |true |
+------+----+----+----+----+-----+-----+

where

val analyzerOptions = Option(AnalyzerOptions(filteredRow = FilteredRowOutcome.TRUE))
val min = new Check(CheckLevel.Error, "rule4")
  .hasMin("val2", _ > 0, None, analyzerOptions)
  .where("val1 < 4")
val max = new Check(CheckLevel.Error, "rule5")
  .hasMax("val2", _ < 4, None, analyzerOptions)
  .where("val1 < 4")

You'll see that rows 1,2,3 should be skipped -> True
Row 5 should be null as val2 is a null value there.
However, with the above method we convert all nulls to true/null - this doesn't distinguish between null values due to being filtered or null values due to null column values.

Thanks @eycho-am for the valuable feedback. The latest PR revision contains a new structure for the column that helps maintain the "source" of a row, that is, whether it is in scope or filtered out. That will help in evaluating the correct outcome for each row.
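As a rough illustration of the idea (not the structure actually used in this PR), the sketch below tags each row with its source so that a null caused by filtering can be told apart from a null column value. All names (`RowSourceSketch`, `Source`, `Tagged`, `FilteredOutcome`, `AsTrue`, `AsNull`, `evaluate`) are hypothetical stand-ins, and `None` models a null outcome.

```scala
object RowSourceSketch {
  // Hypothetical stand-in for the FilteredRowOutcome option discussed in this thread.
  sealed trait FilteredOutcome
  case object AsTrue extends FilteredOutcome
  case object AsNull extends FilteredOutcome

  // Hypothetical "source" of a row: kept by the where-filter or filtered out.
  sealed trait Source
  case object InScope extends Source
  case object Filtered extends Source

  // The source is stored next to the raw column value (None models a null value).
  final case class Tagged(source: Source, value: Option[Int])

  // Row-level outcome; None models a null outcome.
  def evaluate(row: Tagged,
               assertion: Int => Boolean,
               filtered: FilteredOutcome): Option[Boolean] =
    row match {
      // Filtered rows follow the configured outcome, whatever their value is.
      case Tagged(Filtered, _) =>
        filtered match {
          case AsTrue => Some(true)
          case AsNull => None
        }
      // In-scope rows with a value are checked against the assertion.
      case Tagged(InScope, Some(v)) => Some(assertion(v))
      // In-scope rows with a null value stay null.
      case Tagged(InScope, None) => None
    }
}
```

Because the source is carried explicitly, a filtered row with a null value and an in-scope row with a null value can yield different outcomes.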

eycho-am · 2024-03-06T15:29:00Z

src/test/scala/com/amazon/deequ/VerificationSuiteTest.scala

@@ -374,11 +371,11 @@ class VerificationSuiteTest extends WordSpec with Matchers with SparkContextSpec

      // filtered rows 1, 2, 3 (where item > 3)
      val minRowLevel = resultData.select(expectedColumn4).collect().map(r => r.getAs[Any](0))
-      assert(Seq(true, true, true, true, true, true).sameElements(minRowLevel))
+      assert(Seq(null, null, null, true, true, true).sameElements(minRowLevel))


These tests were written with the intention that, without specifying analyzer options, the default behavior would be that filtered rows are true (related to the comment above).

Reverted the change.

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

eycho-am · 2024-03-07T20:36:16Z

src/main/scala/com/amazon/deequ/constraints/Constraint.scala

+              case FilteredRowOutcome.TRUE => true
+              case FilteredRowOutcome.NULL => null
+            }
+          case None => null


nit: similar comment for the below test-case, should the default behavior for filtered rows be true? Or are we making a decision here that they'll be null by default?

Looks good to me otherwise. I think we just need to make a final decision on this point.

We are keeping them as is, unless the analyzer option says otherwise.

Based on offline discussion, updated the logic. Thanks @eycho-am for the feedback.

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.
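The default discussed in this thread can be sketched as follows, assuming the final behavior is that filtered rows are true unless the analyzer options explicitly ask for null. This is an illustration only: `DefaultOutcomeSketch`, `Options`, `FilteredOutcome`, `AsTrue`, `AsNull`, and `filteredRowResult` are hypothetical names, not deequ's API, and `None` in the result models a null outcome.

```scala
object DefaultOutcomeSketch {
  // Hypothetical stand-in for the FilteredRowOutcome option.
  sealed trait FilteredOutcome
  case object AsTrue extends FilteredOutcome
  case object AsNull extends FilteredOutcome

  // Hypothetical stand-in for the analyzer options.
  final case class Options(filteredRow: FilteredOutcome)

  // Outcome assigned to a filtered row. Absent options fall back to true
  // (assumption based on the "Mark filtered rows as true" change).
  def filteredRowResult(options: Option[Options]): Option[Boolean] =
    options.map(_.filteredRow) match {
      case Some(AsTrue) | None => Some(true)
      case Some(AsNull)        => None
    }
}
```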

eycho-am

LGTM


* [Min/Max] Apply filtered row behavior at the row level evaluation

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

* Improved the separation of null rows, based on their source

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

* Mark filtered rows as true

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

* Added null behavior - empty string to match block

Not having it can cause match error.


[Min/Max] Apply filtered row behavior at the row level evaluation

42d2425

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

eycho-am reviewed Mar 6, 2024


eycho-am reviewed Mar 7, 2024


Mark filtered rows as true

1195dcf

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

eycho-am approved these changes Mar 7, 2024


Added null behavior - empty string to match block

724df47

Not having it can cause match error.
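For context, a Scala `match` that has no case for the value it receives throws `scala.MatchError` at runtime. The sketch below shows the general failure mode and the fix of adding the missing case. The `Behavior` values are hypothetical stand-ins for the null-behavior setting named in the commit title, not deequ's definitions.

```scala
import scala.util.Try

object MatchErrorSketch {
  // Hypothetical null-behavior setting with three possible values.
  sealed trait Behavior
  case object Ignore extends Behavior
  case object Fail extends Behavior
  case object EmptyString extends Behavior

  // Missing the EmptyString case. @unchecked silences the compiler's
  // exhaustivity warning so the runtime failure can be demonstrated.
  def incomplete(b: Behavior): String = (b: @unchecked) match {
    case Ignore => "ignore"
    case Fail   => "fail"
  }

  // Every value is handled, so no MatchError is possible.
  def complete(b: Behavior): String = b match {
    case Ignore      => "ignore"
    case Fail        => "fail"
    case EmptyString => "empty string"
  }

  // True when evaluating the incomplete match on `b` throws a MatchError.
  def throwsMatchError(b: Behavior): Boolean =
    Try(incomplete(b)).failed.toOption.exists(_.isInstanceOf[MatchError])
}
```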

rdsharma26 merged commit a6c218c into awslabs:master Mar 8, 2024
1 check passed

rdsharma26 deleted the min-max-row-level-results branch March 8, 2024 18:05
