-
Notifications
You must be signed in to change notification settings - Fork 538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Min/Max] Apply filtered row behavior at the row level evaluation #543
[Min/Max] Apply filtered row behavior at the row level evaluation #543
Conversation
- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.
case FilteredRowOutcome.TRUE => true | ||
case FilteredRowOutcome.NULL => null | ||
} | ||
case None => null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had discussed that the default analyzerOptions
behavior should be FilteredRowOutcome.TRUE
, should we modify 933 to be true? (By default filtered rows are true
instead of null
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was testing out another scenario that may be problematic.
Given the following dataframe:
+------+----+----+----+----+-----+-----+
|item |att1|att2|val1|val2|rule4|rule5|
+------+----+----+----+----+-----+-----+
|1 |a |f |1 |1 |true |true |
|22 |b |d |2 |NULL|true |true |
|333 |a |NULL|3 |3 |true |true |
|4444 |a |f |4 |4 |true |true |
|55555 |b |NULL|5 |NULL|true |true |
|666666|a |f |6 |6 |true |true |
+------+----+----+----+----+-----+-----+
where
val analyzerOptions = Option(AnalyzerOptions(filteredRow = FilteredRowOutcome.TRUE))
val min = new Check(CheckLevel.Error, "rule4")
.hasMin("val2", _ > 0, None, analyzerOptions)
.where("val1 < 4")
val max = new Check(CheckLevel.Error, "rule5")
.hasMax("val2", _ < 4, None, analyzerOptions)
.where("val1 < 4")
You'll see that rows 1,2,3 should be skipped -> True
Row 5 should be null as val2 is a null value there.
However, with the above method we convert all nulls to true/null - this doesn't distinguish between null values due to being filtered or null values due to null column values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eycho-am for the valuable feedback. The latest PR revision contains a new structure for the column that helps maintain the "source" of a row, whether it is in scope and filtered out. That will help in evaluating the correct outcome for each row.
@@ -374,11 +371,11 @@ class VerificationSuiteTest extends WordSpec with Matchers with SparkContextSpec | |||
|
|||
// filtered rows 1, 2, 3 (where item > 3) | |||
val minRowLevel = resultData.select(expectedColumn4).collect().map(r => r.getAs[Any](0)) | |||
assert(Seq(true, true, true, true, true, true).sameElements(minRowLevel)) | |||
assert(Seq(null, null, null, true, true, true).sameElements(minRowLevel)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these test were written with the intention that without specifying analyzer options, the default behavior would be filtered rows are true
- related to above comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted the change.
- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself. - We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.
case FilteredRowOutcome.TRUE => true | ||
case FilteredRowOutcome.NULL => null | ||
} | ||
case None => null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: similar comment for the below test-case, should the default behavior for filtered rows be true? Or are we making a decision here that they'll be null by default?
Looks good to me otherwise, I think we just need to make final a decision on this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are keeping them as is, unless the analyzer option says otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on offline discussion, updated the logic. Thanks @eycho-am for the feedback.
We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Not having it can cause match error.
* [Min/Max] Apply filtered row behavior at the row level evaluation - This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows. * Improved the separation of null rows, based on their source - Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself. - We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally. * Mark filtered rows as true We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior. * Added null behavior - empty string to match block Not having it can cause match error.
* [Min/Max] Apply filtered row behavior at the row level evaluation - This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows. * Improved the separation of null rows, based on their source - Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself. - We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally. * Mark filtered rows as true We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior. * Added null behavior - empty string to match block Not having it can cause match error.
* [Min/Max] Apply filtered row behavior at the row level evaluation - This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows. * Improved the separation of null rows, based on their source - Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself. - We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally. * Mark filtered rows as true We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior. * Added null behavior - empty string to match block Not having it can cause match error.
* [Min/Max] Apply filtered row behavior at the row level evaluation - This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows. * Improved the separation of null rows, based on their source - Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself. - We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally. * Mark filtered rows as true We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior. * Added null behavior - empty string to match block Not having it can cause match error.
* [Min/Max] Apply filtered row behavior at the row level evaluation - This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows. * Improved the separation of null rows, based on their source - Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself. - We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally. * Mark filtered rows as true We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior. * Added null behavior - empty string to match block Not having it can cause match error.
* [Min/Max] Apply filtered row behavior at the row level evaluation - This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows. * Improved the separation of null rows, based on their source - Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself. - We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally. * Mark filtered rows as true We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior. * Added null behavior - empty string to match block Not having it can cause match error.
* [Min/Max] Apply filtered row behavior at the row level evaluation - This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows. * Improved the separation of null rows, based on their source - Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself. - We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally. * Mark filtered rows as true We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior. * Added null behavior - empty string to match block Not having it can cause match error.
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.