Configurable RetainCompletenessRule #564

zeotuan · 2024-04-19T01:07:20Z

Close #340
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

zeotuan · 2024-04-19T01:20:22Z

src/main/scala/com/amazon/deequ/suggestions/rules/RetainCompletenessRule.scala

  */
-case class RetainCompletenessRule() extends ConstraintRule[ColumnProfile] {
-
+case class RetainCompletenessRule(minCompleteness: Double = 0.2, maxCompleteness: Double = 1.0) extends ConstraintRule[ColumnProfile] {


Decided not to Parameterize z-value likes in original implementation. Due to the fact that it is related to a specific Interval Calculation Techniques. If possible, we can work into parameterize the strategy use to calculating the interval #563

Thanks @zeotuan
Can you trim this line to below 120 characters? It is failing checkstyle and failing the build.

Can we also store the values 0.2 and 1.0 as constants ?

zeotuan · 2024-04-22T05:05:54Z

@rdsharma26 Hi, Please help review this PR.

rdsharma26

Thank you for addressing the feedback. LGTM.

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

* Configurable RetainCompletenessRule (#564) * Configurable RetainCompletenessRule * Add doc string * Add default completeness const * Optional specification of instance name in CustomSQL analyzer metric. (#569) Co-authored-by: Tyler Mcdaniel <[email protected]> * Adding Wilson Score Confidence Interval Strategy (#567) * Configurable RetainCompletenessRule * Add doc string * Add default completeness const * Add ConfidenceIntervalStrategy * Add Separate Wilson and Wald Interval Test * Add License information, Fix formatting * Add License information * formatting fix * Update documentation * Make WaldInterval the default strategy for now * Formatting import to per line * Separate group import to per line import * CustomAggregator (#572) * Add support for EntityTypes dqdl rule * Add support for Conditional Aggregation Analyzer --------- Co-authored-by: Joshua Zexter <[email protected]> * fix typo (#574) * Fix performance of building row-level results (#577) * Generate row-level results with withColumns Iteratively using withColumn (singular) causes performance issues when iterating over a large sequence of columns. * Add back UNIQUENESS_ID * Replace 'withColumns' with 'select' (#582) 'withColumns' was introduced in Spark 3.3, so it won't work for Deequ's <3.3 builds. * Replace rdd with dataframe functions in Histogram analyzer (#586) Co-authored-by: Shriya Vanvari <[email protected]> * Updated version in pom.xml to 2.0.8-spark-3.4 --------- Co-authored-by: zeotuan <[email protected]> Co-authored-by: tylermcdaniel0 <[email protected]> Co-authored-by: Tyler Mcdaniel <[email protected]> Co-authored-by: Joshua Zexter <[email protected]> Co-authored-by: Joshua Zexter <[email protected]> Co-authored-by: bojackli <[email protected]> Co-authored-by: Josh <[email protected]> Co-authored-by: Shriya Vanvari <[email protected]> Co-authored-by: Shriya Vanvari <[email protected]>

* Configurable RetainCompletenessRule (#564) * Configurable RetainCompletenessRule * Add doc string * Add default completeness const * Optional specification of instance name in CustomSQL analyzer metric. (#569) Co-authored-by: Tyler Mcdaniel <[email protected]> * Adding Wilson Score Confidence Interval Strategy (#567) * Configurable RetainCompletenessRule * Add doc string * Add default completeness const * Add ConfidenceIntervalStrategy * Add Separate Wilson and Wald Interval Test * Add License information, Fix formatting * Add License information * formatting fix * Update documentation * Make WaldInterval the default strategy for now * Formatting import to per line * Separate group import to per line import * CustomAggregator (#572) * Add support for EntityTypes dqdl rule * Add support for Conditional Aggregation Analyzer --------- Co-authored-by: Joshua Zexter <[email protected]> * fix typo (#574) * Fix performance of building row-level results (#577) * Generate row-level results with withColumns Iteratively using withColumn (singular) causes performance issues when iterating over a large sequence of columns. * Add back UNIQUENESS_ID * Replace 'withColumns' with 'select' (#582) 'withColumns' was introduced in Spark 3.3, so it won't work for Deequ's <3.3 builds. * Replace rdd with dataframe functions in Histogram analyzer (#586) Co-authored-by: Shriya Vanvari <[email protected]> * Match Breeze version with spark 3.3 (#562) * Updated version in pom.xml to 2.0.8-spark-3.3 --------- Co-authored-by: zeotuan <[email protected]> Co-authored-by: tylermcdaniel0 <[email protected]> Co-authored-by: Tyler Mcdaniel <[email protected]> Co-authored-by: Joshua Zexter <[email protected]> Co-authored-by: Joshua Zexter <[email protected]> Co-authored-by: bojackli <[email protected]> Co-authored-by: Josh <[email protected]> Co-authored-by: Shriya Vanvari <[email protected]> Co-authored-by: Shriya Vanvari <[email protected]>

* Configurable RetainCompletenessRule (#564) * Configurable RetainCompletenessRule * Add doc string * Add default completeness const * Optional specification of instance name in CustomSQL analyzer metric. (#569) Co-authored-by: Tyler Mcdaniel <[email protected]> * Adding Wilson Score Confidence Interval Strategy (#567) * Configurable RetainCompletenessRule * Add doc string * Add default completeness const * Add ConfidenceIntervalStrategy * Add Separate Wilson and Wald Interval Test * Add License information, Fix formatting * Add License information * formatting fix * Update documentation * Make WaldInterval the default strategy for now * Formatting import to per line * Separate group import to per line import * CustomAggregator (#572) * Add support for EntityTypes dqdl rule * Add support for Conditional Aggregation Analyzer --------- Co-authored-by: Joshua Zexter <[email protected]> * fix typo (#574) * Fix performance of building row-level results (#577) * Generate row-level results with withColumns Iteratively using withColumn (singular) causes performance issues when iterating over a large sequence of columns. * Add back UNIQUENESS_ID * Replace 'withColumns' with 'select' (#582) 'withColumns' was introduced in Spark 3.3, so it won't work for Deequ's <3.3 builds. * Replace rdd with dataframe functions in Histogram analyzer (#586) Co-authored-by: Shriya Vanvari <[email protected]> * Updated version in pom.xml to 2.0.8-spark-3.2 --------- Co-authored-by: zeotuan <[email protected]> Co-authored-by: tylermcdaniel0 <[email protected]> Co-authored-by: Tyler Mcdaniel <[email protected]> Co-authored-by: Joshua Zexter <[email protected]> Co-authored-by: Joshua Zexter <[email protected]> Co-authored-by: bojackli <[email protected]> Co-authored-by: Josh <[email protected]> Co-authored-by: Shriya Vanvari <[email protected]> Co-authored-by: Shriya Vanvari <[email protected]>

* Configurable RetainCompletenessRule (#564) * Configurable RetainCompletenessRule * Add doc string * Add default completeness const * Optional specification of instance name in CustomSQL analyzer metric. (#569) Co-authored-by: Tyler Mcdaniel <[email protected]> * Adding Wilson Score Confidence Interval Strategy (#567) * Configurable RetainCompletenessRule * Add doc string * Add default completeness const * Add ConfidenceIntervalStrategy * Add Separate Wilson and Wald Interval Test * Add License information, Fix formatting * Add License information * formatting fix * Update documentation * Make WaldInterval the default strategy for now * Formatting import to per line * Separate group import to per line import * CustomAggregator (#572) * Add support for EntityTypes dqdl rule * Add support for Conditional Aggregation Analyzer --------- Co-authored-by: Joshua Zexter <[email protected]> * fix typo (#574) * Fix performance of building row-level results (#577) * Generate row-level results with withColumns Iteratively using withColumn (singular) causes performance issues when iterating over a large sequence of columns. * Add back UNIQUENESS_ID * Replace 'withColumns' with 'select' (#582) 'withColumns' was introduced in Spark 3.3, so it won't work for Deequ's <3.3 builds. * Replace rdd with dataframe functions in Histogram analyzer (#586) Co-authored-by: Shriya Vanvari <[email protected]> * pdated version in pom.xml to 2.0.8-spark-3.1 --------- Co-authored-by: zeotuan <[email protected]> Co-authored-by: tylermcdaniel0 <[email protected]> Co-authored-by: Tyler Mcdaniel <[email protected]> Co-authored-by: Joshua Zexter <[email protected]> Co-authored-by: Joshua Zexter <[email protected]> Co-authored-by: bojackli <[email protected]> Co-authored-by: Josh <[email protected]> Co-authored-by: Shriya Vanvari <[email protected]> Co-authored-by: Shriya Vanvari <[email protected]>

zeotuan added 2 commits April 19, 2024 11:06

Configurable RetainCompletenessRule

3b41e4c

Add doc string

ac337ea

zeotuan commented Apr 19, 2024

View reviewed changes

Add default completeness const

db9b764

zeotuan requested a review from rdsharma26 May 1, 2024 08:21

rdsharma26 approved these changes May 6, 2024

View reviewed changes

rdsharma26 merged commit 49e970c into awslabs:master May 6, 2024
1 check passed

zeotuan deleted the TPM/RetainCompleteness branch May 9, 2024 01:26

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

Configurable RetainCompletenessRule (awslabs#564)

e935e18

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

Configurable RetainCompletenessRule (awslabs#564)

b474a37

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

Configurable RetainCompletenessRule (awslabs#564)

56053d9

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

Configurable RetainCompletenessRule (awslabs#564)

9a6713b

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Configurable RetainCompletenessRule #564

Configurable RetainCompletenessRule #564

zeotuan commented Apr 19, 2024

zeotuan Apr 19, 2024

rdsharma26 Apr 30, 2024

rdsharma26 Apr 30, 2024

zeotuan commented Apr 22, 2024

rdsharma26 left a comment

Configurable RetainCompletenessRule #564

Configurable RetainCompletenessRule #564

Conversation

zeotuan commented Apr 19, 2024

zeotuan Apr 19, 2024

Choose a reason for hiding this comment

rdsharma26 Apr 30, 2024

Choose a reason for hiding this comment

rdsharma26 Apr 30, 2024

Choose a reason for hiding this comment

zeotuan commented Apr 22, 2024

rdsharma26 left a comment

Choose a reason for hiding this comment