Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String Length Check #14

Merged
merged 7 commits into from
Jun 27, 2019
Merged

String Length Check #14

merged 7 commits into from
Jun 27, 2019

Conversation

samratmitra-0812
Copy link
Collaborator

EEDD-1020: String Length Check.
Unit Test results attached in https://jira.target.com/browse/EEDD-1020

Copy link
Collaborator

@colindean colindean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few changes to the readme that I'd want to see.

Great work on the thorough tests. An improvement opportunity I see would reduce duplication inside of many of the tests. Move commonly used things into a test environment trait or collection of them and extend that environment in the test, e.g.

//old
it("minValue") {
  val dict = new VarSubstitution
  //… the rest
}

//new
it("minValue") in new VarSubstutitionSetup {
  //… the rest
}

trait VarSubstitutionSetup {
  val dict = new VarSubstitution
}

This is a really shallow example, but there are some other duplicate things or near duplicates that you could move into this trait or other traits, e.g. VarSubstitutionSetup with YamlDecoder and that latter trait is defined as

trait YamlDecoder { 
  def decodeYaml(yaml: String) = io.circe.yaml.parser.parse(yaml).right.getOrElse(Json.Null)
}

This "cake building" would let you reduce your tests in many cases to only the assertions, maybe with one additional line that calls methods provided by the test environment traits.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@dougb dougb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samratmitra-0812 This is great!

The tests are pretty consistent with the existing code, but @colindean comments are good. I think we should have a separate pr to cleanup the test like what Colin suggests. I tried to add some handy test functions in #13 but something like what colin suggested might be better.

I would be fine with just making the suggested changes to the README.md and getting this merged, and fixing the tests in a future pr after #13 is merged.

samratmitra-0812 and others added 3 commits June 25, 2019 12:26
Co-Authored-By: Colin Dean <[email protected]>
Co-Authored-By: Colin Dean <[email protected]>
Co-Authored-By: Colin Dean <[email protected]>
@samratmitra-0812
Copy link
Collaborator Author

@dougb I completely agree with you. We should really take up on colin's suggestions and do the refactoring. Let us make sure we have a JIRA ticket for this.

I have commited the suggestions to README.md.

Copy link
Collaborator

@dougb dougb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
I might have found a problem testing this on BR, need to dig a little more. Don't merge it yet.

Copy link
Contributor

@phpisciuneri phpisciuneri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samratmitra-0812 This is really nice, thank you for contributing. I don't like the ordering assumption we make on the asserts on the tests. I am curious to hear what @dougb thinks. I am OK with addressing this in the test refactoring PR.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructType}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samratmitra-0812 DataType is an unused import


import com.target.TestingSparkSession
import com.target.data_validator._
import com.target.data_validator.validator.ValidatorBase.D0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused import

describe("configCheck") {

it("error if min and max are not defined") {
val dict = new VarSubstitution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict is unused

}

it("error if min and max are different types") {
val dict = new VarSubstitution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict is unused

}

it("error if column is not found in df") {
val dict = new VarSubstitution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict is unused

}

it("error if col type is not String") {
val dict = new VarSubstitution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict is unused

}

it("minValue less than maxValue fails configCheck") {
val dict = new VarSubstitution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict is unused

assert(sut.failed)
assert(sut.getEvents contains
ValidatorCheckEvent(failure = true, "StringLengthCheck on column 'item'", 4, 2))
assert(sut.getEvents contains
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samratmitra-0812 @dougb I think we need to be careful here, and in other tests below.

This assert is based on the original ordering of defData. There is no guarantee that when converted to a dataframe, partitioned, and operated on that we will get results back in the same order. Since there are two potential modes of failure here this result could very well be the other one (empty string).

https://issues.apache.org/jira/browse/SPARK-16207?focusedCommentId=15349509&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15349509

@samratmitra-0812
Copy link
Collaborator Author

samratmitra-0812 commented Jun 26, 2019

@phpisciuneri Thanks for your comments. I have removed the unused imports and variables.

Nice catch on the assert ! I have made modifications by checking that exactly one of the 2 errors is present in the event list.

The issue does not exist for other tests because for all of them, the number of invalid rows is equal to the specified value of numErrorsToReport. So we can expect errors for all the invalid rows to be present in the list.

Copy link
Collaborator

@colindean colindean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll clear my RC if you'll refactor the tests in a future PR 😉

@dougb dougb merged commit 1743579 into master Jun 27, 2019
@dougb dougb deleted the StringLengthCheck branch June 27, 2019 01:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants