java.lang.IllegalArgumentException when using parquet file #69

Open
JyotiRSharma opened this issue Jan 11, 2022 · 5 comments
When running a config check against a parquet file, the following error occurs:

root@lubuntu:/home/jyoti/Spark# /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml
22/01/11 11:50:53 WARN Utils: Your hostname, lubuntu resolves to a loopback address: 127.0.1.1; using 192.168.195.131 instead (on interface ens33)
22/01/11 11:50:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/01/11 11:50:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/11 11:50:59 INFO Main$: Logging configured!
22/01/11 11:51:00 INFO Main$: Data Validator
22/01/11 11:51:01 INFO ConfigParser$: Parsing `config.yaml`
22/01/11 11:51:01 INFO ConfigParser$: Attempting to load `config.yaml` from file system
Exception in thread "main" java.lang.ExceptionInInitializerError
	at com.target.data_validator.validator.RowBased.<init>(RowBased.scala:11)
	at com.target.data_validator.validator.NullCheck.<init>(NullCheck.scala:12)
	at com.target.data_validator.validator.NullCheck$.fromJson(NullCheck.scala:37)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
	at scala.Option.map(Option.scala:230)
	at com.target.data_validator.validator.JsonDecoders$$anon$7.com$target$data_validator$validator$JsonDecoders$$anon$$getDecoder(JsonDecoders.scala:32)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
	at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
	at com.target.data_validator.validator.JsonDecoders$$anon$7.apply(JsonDecoders.scala:27)
	at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
	at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
	at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$15$1$$anon$6.apply(ConfigParser.scala:21)
	at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
	at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
	at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
	at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
	at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
	at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$81$1$$anon$10.apply(ConfigParser.scala:28)
	at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
	at io.circe.Json.as(Json.scala:106)
	at com.target.data_validator.ConfigParser$.configFromJson(ConfigParser.scala:28)
	at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
	at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
	at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
	at com.target.data_validator.ConfigParser$.parse(ConfigParser.scala:65)
	at com.target.data_validator.ConfigParser$.parseFile(ConfigParser.scala:60)
	at com.target.data_validator.Main$.loadConfigRun(Main.scala:23)
	at com.target.data_validator.Main$.main(Main.scala:171)
	at com.target.data_validator.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to bigint, but class Integer found.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:219)
	at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:296)
	at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:144)
	at com.target.data_validator.validator.ValidatorBase$.<init>(ValidatorBase.scala:139)
	at com.target.data_validator.validator.ValidatorBase$.<clinit>(ValidatorBase.scala)
	... 47 more

Ran a spark-submit job as follows:

spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml

The config.yaml file has the following content:

numKeyCols: 2
numErrorsToReport: 742

tables:
  - parquetFile: /home/jyoti/Spark/userdata1.parquet
    checks:
      - type: nullCheck
        column: salary

I got userdata1.parquet from the following GitHub link:
https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet

Environment Details:
Latest source code: data-validator-0.13.0
Lubuntu 18.04 LTS x64 on VMware Player
4 CPU cores and 2 GB RAM
Java version:

jyoti@lubuntu:~$ java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

lsb_release output:

jyoti@lubuntu:~$ lsb_release -a 2>/dev/null
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04 LTS
Release:	18.04
Codename:	bionic

uname -s:

jyoti@lubuntu:~$ uname -s
Linux

sbt -version:

root@lubuntu:/home/jyoti/Spark# sbt -version
downloading sbt launcher 1.6.1
[info] [launcher] getting org.scala-sbt sbt 1.6.1  (this may take some time)...
[info] [launcher] getting Scala 2.12.15 (for sbt)...
sbt version in this project: 1.6.1
sbt script version: 1.6.1

Please let me know if you need anything else.

@JyotiRSharma (Author) commented Jan 11, 2022

However, if I run Main.scala in IntelliJ on my base Windows machine, it executes with no problem:

22/01/11 12:28:53 INFO Main$: Logging configured!
22/01/11 12:28:55 INFO Main$: Data Validator
22/01/11 12:28:55 INFO ConfigParser$: Parsing `D:\Spark\old\test_config.yaml`
22/01/11 12:28:55 INFO ConfigParser$: Attempting to load `D:\Spark\old\test_config.yaml` from file system
22/01/11 12:29:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/11 12:29:10 INFO ValidatorConfig: substituteVariables()
22/01/11 12:29:10 INFO Main$: Checking Cli Outputs htmlReport: None jsonReport: None
22/01/11 12:29:10 INFO Main$: filename: None append: false
22/01/11 12:29:10 INFO Main$: filename: None append: true
22/01/11 12:29:10 INFO ValidatorParquetFile: Reading parquet file: D:\Spark\old\DemoTime\userdata1.parquet
22/01/11 12:29:29 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
22/01/11 12:29:33 INFO Main$: Running sparkChecks
22/01/11 12:29:33 INFO ValidatorConfig: Running Quick Checks...
22/01/11 12:29:33 INFO ValidatorParquetFile: Reading parquet file: D:\Spark\old\DemoTime\userdata1.parquet
22/01/11 12:29:38 INFO ValidatorTable: Results: [1000,68]
22/01/11 12:29:38 INFO ValidatorTable: Total Rows Processed: 1000
22/01/11 12:29:38 ERROR RowBased: Quick check for NullCheck on salary failed, 68 errors in 1000 rows errorCountThreshold: 0
22/01/11 12:29:38 INFO ValidatorTable: keyColumns: registration_dttm, id
22/01/11 12:29:40 INFO ValidatorConfig: Running Costly Checks...
22/01/11 12:29:40 INFO ValidatorParquetFile: Reading parquet file: D:\Spark\old\DemoTime\userdata1.parquet
22/01/11 12:29:40 ERROR Main$: data-validator failed!
DATA_VALIDATOR_STATUS=FAIL

Process finished with exit code -1

Note: I hardcoded the config file path in ConfigParser.scala:

  def parseFile(filename: String, cliMap: Map[String, String]): Either[Error, ValidatorConfig] = {
    val filename = "D:\\Spark\\old\\test_config.yaml"
    logger.info(s"Parsing `$filename`")

I also hardcoded Spark to run locally in Main.scala:

  def runChecks(mainConfig: CmdLineOptions, origConfig: ValidatorConfig): (Boolean, Boolean) = {
    val varSub = new VarSubstitution
    implicit val spark = SparkSession.builder.appName("data-validator").master("local").enableHiveSupport().getOrCreate()

Environment Details of base machine:
OS: Windows 10 x64
Java version:

C:\Users\appde>java -version
java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)

sbt version:

D:\Spark\old\data-validator-master>sbt version
[info] welcome to sbt 1.5.7 (Oracle Corporation Java 1.8.0_281)

@colindean (Collaborator)

What happens when you set the threshold, e.g.

numKeyCols: 2
numErrorsToReport: 742

tables:
  - parquetFile: /home/jyoti/Spark/userdata1.parquet
    checks:
      - type: nullCheck
        column: salary
        threshold: 0
        # or
        threshold: "0"

It should be optional, though. We've almost always specified it.

@colindean (Collaborator)

Actually, I found it: /opt/spark/spark-3.1.2-bin-hadoop3.2. You're running Spark 3.1.2.

DV doesn't support Spark 3 yet, so all bets are off.
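
For context, the `Caused by` frame in the trace points at `ValidatorBase`'s static initializer, which builds Spark SQL literals. Under Spark 3.x, `Literal` validates that the boxed value matches the declared `DataType`, so creating a `LongType` literal from the `Int` 0 fails with exactly the "class Integer found" message above. A minimal sketch of the difference, runnable in a Spark 3.1.x spark-shell (illustrative only, not DV code):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.LongType

// Fails on Spark 3.1.2 with the IllegalArgumentException shown above:
// the Int is boxed as java.lang.Integer, which does not correspond to bigint.
// Literal.create(0, LongType)

// Passing an actual Long satisfies the validation:
Literal.create(0L, LongType)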

But try something: apply this patch to change the literals:

From 0739e46d2c7ec01d908274aa3a83edd7263fc73a Mon Sep 17 00:00:00 2001
From: Colin Dean <[email protected]>
Date: Tue, 11 Jan 2022 11:09:50 -0500
Subject: [PATCH] Use a long literal when creating a Spark SQL literal

---
 .../com/target/data_validator/validator/ValidatorBase.scala   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/main/scala/com/target/data_validator/validator/ValidatorBase.scala b/src/main/scala/com/target/data_validator/validator/ValidatorBase.scala
index d815a78..ef8d1eb 100644
--- a/src/main/scala/com/target/data_validator/validator/ValidatorBase.scala
+++ b/src/main/scala/com/target/data_validator/validator/ValidatorBase.scala
@@ -136,8 +136,8 @@ object ValidatorBase extends LazyLogging {
   private val backtick = "`"
   val I0: Literal = Literal.create(0, IntegerType)
   val D0: Literal = Literal.create(0.0, DoubleType)
-  val L0: Literal = Literal.create(0, LongType)
-  val L1: Literal = Literal.create(1, LongType)
+  val L0: Literal = Literal.create(0L, LongType)
+  val L1: Literal = Literal.create(1L, LongType)
 
   def isValueColumn(v: String): Boolean = v.startsWith(backtick)
 
-- 
2.34.1
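
If you want to try it: save the patch above to a file (the filename below is just illustrative), apply it to your data-validator checkout, and rebuild the fat jar before resubmitting. Assuming the data-validator-assembly-*.jar you're running was produced by sbt assembly, that would look roughly like:

git apply use-long-literal.patch
sbt assembly

Then rerun the same spark-submit command against the newly built jar.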

@colindean (Collaborator)

We may support Spark 3 after #84.

@JyotiRSharma (Author)

Thanks Colin, I will check it out... 😃
