Description
Finisher seems to be very strict about its input data compared to the embedding annotators. It has a check at spark-nlp/src/main/scala/com/johnsnowlabs/nlp/Finisher.scala (line 68 at commit a36aeb3).
The problem occurs when the results of the Tokenizer are saved as parquet and later read back to be transformed by the Finisher. Spark does not restore the value of the "nullable" flag for the fields of the token column after it is read back, in particular for "begin", "end", and "EMBEDDINGS", which causes the check to fail. Note that this behavior of Spark is by design.
Strangely enough, when the same parquet files are used to feed BertEmbeddings, everything works like a charm.
I've found a workaround: replace the schema of the read DataFrame with Annotation.dataType and copy the metadata for the token field.
Is there a better way to address this issue? Maybe there is an existing utility class to fix the DataFrame schema? Or maybe the Finisher should work in the same fashion as the embedding annotators and use a more relaxed set of checks?
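For context, the failing requirement in transformSchema appears to compare the column's type against the annotation type exactly; this is a paraphrase of the check, not the exact source, with annotationColumn standing in for each configured input column name:

// Paraphrased check: DataType equality in Spark includes the nullable
// flags of nested struct fields (StructField is a case class), so a
// parquet round trip that flips begin/end/embeddings to nullable = true
// makes this require(...) fail even though the data itself is unchanged.
require(
  schema(annotationColumn).dataType == ArrayType(Annotation.dataType),
  s"column [$annotationColumn] must be an NLP Annotation column"
)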
Expected Behavior
The Finisher should take into consideration that Spark does not restore the nullable flag for fields after data has been saved as parquet.
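For reference, the flag flip is easy to demonstrate with a plain Spark round trip, independent of Spark NLP (a minimal sketch assuming a live SparkSession named spark; the /tmp path is only illustrative):

import spark.implicits._

// Tuples of primitives produce a non-nullable column: id is nullable = false.
val demo = Seq((1, "a")).toDF("id", "text")
demo.printSchema()

// After a parquet round trip every column comes back nullable = true,
// because Spark converts all parquet columns to nullable on read.
demo.write.mode("overwrite").parquet("/tmp/nullable_demo")
spark.read.parquet("/tmp/nullable_demo").printSchema()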
Current Behavior
The Finisher rejects data read from parquet files with an IllegalArgumentException (see the stack trace below).
Possible Solution
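A sketch of the workaround described above, with fixAnnotationSchema as a hypothetical helper (it is not part of Spark NLP): rebuild the schema so that annotation columns use Annotation.dataType again, restoring the original nullable flags, while copying the field metadata the Finisher relies on.

import com.johnsnowlabs.nlp.Annotation
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

// Hypothetical helper: re-apply Annotation.dataType (and with it the
// original nullable flags) to the given annotation columns, keeping the
// field metadata so the annotatorType information survives.
def fixAnnotationSchema(df: DataFrame, annotationCols: Seq[String]): DataFrame = {
  val fixedFields = df.schema.fields.map { field =>
    if (annotationCols.contains(field.name))
      StructField(field.name, ArrayType(Annotation.dataType), nullable = true, field.metadata)
    else
      field
  }
  df.sparkSession.createDataFrame(df.rdd, StructType(fixedFields))
}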
Steps to Reproduce
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val someData = Seq(
  Row("Task of testing is not simple"), Row("Single"), Row(""))
val someSchema = List(
  StructField("text", StringType, nullable = true))
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema))
someDF.printSchema()
someDF.show(false)

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))

// Annotate in memory, then round-trip the result through parquet.
val inMemoryDF = pipeline.fit(someDF).transform(someDF)
println("================== InMemory DF")
inMemoryDF.printSchema()
inMemoryDF.write.mode("overwrite").parquet("./spark_nlp_finisher")
val fromDiskDF = spark.read.parquet("./spark_nlp_finisher")
println("================== From Disk DF")
fromDiskDF.printSchema()

// The same Finisher model works on the in-memory DataFrame but fails
// on the one read back from parquet.
val finisher = new Finisher()
  .setInputCols("document", "token")
  .setOutputCols("document", "token")
  .setOutputAsArray(true)
  .setCleanAnnotations(true)
val model = new Pipeline().setStages(Array(finisher)).fit(inMemoryDF)
println("================== InMemory DF: FINISHED")
model.transform(inMemoryDF).show(false)
println("================== FromDISK DF: FINISHED")
model.transform(fromDiskDF).show(false)
================== InMemory DF: FINISHED
+-----------------------------+-------------------------------+------------------------------------+
|text |document |token |
+-----------------------------+-------------------------------+------------------------------------+
|Task of testing is not simple|[Task of testing is not simple]|[Task, of, testing, is, not, simple]|
|Single |[Single] |[Single] |
| |[] |[] |
+-----------------------------+-------------------------------+------------------------------------+
================== FromDISK DF: FINISHED
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: column [document] must be an NLP Annotation column
at scala.Predef$.require(Predef.scala:224)
at com.johnsnowlabs.nlp.Finisher$$anonfun$transformSchema$2.apply(Finisher.scala:68)
at com.johnsnowlabs.nlp.Finisher$$anonfun$transformSchema$2.apply(Finisher.scala:62)
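With the hypothetical fixAnnotationSchema helper sketched under Possible Solution above, the same model should also accept the DataFrame read from disk:

val fixedDF = fixAnnotationSchema(fromDiskDF, Seq("document", "token"))
model.transform(fixedDF).show(false)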
Context
Your Environment
Spark NLP version: 2.6.3
Apache Spark version:
Java version (java -version): Java(TM) SE Runtime Environment (build 1.8.0_192-b12)
Setup and installation (Pypi, Conda, Maven, etc.):
Operating System and version: Darwin Kernel Version 19.6.0
Link to your project (if any):