
Finisher does not accept token if data frame comes from disk - not memory #1165

Open
ibychkov007 opened this issue Nov 16, 2020 · 0 comments
Description

Finisher seems to be much stricter about its input data than the embedding annotators.
It has a check at

require(schema(annotationColumn).dataType == ArrayType(Annotation.dataType),

The problem occurs when the results of the tokenizer are saved as parquet and later read back to be transformed by Finisher. Spark does not restore the value of the nullable flag for fields after the token field is read back, in particular for "begin", "end" and "embeddings". This causes the check to fail. By the way, this behavior of Spark is by design.
Strangely enough, when the same parquet files are used to feed Bert Embeddings, it works like a charm.
I've found a workaround: replace the schema of the DataFrame read from disk with Annotation.dataType and copy the metadata for the token field.
Is there a better way to address this issue? Maybe an existing utility class to fix the DataFrame schema? Or maybe Finisher should work in the same fashion as the embedding annotators and apply a more relaxed set of checks?
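For reference, a minimal sketch of the workaround described above: rebuild the annotation columns' fields with the canonical Annotation schema (restoring the nullable flags lost on the parquet round-trip) while keeping the original column metadata. The helper name and the exact way of recreating the DataFrame are my own choices, not part of Spark NLP's API.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
import com.johnsnowlabs.nlp.Annotation

// Rewrite the schema of the given annotation columns so that their
// dataType is exactly ArrayType(Annotation.dataType), which is what
// Finisher's require(...) check compares against. Column metadata
// (e.g. the annotatorType entry) is carried over unchanged.
def restoreAnnotationSchema(df: DataFrame, annotationCols: Seq[String]): DataFrame = {
  val fixedFields = df.schema.fields.map {
    case StructField(name, _, nullable, metadata) if annotationCols.contains(name) =>
      StructField(name, ArrayType(Annotation.dataType), nullable, metadata)
    case other => other
  }
  // Recreate the DataFrame over the same rows with the corrected schema
  df.sparkSession.createDataFrame(df.rdd, StructType(fixedFields))
}

// Usage (assuming fromDiskDF was read from parquet as in the repro below):
// val fixedDF = restoreAnnotationSchema(fromDiskDF, Seq("document", "token"))
```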

Expected Behavior

Finisher should take into account that Spark does not restore the nullable flag for fields after data has been saved as parquet.

Current Behavior

Finisher rejects data read from parquet files.

Possible Solution

Steps to Reproduce

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import com.johnsnowlabs.nlp.annotators.Tokenizer

val someData = Seq(
  Row("Task of testing is not simple"), Row("Single"), Row(""))

val someSchema = List(
  StructField("text", StringType, true))

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema))

someDF.printSchema()
someDF.show(false)

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))

val inMemoryDF = pipeline.fit(someDF).transform(someDF)

println("==================     InMemory DF")
inMemoryDF.printSchema()

inMemoryDF.write.mode("overwrite").parquet("./spark_nlp_finisher")

val fromDiskDF = spark.read.parquet("./spark_nlp_finisher")
println("==================     From Disk DF")
fromDiskDF.printSchema()

val finisher = new Finisher()
  .setInputCols("document", "token")
  .setOutputCols("document", "token")
  .setOutputAsArray(true)
  .setCleanAnnotations(true)

val model = new Pipeline().setStages(Array(finisher)).fit(inMemoryDF)

println("==================     InMemory DF: FINISHED")

model.transform(inMemoryDF).show(false)

println("==================     FromDISK DF: FINISHED")

model.transform(fromDiskDF).show(false)

================== InMemory DF: FINISHED
+-----------------------------+-------------------------------+------------------------------------+
|text                         |document                       |token                               |
+-----------------------------+-------------------------------+------------------------------------+
|Task of testing is not simple|[Task of testing is not simple]|[Task, of, testing, is, not, simple]|
|Single                       |[Single]                       |[Single]                            |
|                             |[]                             |[]                                  |
+-----------------------------+-------------------------------+------------------------------------+

================== FromDISK DF: FINISHED
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: column [document] must be an NLP Annotation column
at scala.Predef$.require(Predef.scala:224)
at com.johnsnowlabs.nlp.Finisher$$anonfun$transformSchema$2.apply(Finisher.scala:68)
at com.johnsnowlabs.nlp.Finisher$$anonfun$transformSchema$2.apply(Finisher.scala:62)

Context

Your Environment

  • Spark NLP version: 2.6.3
  • Apache Spark version:
  • Java version (java -version): Java(TM) SE Runtime Environment (build 1.8.0_192-b12)
  • Setup and installation (Pypi, Conda, Maven, etc.):
  • Operating System and version: Darwin Kernel Version 19.6.0
  • Link to your project (if any):