
Finisher does not accept token if data frame comes from disk - not memory #1165

Open
ibychkov007 opened this issue Nov 16, 2020 · 0 comments
Description

Finisher seems to be much stricter about its input data than the embedding annotators.
It has a check at

require(schema(annotationColumn).dataType == ArrayType(Annotation.dataType),

The problem occurs when the results of the tokenizer are saved as parquet and later read back to be transformed by Finisher. Spark does not restore the value of the nullable flag for fields after the token field is read back, in particular for "begin", "end" and "embeddings". This causes the check to fail. By the way, this behavior of Spark is by design.
Strangely enough, when the same parquet files are used to feed Bert Embeddings, it works like a charm.
I've found a workaround: replace the schema of the DataFrame read from disk with Annotation.dataType and copy the metadata for the token field.
Is there a better way to address this issue? Maybe an existing utility class to fix the DataFrame schema? Or maybe Finisher should work in the same fashion as the embedding annotators and apply a more relaxed set of checks?
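For reference, a minimal sketch of the workaround described above: rebuild the annotation columns' fields with the canonical Annotation schema (restoring the nullable flags lost on the parquet round-trip) while keeping the original column metadata. The helper name and the exact way of recreating the DataFrame are my own choices, not part of Spark NLP's API.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
import com.johnsnowlabs.nlp.Annotation

// Rewrite the schema of the given annotation columns so that their
// dataType is exactly ArrayType(Annotation.dataType), which is what
// Finisher's require(...) check compares against. Column metadata
// (e.g. the annotatorType entry) is carried over unchanged.
def restoreAnnotationSchema(df: DataFrame, annotationCols: Seq[String]): DataFrame = {
  val fixedFields = df.schema.fields.map {
    case StructField(name, _, nullable, metadata) if annotationCols.contains(name) =>
      StructField(name, ArrayType(Annotation.dataType), nullable, metadata)
    case other => other
  }
  // Recreate the DataFrame over the same rows with the corrected schema
  df.sparkSession.createDataFrame(df.rdd, StructType(fixedFields))
}

// Usage (assuming fromDiskDF was read from parquet as in the repro below):
// val fixedDF = restoreAnnotationSchema(fromDiskDF, Seq("document", "token"))
```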

Expected Behavior

Finisher should take into account that Spark does not restore the nullable flag for fields after data has been saved as parquet.

Current Behavior

Finisher rejects data read from parquet files.

Possible Solution

Steps to Reproduce

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import com.johnsnowlabs.nlp.annotators.Tokenizer

val someData = Seq(
  Row("Task of testing is not simple"), Row("Single"), Row(""))

val someSchema = List(
  StructField("text", StringType, true))

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema))

someDF.printSchema()
someDF.show(false)

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))

val inMemoryDF = pipeline.fit(someDF).transform(someDF)

println("==================     InMemory DF")
inMemoryDF.printSchema()

inMemoryDF.write.mode("overwrite").parquet("./spark_nlp_finisher")

val fromDiskDF = spark.read.parquet("./spark_nlp_finisher")
println("==================     From Disk DF")
fromDiskDF.printSchema()

val finisher = new Finisher()
  .setInputCols("document", "token")
  .setOutputCols("document", "token")
  .setOutputAsArray(true)
  .setCleanAnnotations(true)

val model = new Pipeline().setStages(Array(finisher)).fit(inMemoryDF)

println("==================     InMemory DF: FINISHED")

model.transform(inMemoryDF).show(false)

println("==================     FromDISK DF: FINISHED")

model.transform(fromDiskDF).show(false)

================== InMemory DF: FINISHED
+-----------------------------+-------------------------------+------------------------------------+
|text                         |document                       |token                               |
+-----------------------------+-------------------------------+------------------------------------+
|Task of testing is not simple|[Task of testing is not simple]|[Task, of, testing, is, not, simple]|
|Single                       |[Single]                       |[Single]                            |
|                             |[]                             |[]                                  |
+-----------------------------+-------------------------------+------------------------------------+

================== FromDISK DF: FINISHED
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: column [document] must be an NLP Annotation column
at scala.Predef$.require(Predef.scala:224)
at com.johnsnowlabs.nlp.Finisher$$anonfun$transformSchema$2.apply(Finisher.scala:68)
at com.johnsnowlabs.nlp.Finisher$$anonfun$transformSchema$2.apply(Finisher.scala:62)

Context

Your Environment

  • Spark NLP version: 2.6.3
  • Apache Spark version:
  • Java version (java -version): Java(TM) SE Runtime Environment (build 1.8.0_192-b12)
  • Setup and installation (Pypi, Conda, Maven, etc.):
  • Operating System and version: Darwin Kernel Version 19.6.0
  • Link to your project (if any):