
Numbers are parsed as numbers when enforceSchema is set to true and the type is String #747

Open
FlavioShneider opened this issue Jun 12, 2023 · 4 comments

@FlavioShneider
Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

One column can contain large numeric values such as
1010675
Note:

  • enforceSchema is set to true;
  • the schema type for this column was specified as String;
  • usePlainNumberFormat = true;
  • this is similar to, but distinct from, an existing issue (Large Numbers Being Captured As Scientific Notation #126).
    Since that issue's title is different, I worried this problem might be overlooked, so I am raising it here with an explicit title.

The current read/parse result is:
1010675.0

Expected Behavior

The string 1010675, not transformed into a number.

Steps To Reproduce

No response

Environment

- Spark version: 3.2.2
- Spark-Excel version: 3.2.2_0.18.5
- OS:
- Cluster environment

Anything else?

No response

@nightscape
Collaborator

Hi @FlavioShneider, can you post the exact code, and ideally also the Excel file, to reproduce the issue?

@FlavioShneider
Author

FlavioShneider commented Jun 12, 2023

Input
Test1.xlsx

Code
The code is written in the Azure Databricks (10.4 LTS) notebook style.
Scala

import org.apache.spark.sql.{DataFrame, DataFrameReader, SparkSession}

val schema = "`productId` STRING, `price` DOUBLE"

def inferOrEnforceSchema(
  schema: Option[String],
  reader: DataFrameReader,
  path: String
): DataFrame =
  schema match {
    case None         => reader.option("inferSchema", "true").load(path)
    case Some(schema) => reader.option("enforceSchema", "true").schema(schema).load(path)
  }

val excelReader = spark.read
  .format("excel")
  .option("dataAddress", "0!A3")
  .option("usePlainNumberFormat", true)
  .option("treatEmptyValuesAsNulls", true)
  .option("excerptSize", 30)
  .option("header", "true")

val result = inferOrEnforceSchema(Some(schema), excelReader, datasetPath)

display(result)

Output
export (4).csv

@nightscape
Collaborator

I think I know what's happening: the productId is specified as a STRING in the schema but stored as a number in the Excel file, so we run into this line, where we call a method to convert the number to a String.
I'm not sure how we could infer the formatting such that the String looks like an integer in the end.
If you're familiar with Scala debugging, you could try setting a breakpoint at one of the lines above and see what additional information is available.
Otherwise, you could also just format the column as Text in Excel, if that's an option.
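
To illustrate the diagnosis above, here is a hypothetical sketch in plain Scala (not spark-excel's actual code path): when the numeric cell value is held as a Double and rendered to a String directly, the result carries a trailing .0 and, for larger magnitudes, switches to scientific notation, which matches both the symptom reported here and the one in #126.

// Hypothetical illustration only, not spark-excel code: rendering a Double
// with toString explains the ".0" suffix and the scientific notation.
val productId = 1010675.0          // numeric cell value as read from Excel
println(productId.toString)        // prints "1010675.0"

val bigValue = 10106750000.0       // larger magnitude
println(bigValue.toString)         // prints "1.010675E10"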

@coryvisi

Could this behavior also be triggered when inferSchema=true and a column has mixed values (some string and some numeric, some produced by formulas)?

Looking at the code, it appears that DataFormatter is used when usePlainNumberFormat is set to true, which should be a safe conversion, but I'm seeing behavior that converts numeric values to scientific notation when the column contains formulas and mixed values.

Spark 3.1.1
Spark-Excel version: 0.18.7
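
A possible Spark-side workaround, sketched under the assumption that the affected column holds integral values that fit in a Decimal(38, 0): cast the column through DecimalType before rendering it back to a string, which drops the trailing .0 and expands scientific-notation strings.

// Workaround sketch: `result` is the DataFrame from the repro above; the
// DecimalType round-trip normalizes "1010675.0" (and "1.010675E7"-style
// strings) to "1010675".
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val normalized = result.withColumn(
  "productId",
  col("productId").cast(DecimalType(38, 0)).cast("string")
)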
