
Numbers are parsed as numbers when enforceSchema is set to true and the type is String #747

Open
FlavioShneider opened this issue Jun 12, 2023 · 4 comments

@FlavioShneider
Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

One column can contain large numeric values such as
1010675
Note:

  • enforceSchema is set to true;
  • the schema type for this column was specified as String;
  • usePlainNumberFormat = true;
  • this is similar to, but distinct from, an existing issue (Large Numbers Being Captured As Scientific Notation #126).
    Since that issue's title is different, I worried this problem might be overlooked, so I am raising it here with an explicit title.

The current read/parse result is:
1010675.0

Expected Behavior

The string 1010675, not transformed into a number.

Steps To Reproduce

No response

Environment

- Spark version: 3.2.2
- Spark-Excel version: 3.2.2_0.18.5
- OS:
- Cluster environment

Anything else?

No response

@nightscape
Collaborator

Hi @FlavioShneider, can you post the exact code, and ideally also the Excel file, to reproduce the issue?

@FlavioShneider
Author

FlavioShneider commented Jun 12, 2023

Input
Test1.xlsx

Code
The code is written in the Azure Databricks (10.4 LTS) notebook style.
Scala

import org.apache.spark.sql.{DataFrame, DataFrameReader, SparkSession}

val schema = "`productId` STRING, `price` DOUBLE"

def inferOrEnforceSchema(
  schema: Option[String],
  reader: DataFrameReader,
  path: String
): DataFrame =
  schema match {
    case None         => reader.option("inferSchema", "true").load(path)
    case Some(schema) => reader.option("enforceSchema", "true").schema(schema).load(path)
  }

val excelReader = spark.read
  .format("excel")
  .option("dataAddress", "0!A3")
  .option("usePlainNumberFormat", true)
  .option("treatEmptyValuesAsNulls", true)
  .option("excerptSize", 30)
  .option("header", "true")

val result = inferOrEnforceSchema(Some(schema), excelReader, datasetPath)

display(result)

Output
export (4).csv

@nightscape
Collaborator

I think I know what's happening: the productId is specified as a STRING in the schema but stored as a number in the Excel file, so we run into this line, where we call a method to convert the number to a String.
I'm not sure how we could infer the formatting such that the String looks like an integer in the end.
If you're familiar with Scala debugging, you could try setting a breakpoint at one of the lines above and see what additional information is available.
Otherwise, you could also just format the column as Text in Excel, if that's an option.
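
To illustrate the diagnosis above, here is a hypothetical sketch in plain Scala (not spark-excel's actual code path): when the numeric cell value is held as a Double and rendered to a String directly, the result carries a trailing .0 and, for larger magnitudes, switches to scientific notation, which matches both the symptom reported here and the one in #126.

// Hypothetical illustration only, not spark-excel code: rendering a Double
// with toString explains the ".0" suffix and the scientific notation.
val productId = 1010675.0          // numeric cell value as read from Excel
println(productId.toString)        // prints "1010675.0"

val bigValue = 10106750000.0       // larger magnitude
println(bigValue.toString)         // prints "1.010675E10"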

@coryvisi

Could this behavior also be triggered when inferSchema=true and a column has mixed values (some string and some numeric, some produced by formulas)?

Looking at the code, it appears that DataFormatter is used when usePlainNumberFormat is set to true, which should be a safe conversion, but I'm seeing behavior that converts numeric values to scientific notation when the column contains formulas and mixed values.

Spark 3.1.1
Spark-Excel version: 0.18.7
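
A possible Spark-side workaround, sketched under the assumption that the affected column holds integral values that fit in a Decimal(38, 0): cast the column through DecimalType before rendering it back to a string, which drops the trailing .0 and expands scientific-notation strings.

// Workaround sketch: `result` is the DataFrame from the repro above; the
// DecimalType round-trip normalizes "1010675.0" (and "1.010675E7"-style
// strings) to "1010675".
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val normalized = result.withColumn(
  "productId",
  col("productId").cast(DecimalType(38, 0)).cast("string")
)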
