to_csv: Handle edge cases found during fuzz testing #3232

@andygrove

Summary

During review of PR #3004 (which adds basic to_csv support), fuzz testing revealed several edge cases that are not handled correctly. These should be addressed in follow-up work after the initial implementation is merged.

Bugs Found

1. Null value not quoted when it contains special characters

When the nullValue option contains the delimiter or other special characters (e.g., "N,A"), it's written unquoted, corrupting the CSV output.

Expected (Spark) | Actual (Comet)
"N,A",world      | N,A,world
hello,"N,A"      | hello,N,A

Location: native/spark-expr/src/csv_funcs/to_csv.rs:164-171

Fix: Check if null_value contains special characters and quote/escape it appropriately.
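
One way the fix could look is sketched below. The helper names (`needs_quoting`, `quote_field`, `render_null`) are hypothetical and do not come from `to_csv.rs`, and the sketch assumes that only the quote character is escaped inside a quoted field. Spark's writer may apply additional rules.

```rust
/// Hypothetical helper: true when `field` contains a character that would
/// break the row if the field were written unquoted.
fn needs_quoting(field: &str, delimiter: &str, quote: char) -> bool {
    (!delimiter.is_empty() && field.contains(delimiter))
        || field.contains(quote)
        || field.contains('\n')
        || field.contains('\r')
}

/// Hypothetical helper: wraps `field` in `quote` and prefixes every embedded
/// quote character with `escape`. (Assumption: nothing else is escaped.)
fn quote_field(field: &str, quote: char, escape: char) -> String {
    let mut out = String::with_capacity(field.len() + 2);
    out.push(quote);
    for ch in field.chars() {
        if ch == quote {
            out.push(escape);
        }
        out.push(ch);
    }
    out.push(quote);
    out
}

/// Renders the configured null representation, quoting it only when needed.
fn render_null(null_value: &str, delimiter: &str, quote: char, escape: char) -> String {
    if needs_quoting(null_value, delimiter, quote) {
        quote_field(null_value, quote, escape)
    } else {
        null_value.to_string()
    }
}
```

With these helpers, `render_null("N,A", ",", '"', '\\')` returns `"N,A"` (quoted), while a plain value such as `NULL` is returned unchanged.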

2. Whitespace trimming applied incorrectly

When ignoreLeadingWhiteSpace=false or ignoreTrailingWhiteSpace=false, strings that contain whitespace together with special characters are handled incorrectly. The code trims whitespace before checking whether quoting is needed, so whitespace that should be preserved is lost.

Expected (Spark)                             | Actual (Comet)
\" (preserved whitespace with escaped quote) | "" (empty)

Location: native/spark-expr/src/csv_funcs/to_csv.rs:176-183

Fix: Review the order of operations - quoting determination should consider the original (untrimmed) value.
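
The ordering suggested above might be sketched as follows. `WhitespaceOptions` and `format_field` are hypothetical names, not taken from `to_csv.rs`. The sketch only illustrates the order of operations: decide on quoting from the untrimmed value, then trim only the sides the options ask for. The exact shape of a quoted field is an assumption and may not match Spark byte for byte.

```rust
/// Hypothetical stand-in for the two whitespace options discussed above.
struct WhitespaceOptions {
    ignore_leading: bool,
    ignore_trailing: bool,
}

/// Formats one string field. Quoting is decided from the original value,
/// and trimming happens only when the corresponding option is enabled.
fn format_field(
    value: &str,
    opts: &WhitespaceOptions,
    delimiter: char,
    quote: char,
    escape: char,
) -> String {
    // 1. Decide on quoting before any trimming touches the value.
    let must_quote = value.contains(delimiter)
        || value.contains(quote)
        || value.contains('\n')
        || value.contains('\r');

    // 2. Trim only the sides the options ask for.
    //    Note: Rust trims Unicode whitespace here; Spark's definition of
    //    whitespace may differ.
    let mut kept = value;
    if opts.ignore_leading {
        kept = kept.trim_start();
    }
    if opts.ignore_trailing {
        kept = kept.trim_end();
    }

    if !must_quote {
        return kept.to_string();
    }

    // 3. Quote the field, escaping embedded quote characters.
    //    (Assumption: only the quote character is escaped.)
    let mut out = String::with_capacity(kept.len() + 2);
    out.push(quote);
    for ch in kept.chars() {
        if ch == quote {
            out.push(escape);
        }
        out.push(ch);
    }
    out.push(quote);
    out
}
```

With both options set to false, a value consisting of a space followed by a quote character keeps its leading space and has the quote escaped, instead of collapsing to an empty field.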

3. Decimal formatting mismatch

Spark renders decimals of very small magnitude (for example, a zero value at scale 18) in scientific notation, while Comet always uses fixed-point notation.

Expected (Spark) | Actual (Comet)
0E-18            | 0.000000000000000000

Fix: Align decimal-to-string casting with Spark's formatting behavior.
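
The `0E-18` sample is consistent with the rules of Java's `BigDecimal.toString`, which switches to scientific notation when the scale is negative or the adjusted exponent is below -6. Assuming that is the behavior to match, a minimal sketch of the formatting is shown below. The function name `format_decimal_like_bigdecimal` is hypothetical, and the decimal is assumed to be given as an unscaled `i128` plus a scale.

```rust
/// Hypothetical helper: formats `unscaled * 10^(-scale)` following the
/// rules documented for java.math.BigDecimal.toString.
fn format_decimal_like_bigdecimal(unscaled: i128, scale: i32) -> String {
    let digits = unscaled.unsigned_abs().to_string();
    let sign = if unscaled < 0 { "-" } else { "" };
    let n = digits.len() as i64;
    let scale = scale as i64;
    // Exponent of the first digit when written as d.ddd x 10^adjusted.
    let adjusted = -scale + (n - 1);

    if scale == 0 {
        return format!("{}{}", sign, digits);
    }

    // Plain notation: non-negative scale and adjusted exponent >= -6.
    if scale > 0 && adjusted >= -6 {
        if n > scale {
            let (int_part, frac_part) = digits.split_at((n - scale) as usize);
            return format!("{}{}.{}", sign, int_part, frac_part);
        }
        let zeros = "0".repeat((scale - n) as usize);
        return format!("{}0.{}{}", sign, zeros, digits);
    }

    // Scientific notation: one digit before the point, then the exponent.
    let (first, rest) = digits.split_at(1);
    let mantissa = if rest.is_empty() {
        first.to_string()
    } else {
        format!("{}.{}", first, rest)
    };
    if adjusted == 0 {
        return format!("{}{}", sign, mantissa);
    }
    // Positive exponents carry an explicit '+'; negative ones print their own '-'.
    let exp_sign = if adjusted > 0 { "+" } else { "" };
    format!("{}{}E{}{}", sign, mantissa, exp_sign, adjusted)
}
```

For a zero value with scale 18, `format_decimal_like_bigdecimal(0, 18)` returns `0E-18`, while a value such as `123` with scale 5 stays in plain notation as `0.00123`.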

4. NPE with single-column struct (needs investigation)

A NullPointerException occurs when processing single-column structs with certain null patterns. This may be a Spark-side issue with how Comet's output is handled, but it needs investigation.

Reproduction

Fuzz tests were added in CometCsvExpressionSuite.scala that reproduce these issues:

  • to_csv - edge case: delimiter in null value representation
  • to_csv - fuzz test: comprehensive random data and options
  • to_csv - edge case: numeric boundary values
  • to_csv - edge case: single column struct
