Description
Summary
During review of PR #3004 (which adds basic to_csv support), fuzz testing revealed several edge cases that are not handled correctly. These should be addressed in follow-up work after the initial implementation is merged.
Bugs Found
1. Null value not quoted when it contains special characters
When the nullValue option contains the delimiter or other special characters (e.g., "N,A"), it's written unquoted, corrupting the CSV output.
| Expected (Spark) | Actual (Comet) |
|---|---|
| `"N,A",world` | `N,A,world` |
| `hello,"N,A"` | `hello,N,A` |
Location: native/spark-expr/src/csv_funcs/to_csv.rs:164-171
Fix: Check if null_value contains special characters and quote/escape it appropriately.
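A minimal sketch of the idea, not the actual code in `to_csv.rs`: route the null representation through the same quoting path as ordinary values. The helper names `quote_if_needed` and `write_row` are hypothetical, and the sketch assumes `"` as the quote character and `\` as the escape character.

```rust
/// Hypothetical helper: wrap `field` in quotes when it contains a character
/// that would otherwise change how the row is parsed.
fn quote_if_needed(field: &str, delimiter: char, quote: char, escape: char) -> String {
    let needs_quoting = field
        .chars()
        .any(|c| c == delimiter || c == quote || c == '\n' || c == '\r');
    if !needs_quoting {
        return field.to_string();
    }
    let mut out = String::with_capacity(field.len() + 2);
    out.push(quote);
    for c in field.chars() {
        // Escape embedded quote characters only (an assumption of this sketch).
        if c == quote {
            out.push(escape);
        }
        out.push(c);
    }
    out.push(quote);
    out
}

/// Hypothetical row writer: nulls are replaced by `null_value` first,
/// then every cell (including the null representation) is quoted if needed.
fn write_row(fields: &[Option<&str>], null_value: &str, delimiter: char) -> String {
    let cells: Vec<String> = fields
        .iter()
        .map(|f| quote_if_needed((*f).unwrap_or(null_value), delimiter, '"', '\\'))
        .collect();
    cells.join(delimiter.to_string().as_str())
}
```

With `null_value = "N,A"`, a row of `[None, Some("world")]` comes out as `"N,A",world` in this sketch, matching the expected column of the table above.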
2. Whitespace trimming applied incorrectly
When ignoreLeadingWhiteSpace=false or ignoreTrailingWhiteSpace=false, strings containing whitespace plus special characters are incorrectly handled. The code trims whitespace before checking if quoting is needed.
| Expected (Spark) | Actual (Comet) |
|---|---|
| `\"` (preserved whitespace with escaped quote) | `""` (empty) |
Location: native/spark-expr/src/csv_funcs/to_csv.rs:176-183
Fix: Review the order of operations - quoting determination should consider the original (untrimmed) value.
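The suggested order of operations could look roughly like the sketch below. It is an illustration under assumptions, not the existing implementation: `format_field` is a hypothetical name, the delimiter is hard-coded to `,`, and `"` and `\` are assumed as quote and escape characters. Whether the result matches Spark byte for byte in every case would still need to be checked against the fuzz tests.

```rust
/// Hypothetical sketch: decide on quoting from the original value,
/// then trim only the sides whose option is enabled.
fn format_field(value: &str, ignore_leading: bool, ignore_trailing: bool) -> String {
    // Step 1: the quoting decision looks at the untrimmed input.
    let needs_quoting = value
        .chars()
        .any(|c| matches!(c, ',' | '"' | '\n' | '\r'));

    // Step 2: whitespace is removed only when the matching option is true.
    let mut v = value;
    if ignore_leading {
        v = v.trim_start();
    }
    if ignore_trailing {
        v = v.trim_end();
    }

    if !needs_quoting {
        return v.to_string();
    }

    // Step 3: quote the field and escape embedded quote characters.
    let mut out = String::from("\"");
    for c in v.chars() {
        if c == '"' {
            out.push('\\');
        }
        out.push(c);
    }
    out.push('"');
    out
}
```

The point of the sketch is that with both options set to `false` the whitespace survives untouched, so a value made of whitespace plus a quote character is never reduced to an empty field.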
3. Decimal formatting mismatch
Spark renders some small decimal values in scientific notation (for example, a zero with 18 fractional digits is written as 0E-18), while Comet uses fixed-point notation.
| Expected (Spark) | Actual (Comet) |
|---|---|
| `0E-18` | `0.000000000000000000` |
Fix: Align decimal-to-string casting with Spark's formatting behavior.
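One possible direction, assuming Spark's output here follows the rules of `java.math.BigDecimal.toString` (which the `0E-18` value is consistent with): use plain notation when the scale is non-negative and the adjusted exponent is at least -6, and scientific notation otherwise. The function name `java_style_decimal_string` is hypothetical, and the sketch takes the unscaled `i128` value and scale of a decimal as input.

```rust
/// Hypothetical sketch of `java.math.BigDecimal.toString` formatting for a
/// decimal given as an unscaled integer plus a scale.
fn java_style_decimal_string(unscaled: i128, scale: i32) -> String {
    let digits = unscaled.unsigned_abs().to_string();
    let n = digits.len() as i64;
    let scale = scale as i64;
    // Adjusted exponent: exponent of the first digit in scientific notation.
    let adjusted = -scale + (n - 1);

    let mut out = String::new();
    if unscaled < 0 {
        out.push('-');
    }

    if scale >= 0 && adjusted >= -6 {
        // Plain (fixed-point) notation.
        if scale == 0 {
            out.push_str(&digits);
        } else if n > scale {
            let split = (n - scale) as usize;
            out.push_str(&digits[..split]);
            out.push('.');
            out.push_str(&digits[split..]);
        } else {
            out.push_str("0.");
            out.push_str(&"0".repeat((scale - n) as usize));
            out.push_str(&digits);
        }
    } else {
        // Scientific notation: one digit before the decimal point.
        out.push_str(&digits[..1]);
        if n > 1 {
            out.push('.');
            out.push_str(&digits[1..]);
        }
        out.push('E');
        if adjusted >= 0 {
            out.push('+');
        }
        out.push_str(&adjusted.to_string());
    }
    out
}
```

Under these rules a zero with scale 18 has an adjusted exponent of -18, which falls below the -6 threshold and therefore takes the scientific branch.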
4. NPE with single-column struct (needs investigation)
NullPointerException occurs when processing single-column structs with certain null patterns. This may be a Spark-side issue with how Comet's output is handled, but needs investigation.
Reproduction
Fuzz tests were added in CometCsvExpressionSuite.scala that reproduce these issues:
- to_csv - edge case: delimiter in null value representation
- to_csv - fuzz test: comprehensive random data and options
- to_csv - edge case: numeric boundary values
- to_csv - edge case: single column struct
Related
- PR #3004 (Feat: to_csv) - Initial `to_csv` implementation