Feat: to_csv #3004
Conversation
```proto
  CsvWriteOptions options = 2;
}

message CsvWriteOptions {
```
Codecov Report ❌

```
@@            Coverage Diff             @@
##             main    #3004      +/-   ##
============================================
+ Coverage   56.12%   60.02%    +3.89%
- Complexity    976     1418      +442
============================================
  Files         119      170       +51
  Lines       11743    15742     +3999
  Branches     2251     2598      +347
============================================
+ Hits         6591     9449     +2858
- Misses       4012     4976      +964
- Partials     1140     1317      +177
```
parthchandra left a comment
(Sorry for the delay in reviewing.) This looks pretty good to me, pending CI.
Also a minor comment on escaping. Can you confirm that this behaviour is consistent with Spark?
```rust
fn escape_value(value: &str, quote: &str, escape: &str, output: &mut String) {
    for ch in value.chars() {
        let ch_str = ch.to_string();
        if ch_str == quote || ch_str == escape {
```
The CSV spec does not define a dedicated escape character; the preferred way to escape a double quote is to double it (but only if the string is enclosed in double quotes): https://datatracker.ietf.org/doc/html/rfc4180#section-2
Not sure what Spark does here.
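A minimal sketch of the RFC 4180 behaviour described above, where doubling is the only escape for a quote (`rfc4180_quote` is a hypothetical helper for illustration, not code from the PR):

```rust
/// Quote a field RFC 4180-style: enclose it in double quotes and
/// escape each embedded double quote by doubling it.
fn rfc4180_quote(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for ch in value.chars() {
        if ch == '"' {
            out.push('"'); // escape a quote by doubling it
        }
        out.push(ch);
    }
    out.push('"');
    out
}

fn main() {
    // say "hi"  ->  "say ""hi"""
    println!("{}", rfc4180_quote(r#"say "hi""#));
}
```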
This is necessary for the case:

```scala
sql(s"""insert into $table values('a\\\\b')""")
checkSparkAnswerAndOperator(
  df.select(to_csv(struct(col("col"), lit(1)), Map("quoteAll" -> "true").asJava)))
```

This may seem insignificant, but I want to cover the maximum number of edge cases.
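The quoted diff takes `quote` and `escape` as `&str`; a sketch of the same idea, simplified here to `char` parameters (so not the PR's exact signature), which prefixes both the quote and the escape character with the configured escape character:

```rust
// Reconstructed sketch, not the PR's exact code: escape-based quoting
// in the style of Spark's `escape` option, where both the quote and
// the escape character themselves get prefixed with the escape char.
fn escape_value(value: &str, quote: char, escape: char, output: &mut String) {
    for ch in value.chars() {
        if ch == quote || ch == escape {
            output.push(escape);
        }
        output.push(ch);
    }
}

fn main() {
    let mut out = String::new();
    escape_value(r"a\b", '"', '\\', &mut out);
    println!("{out}"); // the backslash is doubled
}
```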
```rust
let needs_quoting = write_options.quote_all
    || (is_string_field
        && !string_arrays[col_idx].is_null(row_idx)
        && (value.contains(&write_options.delimiter)
```
If leading/trailing whitespace is not being ignored, then a string with leading or trailing whitespace should be quoted.
The same applies to newlines.
Spark doesn't quote strings with leading or trailing whitespace when the quoteAll option is disabled. This edge case is covered by unit tests:

```scala
sql(s"insert into $table values(' abc ')")
checkSparkAnswerAndOperator(
  df.select(
    to_csv(
      struct(col("col"), lit(1)),
      Map(
        "delimiter" -> ";",
        "ignoreLeadingWhiteSpace" -> "false",
        "ignoreTrailingWhiteSpace" -> "false").asJava)))
```

For the case of newlines you are absolutely right; fixed this.
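Per the resolution above (newlines force quoting, surrounding whitespace does not), the quoting predicate could be sketched like this; the function and parameter names are illustrative, not the PR's:

```rust
// Illustrative sketch of the quoting decision discussed above: quote
// when quoteAll is set, or when the value contains the delimiter, the
// quote character, or a newline. Leading/trailing whitespace alone
// does NOT force quoting, matching the Spark behaviour described.
fn needs_quoting(value: &str, delimiter: char, quote: char, quote_all: bool) -> bool {
    quote_all
        || value.contains(delimiter)
        || value.contains(quote)
        || value.contains('\n')
        || value.contains('\r')
}

fn main() {
    println!("{}", needs_quoting("a;b", ';', '"', false)); // delimiter present
    println!("{}", needs_quoting(" abc ", ';', '"', false)); // whitespace alone
}
```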
Thanks @parthchandra for the review. I'll try to resolve the conversations tomorrow.
@parthchandra could you take another look when you have time?
```rust
impl Display for CsvWriteOptions {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        write!(
```
nit: should delimiter also be shown here?
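For illustration, a Display implementation that also shows the delimiter might look like the sketch below; the struct and its fields are assumptions for this example (the PR's CsvWriteOptions is protobuf-generated and may differ):

```rust
use std::fmt::{Display, Formatter};

// Hypothetical stand-in for the generated options struct.
struct CsvWriteOptions {
    delimiter: String,
    quote: String,
    escape: String,
}

impl Display for CsvWriteOptions {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        // Include the delimiter alongside the other options, per the nit.
        write!(
            f,
            "delimiter={}, quote={}, escape={}",
            self.delimiter, self.quote, self.escape
        )
    }
}

fn main() {
    let opts = CsvWriteOptions {
        delimiter: ";".into(),
        quote: "\"".into(),
        escape: "\\".into(),
    };
    println!("{opts}");
}
```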
```scala
    s"The schema ${expr.inputSchema} is not supported because " +
      s"it includes incompatible data types: $incompatibleDataTypes"))
}
Compatible()
```
Can we guarantee full compatibility with the UnivocityGenerator that Spark uses? Perhaps this should be marked as incompatible for now until we have sufficient fuzz testing to confirm compatibility?
```rust
let quote_char = write_options.quote.chars().next().unwrap_or('"');
let escape_char = write_options.escape.chars().next().unwrap_or('\\');
```
Does Spark limit these to chars, or can they be multi-character strings?
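A small illustration of why that question matters: `.chars().next()` silently keeps only the first character of a multi-character option string (`first_char_or` is a hypothetical helper, not from the PR):

```rust
// Mirrors the pattern in the quoted diff: take the first char of an
// option string, falling back to a default when the string is empty.
fn first_char_or(s: &str, default: char) -> char {
    s.chars().next().unwrap_or(default)
}

fn main() {
    println!("{}", first_char_or("\"", '"'));  // normal single-char option
    println!("{}", first_char_or("", '\\'));   // empty -> default
    // A multi-character option is silently truncated to its first char:
    println!("{}", first_char_or("ab", '"'));
}
```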
@kazantsev-maksim This looks great, thank you! My only concern is that we are claiming full compatibility with Spark and I'm not sure that the tests are comprehensive enough to prove that. I am going to do some testing with this PR and see if I can suggest more tests to add.
andygrove left a comment
Thanks @kazantsev-maksim. I did find some compatibility issues, but they are mostly edge cases. I filed an issue to look at these in the future: #3232
Which issue does this PR close?
Rationale for this change
A basic implementation of the Spark to_csv function was added: https://spark.apache.org/docs/latest/api/sql/index.html#to_csv
What changes are included in this PR?
How are these changes tested?
Benchmark results (optimization still needed):
