Conversation

@kazantsev-maksim (Contributor) commented Dec 28, 2025

Which issue does this PR close?

  • N/A

Rationale for this change

Adds a basic implementation of the Spark to_csv function - https://spark.apache.org/docs/latest/api/sql/index.html#to_csv

  1. Handling of complex types must be implemented in a future iteration.
  2. The processing of types such as DateType, TimestampType, TimestampNTZType, and BinaryType is currently inconsistent with Spark's behavior.

What changes are included in this PR?

How are these changes tested?

  1. Added unit tests
  2. Added benchmark tests

Benchmark results (optimization still needed):
[to_csv_benchmark_result screenshot]

@kazantsev-maksim kazantsev-maksim marked this pull request as draft December 28, 2025 13:38
@kazantsev-maksim kazantsev-maksim marked this pull request as ready for review January 9, 2026 10:56
CsvWriteOptions options = 2;
}

message CsvWriteOptions {
@codecov-commenter commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 90.24390% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.02%. Comparing base (f09f8af) to head (cea1f1a).
⚠️ Report is 859 commits behind head on main.

Files with missing lines | Patch % | Lines
...rc/main/scala/org/apache/comet/serde/structs.scala | 89.74% | 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3004      +/-   ##
============================================
+ Coverage     56.12%   60.02%   +3.89%     
- Complexity      976     1418     +442     
============================================
  Files           119      170      +51     
  Lines         11743    15742    +3999     
  Branches       2251     2598     +347     
============================================
+ Hits           6591     9449    +2858     
- Misses         4012     4976     +964     
- Partials       1140     1317     +177     


@parthchandra (Contributor) left a comment:

(Sorry for the delay in reviewing.) This looks pretty good to me, pending CI.
Also a minor comment on escaping. Can you confirm that this behaviour is consistent with Spark?

fn escape_value(value: &str, quote: &str, escape: &str, output: &mut String) {
    for ch in value.chars() {
        let ch_str = ch.to_string();
        if ch_str == quote || ch_str == escape {
Contributor:

The CSV spec does not have any special escape for the escape character, and the preferred way to escape the double quote is another double quote (but only if the string is enclosed in double quotes!) - https://datatracker.ietf.org/doc/html/rfc4180#section-2
Not sure what Spark does here.
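
For illustration, a minimal sketch of the RFC 4180-style quote doubling described above (the function name and wrapper are illustrative, not from this PR):

```rust
/// Minimal sketch of RFC 4180-style escaping: the field is enclosed in
/// double quotes and an embedded double quote is represented by doubling
/// it; no separate escape character is involved.
fn rfc4180_quote(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for ch in value.chars() {
        if ch == '"' {
            out.push('"'); // double the quote instead of escaping it
        }
        out.push(ch);
    }
    out.push('"');
    out
}

fn main() {
    assert_eq!(rfc4180_quote("say \"hi\""), "\"say \"\"hi\"\"\"");
}
```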

Contributor Author (@kazantsev-maksim):

This is necessary for cases like:

sql(s"""insert into $table values('a\\\\b')""")
checkSparkAnswerAndOperator(
  df.select(to_csv(struct(col("col"), lit(1)), Map("quoteAll" -> "true").asJava)))

This may seem insignificant, but I want to cover the maximum number of edge cases.
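
For illustration, a minimal sketch of the behavior the test above encodes, where both the quote character and the escape character itself get escaped inside a quoted field (escape_quoted is an illustrative name, not the PR's actual function):

```rust
/// Sketch of escaping inside a quoted field: the quote character and the
/// escape character itself are both prefixed with the escape character,
/// so the input a\b serializes as "a\\b" under quoteAll=true.
fn escape_quoted(value: &str, quote: char, escape: char) -> String {
    let mut out = String::new();
    for ch in value.chars() {
        if ch == quote || ch == escape {
            out.push(escape); // escape the quote and the escape itself
        }
        out.push(ch);
    }
    out
}

fn main() {
    // Field body only; the caller wraps it in quotes when quoteAll is set.
    assert_eq!(escape_quoted(r"a\b", '"', '\\'), r"a\\b");
}
```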

let needs_quoting = write_options.quote_all
    || (is_string_field
        && !string_arrays[col_idx].is_null(row_idx)
        && (value.contains(&write_options.delimiter)
Contributor:

If not ignoring leading/trailing whitespace, then the string should be quoted if there is leading or trailing whitespace.
Also if there are newlines.

Contributor Author (@kazantsev-maksim), Jan 17, 2026:

Spark does not quote strings with leading or trailing whitespace when the quoteAll option is disabled. This edge case is covered by unit tests:

sql(s"insert into $table values('  abc   ')")
checkSparkAnswerAndOperator(
  df.select(
    to_csv(
      struct(col("col"), lit(1)),
      Map(
        "delimiter" -> ";",
        "ignoreLeadingWhiteSpace" -> "false",
        "ignoreTrailingWhiteSpace" -> "false").asJava)))

Contributor Author (@kazantsev-maksim):

For the case of newlines you are absolutely right - fixed this.
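
Summarizing this thread, a sketch of the quoting decision it converged on (illustrative names and a free function; the PR's actual code is the needs_quoting expression quoted above, extended with the newline check):

```rust
/// Sketch of the quoting decision: quote when quoteAll is set, or when a
/// string value contains the delimiter, the quote character, or a line
/// break. Leading/trailing whitespace alone does not trigger quoting,
/// matching the Spark behavior shown above.
fn needs_quoting(value: &str, quote_all: bool, delimiter: char, quote: char) -> bool {
    quote_all
        || value.contains(delimiter)
        || value.contains(quote)
        || value.contains('\n')
        || value.contains('\r') // assumption: carriage returns treated like newlines
}
```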

@kazantsev-maksim (Contributor Author):

Thanks @parthchandra for the review. I'll try to resolve the conversations tomorrow.

@kazantsev-maksim (Contributor Author):

@parthchandra could you take another look when you have time?


impl Display for CsvWriteOptions {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        write!(
Member:

nit: should delimiter also be shown here?
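
For illustration, a sketch of what the nit suggests, assuming CsvWriteOptions carries string fields named delimiter, quote, and escape (field names inferred from the snippets in this PR, not confirmed):

```rust
use std::fmt::{Display, Formatter};

struct CsvWriteOptions {
    delimiter: String,
    quote: String,
    escape: String,
}

impl Display for CsvWriteOptions {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        // Show the delimiter alongside quote and escape, per the review nit.
        write!(
            f,
            "CsvWriteOptions(delimiter={}, quote={}, escape={})",
            self.delimiter, self.quote, self.escape
        )
    }
}

fn main() {
    let opts = CsvWriteOptions {
        delimiter: ",".into(),
        quote: "\"".into(),
        escape: "\\".into(),
    };
    println!("{opts}");
}
```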

s"The schema ${expr.inputSchema} is not supported because " +
s"it includes a incompatible data types: $incompatibleDataTypes"))
}
Compatible()
Member:

Can we guarantee full compatibility with the UnivocityGenerator that Spark uses? Perhaps this should be marked as incompatible for now until we have sufficient fuzz testing to confirm compatibility?

Comment on lines +153 to +154
let quote_char = write_options.quote.chars().next().unwrap_or('"');
let escape_char = write_options.escape.chars().next().unwrap_or('\\');
Member:

Does Spark limit these to chars, or can they be multi-character strings?

@andygrove (Member):

@kazantsev-maksim This looks great, thank you! My only concern is that we are claiming full compatibility with Spark and I'm not sure that the tests are comprehensive enough to prove that. I am going to do some testing with this PR and see if I can suggest more tests to add.

@andygrove (Member) left a comment:

Thanks @kazantsev-maksim. I did find some compatibility issues, but they are mostly edge cases. I filed an issue to look at these in the future: #3232
