Feat: to_csv #3004
Conversation
```proto
  CsvWriteOptions options = 2;
}

message CsvWriteOptions {
```
Codecov Report ❌

```
@@            Coverage Diff             @@
##             main    #3004      +/-   ##
============================================
+ Coverage   56.12%   60.02%    +3.89%
- Complexity    976     1418      +442
============================================
  Files         119      170       +51
  Lines       11743    15742     +3999
  Branches     2251     2598      +347
============================================
+ Hits         6591     9449     +2858
- Misses       4012     4976      +964
- Partials     1140     1317      +177
```
parthchandra left a comment
(Sorry for the delay in reviewing.) This looks pretty good to me, pending CI.
Also a minor comment on escaping. Can you confirm that this behaviour is consistent with Spark?
```rust
fn escape_value(value: &str, quote: &str, escape: &str, output: &mut String) {
    for ch in value.chars() {
        let ch_str = ch.to_string();
        if ch_str == quote || ch_str == escape {
```
The CSV spec does not define a dedicated escape character; the preferred way to escape a double quote is to double it (but only if the string is enclosed in double quotes): https://datatracker.ietf.org/doc/html/rfc4180#section-2
Not sure what Spark does here.
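A minimal sketch of the RFC 4180 behaviour described above, where doubling is the only escape for a quote (`rfc4180_quote` is a hypothetical helper for illustration, not code from the PR):

```rust
/// Quote a field RFC 4180-style: enclose it in double quotes and
/// escape each embedded double quote by doubling it.
fn rfc4180_quote(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for ch in value.chars() {
        if ch == '"' {
            out.push('"'); // escape a quote by doubling it
        }
        out.push(ch);
    }
    out.push('"');
    out
}

fn main() {
    // say "hi"  ->  "say ""hi"""
    println!("{}", rfc4180_quote(r#"say "hi""#));
}
```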
This is necessary for the case:

```scala
sql(s"""insert into $table values('a\\\\b')""")
checkSparkAnswerAndOperator(
  df.select(to_csv(struct(col("col"), lit(1)), Map("quoteAll" -> "true").asJava)))
```

This may seem insignificant, but I want to cover the maximum number of edge cases.
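The quoted diff takes `quote` and `escape` as `&str`; a sketch of the same idea, simplified here to `char` parameters (so not the PR's exact signature), which prefixes both the quote and the escape character with the configured escape character:

```rust
// Reconstructed sketch, not the PR's exact code: escape-based quoting
// in the style of Spark's `escape` option, where both the quote and
// the escape character themselves get prefixed with the escape char.
fn escape_value(value: &str, quote: char, escape: char, output: &mut String) {
    for ch in value.chars() {
        if ch == quote || ch == escape {
            output.push(escape);
        }
        output.push(ch);
    }
}

fn main() {
    let mut out = String::new();
    escape_value(r"a\b", '"', '\\', &mut out);
    println!("{out}"); // the backslash is doubled
}
```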
```rust
let needs_quoting = write_options.quote_all
    || (is_string_field
        && !string_arrays[col_idx].is_null(row_idx)
        && (value.contains(&write_options.delimiter)
```
If leading/trailing whitespace is not being ignored, then a string with leading or trailing whitespace should be quoted.
The same applies to newlines.
Spark doesn't quote strings with leading or trailing whitespace when the quoteAll option is disabled. This edge case is covered by unit tests:

```scala
sql(s"insert into $table values(' abc ')")
checkSparkAnswerAndOperator(
  df.select(
    to_csv(
      struct(col("col"), lit(1)),
      Map(
        "delimiter" -> ";",
        "ignoreLeadingWhiteSpace" -> "false",
        "ignoreTrailingWhiteSpace" -> "false").asJava)))
```

For the case of newlines you are absolutely right; fixed this.
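Per the resolution above (newlines force quoting, surrounding whitespace does not), the quoting predicate could be sketched like this; the function and parameter names are illustrative, not the PR's:

```rust
// Illustrative sketch of the quoting decision discussed above: quote
// when quoteAll is set, or when the value contains the delimiter, the
// quote character, or a newline. Leading/trailing whitespace alone
// does NOT force quoting, matching the Spark behaviour described.
fn needs_quoting(value: &str, delimiter: char, quote: char, quote_all: bool) -> bool {
    quote_all
        || value.contains(delimiter)
        || value.contains(quote)
        || value.contains('\n')
        || value.contains('\r')
}

fn main() {
    println!("{}", needs_quoting("a;b", ';', '"', false)); // delimiter present
    println!("{}", needs_quoting(" abc ", ';', '"', false)); // whitespace alone
}
```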
Thanks @parthchandra for the review. I'll try to resolve the conversations tomorrow.
@parthchandra could you take another look when you have time?
```rust
impl Display for CsvWriteOptions {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        write!(
```
nit: should delimiter also be shown here?
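For illustration, a Display implementation that also shows the delimiter might look like the sketch below; the struct and its fields are assumptions for this example (the PR's CsvWriteOptions is protobuf-generated and may differ):

```rust
use std::fmt::{Display, Formatter};

// Hypothetical stand-in for the generated options struct.
struct CsvWriteOptions {
    delimiter: String,
    quote: String,
    escape: String,
}

impl Display for CsvWriteOptions {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        // Include the delimiter alongside the other options, per the nit.
        write!(
            f,
            "delimiter={}, quote={}, escape={}",
            self.delimiter, self.quote, self.escape
        )
    }
}

fn main() {
    let opts = CsvWriteOptions {
        delimiter: ";".into(),
        quote: "\"".into(),
        escape: "\\".into(),
    };
    println!("{opts}");
}
```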
```scala
    s"The schema ${expr.inputSchema} is not supported because " +
      s"it includes incompatible data types: $incompatibleDataTypes"))
}
Compatible()
```
Can we guarantee full compatibility with the UnivocityGenerator that Spark uses? Perhaps this should be marked as incompatible for now until we have sufficient fuzz testing to confirm compatibility?
```rust
let quote_char = write_options.quote.chars().next().unwrap_or('"');
let escape_char = write_options.escape.chars().next().unwrap_or('\\');
```
Does Spark limit these to chars, or can they be multi-character strings?
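A small illustration of why that question matters: `.chars().next()` silently keeps only the first character of a multi-character option string (`first_char_or` is a hypothetical helper, not from the PR):

```rust
// Mirrors the pattern in the quoted diff: take the first char of an
// option string, falling back to a default when the string is empty.
fn first_char_or(s: &str, default: char) -> char {
    s.chars().next().unwrap_or(default)
}

fn main() {
    println!("{}", first_char_or("\"", '"'));  // normal single-char option
    println!("{}", first_char_or("", '\\'));   // empty -> default
    // A multi-character option is silently truncated to its first char:
    println!("{}", first_char_or("ab", '"'));
}
```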
@kazantsev-maksim This looks great, thank you! My only concern is that we are claiming full compatibility with Spark and I'm not sure that the tests are comprehensive enough to prove that. I am going to do some testing with this PR and see if I can suggest more tests to add.
andygrove left a comment
Thanks @kazantsev-maksim. I did find some compatibility issues, but they are mostly edge cases. I filed an issue to look at these in the future: #3232
Which issue does this PR close?
Rationale for this change
A basic implementation of the Spark to_csv function was added: https://spark.apache.org/docs/latest/api/sql/index.html#to_csv
What changes are included in this PR?
How are these changes tested?
Benchmark results (optimization still needed):
