[SPARK-56414][SQL] Per-write options should take precedence over session config in file source writes #55280

Open

cloud-fan wants to merge 4 commits into apache:master from cloud-fan:fix-parquet-write-option-priority

Conversation


@cloud-fan cloud-fan commented Apr 9, 2026

What changes were proposed in this pull request?

In the Parquet and Avro prepareWrite implementations, several Hadoop configuration keys are unconditionally set from the session-level SQLConf, silently overwriting any per-write options the user provided. Additionally, some write paths (FileStreamSink, InsertIntoHiveTable) create the Hadoop conf via newHadoopConf() without merging write options at all, so per-write options never reach the conf.

This PR fixes both issues:

  1. FileFormatWriter.write: Merges write options into the Job's Hadoop conf before calling prepareWrite. This is the central fix: it ensures per-write options reach the conf regardless of how the caller created it. It also sidesteps CaseInsensitiveMap's key lowercasing by copying the user's original map keys into the conf.

  2. ParquetUtils.prepareWrite: Uses DataSourceUtils.setConfIfAbsent so that SQLConf defaults are only applied when the key is not already present in the conf (i.e., no per-write option was provided). Affected keys:

    • spark.sql.parquet.writeLegacyFormat
    • spark.sql.parquet.outputTimestampType
    • spark.sql.parquet.fieldId.write.enabled
    • spark.sql.legacy.parquet.nanosAsLong
    • spark.sql.parquet.annotateVariantLogicalType
  3. AvroUtils.prepareWrite: Same treatment for Avro compression settings:

    • Zstandard buffer pool (avro.output.codec.zstd.bufferpool)
    • Compression levels (avro.mapred.<codec>.level)
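The interaction between the two steps above can be sketched as follows. This is a simplified standalone model, not the actual Spark code: a mutable map stands in for the Hadoop `Configuration`, and `setConfIfAbsent` mirrors the behavior of the `DataSourceUtils.setConfIfAbsent` helper named in the PR.

```scala
import scala.collection.mutable

// Step 1 (FileFormatWriter.write): merge per-write options into the job's
// conf using the caller's original keys, so CaseInsensitiveMap lowercasing
// cannot corrupt case-sensitive Hadoop keys.
def mergeWriteOptions(conf: mutable.Map[String, String],
                      options: Map[String, String]): Unit =
  options.foreach { case (k, v) => conf(k) = v }

// Step 2 (prepareWrite): apply the session-level default only when the key
// is not already present, i.e. when no per-write option was provided.
def setConfIfAbsent(conf: mutable.Map[String, String],
                    key: String, sessionValue: String): Unit =
  if (!conf.contains(key)) conf(key) = sessionValue

val conf = mutable.Map.empty[String, String]
mergeWriteOptions(conf,
  Map("spark.sql.parquet.outputTimestampType" -> "TIMESTAMP_MICROS"))

// Per-write option wins over the session default:
setConfIfAbsent(conf, "spark.sql.parquet.outputTimestampType", "INT96")
// No per-write option given, so the session default is applied:
setConfIfAbsent(conf, "spark.sql.parquet.writeLegacyFormat", "false")

println(conf("spark.sql.parquet.outputTimestampType")) // TIMESTAMP_MICROS
println(conf("spark.sql.parquet.writeLegacyFormat"))   // false
```

The key property is that step 1 runs before any prepareWrite, so by the time the format-specific code consults the conf, a per-write option is already present and the set-if-absent check skips the session default.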

Why are the changes needed?

Per-write options (passed via DataFrameWriter.option() or DataStreamWriter.option()) should take precedence over session-level SQLConf defaults. This is already the case for compression codecs in both Parquet and Avro, but other write configuration keys had their per-write values silently overwritten. For example, setting spark.sql.parquet.outputTimestampType as a write option in a streaming sink had no effect because (a) FileStreamSink doesn't merge options into the Hadoop conf, and (b) prepareWrite always replaced the value with the session config.
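As a concrete illustration of the intended precedence, a hypothetical snippet (assuming an active SparkSession `spark`, a DataFrame `df`, and an illustrative output path; the write-option key mirrors the SQLConf key, as in the new tests):

```scala
// Session-level default says INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")

// Per-write option asks for TIMESTAMP_MICROS.
df.write
  .option("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
  .parquet("/tmp/ts-micros-out")

// With this PR the files are written with TIMESTAMP_MICROS; previously the
// session default INT96 silently won.
```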

Does this PR introduce any user-facing change?

Yes. Per-write options for the listed keys now take effect instead of being silently ignored. Previously, only the session-level SQLConf value was used regardless of what was passed as a write option.

How was this patch tested?

  • ParquetEncodingSuite: test verifying per-write outputTimestampType overrides session config for batch writes.
  • FileStreamSinkV1Suite: test verifying per-write outputTimestampType overrides session config for streaming writes (exercises the FileStreamSink path that uses newHadoopConf() without options).

Both tests fail on master and pass with this PR.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-6)

…on config in Parquet and Avro

Co-authored-by: Isaac
@cloud-fan cloud-fan changed the title [SPARK-xxxx][SQL] Per-write options should take precedence over session config in Parquet and Avro [SPARK-56414][SQL] Per-write options should take precedence over session config in Parquet and Avro Apr 9, 2026
@cloud-fan cloud-fan changed the title [SPARK-56414][SQL] Per-write options should take precedence over session config in Parquet and Avro [SPARK-56414][SQL] Per-write options should take precedence over session config in file source writes Apr 9, 2026