
feat(lance): Implement canWrite() in HoodieSparkLanceWriter with configurable max file size for Lance#18341

Open
wombatu-kun wants to merge 2 commits into apache:master from wombatu-kun:lance-canWrite

Conversation

@wombatu-kun (Contributor)

Describe the issue this Pull Request addresses

Closes #17684

Summary and Changelog

Implement canWrite() in HoodieSparkLanceWriter analogously to HoodieBaseParquetWriter.canWrite() by tracking cumulative Arrow buffer sizes in the base class and adding periodic size-limit checks in the Spark writer.

HoodieStorageConfig: Added LANCE_MAX_FILE_SIZE config property (key hoodie.lance.max.file.size, default 120 MB) and a lanceMaxFileSize(long) builder method, consistent with the existing Parquet/ORC/HFile config entries.
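As a rough illustration of the new entry, the key and default can be sketched as plain constants. The real entry uses Hudi's ConfigProperty builder alongside the Parquet/ORC/HFile entries; the constant names below are assumptions for this self-contained sketch, not the actual code.

```java
// Hypothetical sketch of the LANCE_MAX_FILE_SIZE config described in the PR:
// key "hoodie.lance.max.file.size", default 120 MB. The merged code defines
// this via Hudi's ConfigProperty builder in HoodieStorageConfig.
class LanceStorageConfigSketch {
  static final String LANCE_MAX_FILE_SIZE_KEY = "hoodie.lance.max.file.size";
  // 120 MB expressed in bytes, matching the stated default.
  static final long LANCE_MAX_FILE_SIZE_DEFAULT = 120L * 1024 * 1024;
}
```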

HoodieBaseLanceWriter: Added a totalFlushedDataSize field and a getDataSize() accessor. In flushBatch(), after arrowWriter.finishBatch() sets the row count, the method now iterates over root.getFieldVectors() and accumulates vector.getBufferSize() into totalFlushedDataSize before writing to Lance. This provides an uncompressed Arrow buffer size estimate analogous to ParquetWriter.getDataSize().
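The accounting above can be sketched without the Arrow dependency. In the real writer the per-vector sizes come from FieldVector.getBufferSize() on the VectorSchemaRoot after finishBatch(); here they are passed in as a plain array so the sketch stays self-contained (an assumption, not the actual Arrow API surface).

```java
// Sketch of the getDataSize() accounting described in the PR: cumulative
// uncompressed Arrow buffer sizes across all flushed batches.
class LanceDataSizeSketch {
  private long totalFlushedDataSize = 0L;

  // Called once per flushed batch with the buffer size of each field vector
  // (stand-in for iterating root.getFieldVectors() in the real code).
  void recordFlushedBatch(long[] vectorBufferSizes) {
    for (long size : vectorBufferSizes) {
      totalFlushedDataSize += size;
    }
  }

  // Analogous to ParquetWriter.getDataSize(): an uncompressed size estimate.
  long getDataSize() {
    return totalFlushedDataSize;
  }
}
```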

HoodieSparkLanceWriter:

  • Added MIN_RECORDS_FOR_SIZE_CHECK = 100 and MAX_RECORDS_FOR_SIZE_CHECK = 10000 constants (mirrors the Parquet constants).
  • Added maxFileSize and recordCountForNextSizeCheck fields.
  • Updated the main constructor to accept long maxFileSize. The no-arg secondary constructor now delegates with Long.MAX_VALUE (no limit), and a new secondary constructor accepting an explicit maxFileSize is added for use by HoodieInternalRowFileWriterFactory.
  • canWrite() implementation: checks periodically based on recordCountForNextSizeCheck, computes average record size from getDataSize()/writtenCount, returns false when within two average records of maxFileSize, and adaptively schedules the next check.
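The check described in the last bullet can be sketched as follows. This is a hedged model of the Parquet-style scheduling, not the merged code: field names, the exact rescheduling formula, and the "within two average records" cutoff are assumptions based on the PR description, and dataSize stands in for getDataSize().

```java
// Sketch of the periodic, adaptive canWrite() size check described in the PR.
class LanceCanWriteSketch {
  static final long MIN_RECORDS_FOR_SIZE_CHECK = 100;
  static final long MAX_RECORDS_FOR_SIZE_CHECK = 10_000;

  private final long maxFileSize;
  private long writtenCount = 0;
  private long recordCountForNextSizeCheck = MIN_RECORDS_FOR_SIZE_CHECK;

  LanceCanWriteSketch(long maxFileSize) {
    this.maxFileSize = maxFileSize;
  }

  boolean canWrite(long dataSize) {
    if (writtenCount < recordCountForNextSizeCheck) {
      return true; // skip the (estimated) size check until the next scheduled count
    }
    long avgRecordSize = dataSize / Math.max(writtenCount, 1);
    // Stop accepting records once within two average records of the cap.
    if (dataSize + 2 * avgRecordSize >= maxFileSize) {
      return false;
    }
    // Adaptively schedule the next check: check sooner as the file fills up.
    long recordsUntilFull = avgRecordSize == 0
        ? MAX_RECORDS_FOR_SIZE_CHECK
        : (maxFileSize - dataSize) / avgRecordSize;
    recordCountForNextSizeCheck = writtenCount
        + Math.min(Math.max(recordsUntilFull / 2, MIN_RECORDS_FOR_SIZE_CHECK),
                   MAX_RECORDS_FOR_SIZE_CHECK);
    return true;
  }

  void recordWritten() {
    writtenCount++;
  }
}
```

The adaptive rescheduling avoids computing the data size on every record while still catching the limit before a large overshoot, which is the same trade-off the Parquet writer makes.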

HoodieSparkFileWriterFactory: Reads LANCE_MAX_FILE_SIZE from config and passes it to the HoodieSparkLanceWriter constructor.

HoodieInternalRowFileWriterFactory: the getInternalRowFileWriter method reads LANCE_MAX_FILE_SIZE and passes it (through newLanceInternalRowFileWriter) to the new HoodieSparkLanceWriter constructor.

Impact

Provides a proper implementation that checks whether the file has reached a configurable size threshold so that the write can roll over to a new file.

Risk Level

none

Documentation Update

The new LANCE_MAX_FILE_SIZE config property (hoodie.lance.max.file.size, default 120 MB) needs to be documented.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@wombatu-kun wombatu-kun requested review from rahil-c and voonhous March 18, 2026 15:29
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Mar 18, 2026
@voonhous (Member)

Ack, will review tomorrow!

@rahil-c (Collaborator) commented Mar 19, 2026

Thanks @wombatu-kun for the help!

@voonhous can you review this if you get a chance, since gonna be ooto. Once back will review

HoodieStorage storage,
boolean populateMetaFields,
Option<BloomFilter> bloomFilterOpt) {
this(file, sparkSchema, instantTime, taskContextSupplier, storage, populateMetaFields, bloomFilterOpt, Long.MAX_VALUE);

@rahil-c (Collaborator) commented on the diff, Mar 20, 2026

Shouldn't we have some reasonable default here for a maxFileSize rather than Long.MAX Value?

@wombatu-kun (Contributor, Author) replied:

ok, use LANCE_MAX_FILE_SIZE.defaultValue() instead of Long.MAX_VALUE
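The change agreed in this thread amounts to swapping the delegation target of the secondary constructor. A minimal sketch, with the default inlined as a constant so it compiles standalone (the real code calls LANCE_MAX_FILE_SIZE.defaultValue() on the Hudi ConfigProperty):

```java
// Sketch of the revised constructor delegation: fall back to the configured
// default (120 MB) rather than Long.MAX_VALUE (no limit).
class LanceWriterDefaultSketch {
  static final long LANCE_MAX_FILE_SIZE_DEFAULT = 120L * 1024 * 1024;

  final long maxFileSize;

  // Secondary constructor: delegates with the config default.
  LanceWriterDefaultSketch() {
    this(LANCE_MAX_FILE_SIZE_DEFAULT);
  }

  LanceWriterDefaultSketch(long maxFileSize) {
    this.maxFileSize = maxFileSize;
  }
}
```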

@wombatu-kun wombatu-kun requested a review from rahil-c March 21, 2026 00:45
@hudi-bot (Collaborator)

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 15.15152% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.05%. Comparing base (cad08b1) to head (d96da1e).
⚠️ Report is 6 commits behind head on master.

Files with missing lines | Patch % | Lines
...apache/hudi/io/storage/HoodieSparkLanceWriter.java | 0.00% | 16 Missing ⚠️
...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java | 0.00% | 5 Missing ⚠️
...torage/row/HoodieInternalRowFileWriterFactory.java | 0.00% | 3 Missing ⚠️
.../hudi/io/storage/HoodieSparkFileWriterFactory.java | 0.00% | 2 Missing ⚠️
...apache/hudi/common/config/HoodieStorageConfig.java | 71.42% | 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (cad08b1) and HEAD (d96da1e).

HEAD has 27 uploads less than BASE
Flag | BASE (cad08b1) | HEAD (d96da1e)
spark-scala-tests | 10 | 0
utilities | 1 | 0
spark-java-tests | 15 | 0
common-and-other-modules | 1 | 0
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18341       +/-   ##
=============================================
- Coverage     68.41%   54.05%   -14.37%     
+ Complexity    27408    12139    -15269     
=============================================
  Files          2423     1421     -1002     
  Lines        132458    69952    -62506     
  Branches      15972     7795     -8177     
=============================================
- Hits          90623    37812    -52811     
+ Misses        34784    28781     -6003     
+ Partials       7051     3359     -3692     
Flag | Coverage Δ
common-and-other-modules | ?
hadoop-mr-java-client | 45.08% <41.66%> (-0.08%) ⬇️
spark-client-hadoop-common | 48.20% <15.15%> (-0.12%) ⬇️
spark-java-tests | ?
spark-scala-tests | ?
utilities | ?

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
.../hudi/io/storage/HoodieSparkFileWriterFactory.java | 0.00% <0.00%> (-85.72%) ⬇️
...apache/hudi/common/config/HoodieStorageConfig.java | 86.64% <71.42%> (-2.78%) ⬇️
...torage/row/HoodieInternalRowFileWriterFactory.java | 0.00% <0.00%> (-88.89%) ⬇️
...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java | 0.00% <0.00%> (-66.67%) ⬇️
...apache/hudi/io/storage/HoodieSparkLanceWriter.java | 0.00% <0.00%> (-95.24%) ⬇️

... and 1770 files with indirect coverage changes



Labels

size:M PR with lines of changes in (100, 300]


Development

Successfully merging this pull request may close these issues.

Implement canWrite() in HoodieSparkLanceWriter with some configurable max size

5 participants