feat(manifest): add ManifestFilterManager and ManifestMergeManager#652
feat(manifest): add ManifestFilterManager and ManifestMergeManager#652gty404 wants to merge 1 commit into
Conversation
Implement two manifest management classes for table write operations: - ManifestFilterManager: filters manifest entries by row filter expression, file path, or partition value; supports FailMissingDeletePaths validation. Rewrites manifests that contain matching files, marking entries as DELETED; passes through manifests that cannot contain matching files unchanged. - ManifestMergeManager: merges small manifests using greedy bin-packing, grouping by partition_spec_id (manifests with different specs are never merged). Oversized manifests pass through unchanged. ADDED entries from prior manifests become EXISTING when merged (matching Java semantics).
| std::vector<ManifestFile> result; | ||
|
|
||
| for (const auto& manifest : group) { | ||
| if (manifest.manifest_length >= target_size_bytes_) { |
There was a problem hiding this comment.
Java uses ListPacker.packEnd(group, ManifestFile::length) with lookback 1 specifically to preserve manifest order and leave the under-filled bin at the beginning. This forward hand-rolled packing keeps the current small bin open across an oversized manifest, so a sequence like [small, oversized, small] can merge the two small manifests across the oversized one and reorder output relative to Java.
| /// entries are rewritten with those entries marked DELETED. | ||
| /// | ||
| /// The manager is content-agnostic: pass ManifestContent::kData to process data | ||
| /// manifests, or ManifestContent::kDeletes to process delete manifests. |
There was a problem hiding this comment.
If this manager is intended to cover delete manifests, it is still missing the Java DeleteFileFilterManager cleanup semantics. Java sets dropDeleteFilesOlderThan(minDataSequenceNumber) and removeDanglingDeletesFor(filesToBeDeleted), then marks old delete files and dangling DVs as deleted while rewriting delete manifests. This implementation only applies path/partition/row-filter deletes, so a data-file delete or truncate can leave obsolete delete files/DVs live in the delete manifests, diverging from Java behavior and its DV cleanup tests. If delete-manifest cleanup is planned in a later layer, it would be good to avoid advertising kDeletes parity here or make the missing behavior explicit.
| const ManifestFile& first = all[0]; | ||
|
|
||
| // Group manifests by partition_spec_id — never merge across specs | ||
| std::map<int32_t, std::vector<ManifestFile>> by_spec; |
There was a problem hiding this comment.
Could we keep data and delete manifests type-separated in this API? Java has distinct DataFileMergeManager and DeleteFileMergeManager, and MergingSnapshotProducer passes dataManifests/deleteManifests separately. Here the grouping key is only partition_spec_id, while FlushBin creates the writer from the first manifest content; if a caller accidentally gives a data and delete manifest with the same spec in one call, the bin can mix contents and fail mid-rewrite, a state Java typed managers make impossible. Either include content in the grouping/constructor or make the manager content-specific, and add a mixed-content / kDeletes merge test.
| /// \param file_io File IO used to open existing manifests for reading | ||
| /// \param writer_factory Factory to create new ManifestWriter instances | ||
| /// \return The merged manifest list, or an error | ||
| Result<std::vector<ManifestFile>> MergeManifests( |
There was a problem hiding this comment.
Could this manager expose the lifecycle state Java relies on for retries? Java ManifestMergeManager caches merged manifests, reports replacedManifestsCount(), and has cleanUncommitted(committed) so MergingSnapshotProducer can delete manifests produced by failed retry attempts and build created/kept/replaced summary accurately. This API only returns the final vector, so the later snapshot update layer has no way to distinguish committed vs uncommitted merged outputs or count replaced manifests consistently with Java tests such as transaction retry merge cleanup. The same lifecycle surface is also needed on the filter side for rewritten manifests.
| /// | ||
| /// The factory receives the partition spec ID (to look up the spec) and the manifest | ||
| /// content type, and returns a new ManifestWriter ready for writing. The caller | ||
| /// (i.e. MergingSnapshotUpdate in PR2) captures metadata, FileIO, and snapshot ID |
There was a problem hiding this comment.
It seems MergingSnapshotUpdate isn't in the codebase. I don't mind you put it here if you plan to add it in a follow-up PR, but the PR2 is very confusing.
| PartitionSet drop_partitions_; | ||
| bool fail_missing_delete_paths_{false}; | ||
|
|
||
| // Cache: (spec_id, expr_index) → ManifestEvaluator |
There was a problem hiding this comment.
I had previously suggested in another PR that we stick to ASCII characters in comments. However, after searching the codebase, it looks like this style is already in use, so I'm fine with it now.
| /// \brief Write a merged manifest from all manifests in a bin. | ||
| /// | ||
| /// Entries are written snapshot-aware: | ||
| /// - ADDED from \p snapshot_id → WriteAddedEntry (preserve status) |
| /// \brief Merge a group of manifests sharing the same spec_id. | ||
| /// | ||
| /// \param first The overall first (newest) manifest across all groups, used to | ||
| /// apply the minCountToMerge threshold on the bin that contains it. |
There was a problem hiding this comment.
nit, min_count_to_merge is a documented param, so I think we should use that. Other places using minCountToMerge are fine because those are not in document.
| /// apply the minCountToMerge threshold on the bin that contains it. | |
| /// apply the min_count_to_merge threshold on the bin that contains it. |
Implement two manifest management classes for table write operations:
ManifestFilterManager: filters manifest entries by row filter expression, file path, or partition value; supports FailMissingDeletePaths validation. Rewrites manifests that contain matching files, marking entries as DELETED; passes through manifests that cannot contain matching files unchanged.
ManifestMergeManager: merges small manifests using greedy bin-packing, grouping by partition_spec_id (manifests with different specs are never merged). Oversized manifests pass through unchanged. ADDED entries from prior manifests become EXISTING when merged (matching Java semantics).