Skip to content

feat(manifest): add ManifestFilterManager and ManifestMergeManager#652

Open
gty404 wants to merge 1 commit into
apache:mainfrom
gty404:manifest-manager
Open

feat(manifest): add ManifestFilterManager and ManifestMergeManager#652
gty404 wants to merge 1 commit into
apache:mainfrom
gty404:manifest-manager

Conversation

@gty404
Copy link
Copy Markdown
Contributor

@gty404 gty404 commented May 12, 2026

Implement two manifest management classes for table write operations:

  • ManifestFilterManager: filters manifest entries by row filter expression, file path, or partition value; supports FailMissingDeletePaths validation. Rewrites manifests that contain matching files, marking entries as DELETED; passes through manifests that cannot contain matching files unchanged.

  • ManifestMergeManager: merges small manifests using greedy bin-packing, grouping by partition_spec_id (manifests with different specs are never merged). Oversized manifests pass through unchanged. ADDED entries from prior manifests become EXISTING when merged (matching Java semantics).

Implement two manifest management classes for table write operations:

- ManifestFilterManager: filters manifest entries by row filter expression,
  file path, or partition value; supports FailMissingDeletePaths validation.
  Rewrites manifests that contain matching files, marking entries as DELETED;
  passes through manifests that cannot contain matching files unchanged.

- ManifestMergeManager: merges small manifests using greedy bin-packing,
  grouping by partition_spec_id (manifests with different specs are never
  merged). Oversized manifests pass through unchanged. ADDED entries from
  prior manifests become EXISTING when merged (matching Java semantics).
@gty404 gty404 force-pushed the manifest-manager branch from 4a08049 to f360476 Compare May 12, 2026 09:53
Comment thread src/iceberg/manifest/manifest_filter_manager.cc
Comment thread src/iceberg/manifest/manifest_filter_manager.cc
Comment thread src/iceberg/manifest/manifest_filter_manager.cc
Comment thread src/iceberg/manifest/manifest_merge_manager.cc
std::vector<ManifestFile> result;

for (const auto& manifest : group) {
if (manifest.manifest_length >= target_size_bytes_) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java uses ListPacker.packEnd(group, ManifestFile::length) with lookback 1 specifically to preserve manifest order and leave the under-filled bin at the beginning. This forward hand-rolled packing keeps the current small bin open across an oversized manifest, so a sequence like [small, oversized, small] can merge the two small manifests across the oversized one and reorder output relative to Java.

Comment thread src/iceberg/manifest/manifest_merge_manager.cc
Comment thread src/iceberg/manifest/manifest_merge_manager.cc
Comment thread src/iceberg/manifest/manifest_filter_manager.cc
/// entries are rewritten with those entries marked DELETED.
///
/// The manager is content-agnostic: pass ManifestContent::kData to process data
/// manifests, or ManifestContent::kDeletes to process delete manifests.
Copy link
Copy Markdown
Member

@wgtmac wgtmac May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this manager is intended to cover delete manifests, it is still missing the Java DeleteFileFilterManager cleanup semantics. Java sets dropDeleteFilesOlderThan(minDataSequenceNumber) and removeDanglingDeletesFor(filesToBeDeleted), then marks old delete files and dangling DVs as deleted while rewriting delete manifests. This implementation only applies path/partition/row-filter deletes, so a data-file delete or truncate can leave obsolete delete files/DVs live in the delete manifests, diverging from Java behavior and its DV cleanup tests. If delete-manifest cleanup is planned in a later layer, it would be good to avoid advertising kDeletes parity here or make the missing behavior explicit.

const ManifestFile& first = all[0];

// Group manifests by partition_spec_id — never merge across specs
std::map<int32_t, std::vector<ManifestFile>> by_spec;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we keep data and delete manifests type-separated in this API? Java has distinct DataFileMergeManager and DeleteFileMergeManager, and MergingSnapshotProducer passes dataManifests/deleteManifests separately. Here the grouping key is only partition_spec_id, while FlushBin creates the writer from the first manifest content; if a caller accidentally gives a data and delete manifest with the same spec in one call, the bin can mix contents and fail mid-rewrite, a state Java typed managers make impossible. Either include content in the grouping/constructor or make the manager content-specific, and add a mixed-content / kDeletes merge test.

/// \param file_io File IO used to open existing manifests for reading
/// \param writer_factory Factory to create new ManifestWriter instances
/// \return The merged manifest list, or an error
Result<std::vector<ManifestFile>> MergeManifests(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this manager expose the lifecycle state Java relies on for retries? Java ManifestMergeManager caches merged manifests, reports replacedManifestsCount(), and has cleanUncommitted(committed) so MergingSnapshotProducer can delete manifests produced by failed retry attempts and build created/kept/replaced summary accurately. This API only returns the final vector, so the later snapshot update layer has no way to distinguish committed vs uncommitted merged outputs or count replaced manifests consistently with Java tests such as transaction retry merge cleanup. The same lifecycle surface is also needed on the filter side for rewritten manifests.

Comment thread src/iceberg/manifest/manifest_filter_manager.h
@wgtmac wgtmac requested a review from zhjwpku May 13, 2026 07:55
///
/// The factory receives the partition spec ID (to look up the spec) and the manifest
/// content type, and returns a new ManifestWriter ready for writing. The caller
/// (i.e. MergingSnapshotUpdate in PR2) captures metadata, FileIO, and snapshot ID
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems MergingSnapshotUpdate isn't in the codebase. I don't mind you put it here if you plan to add it in a follow-up PR, but the PR2 is very confusing.

PartitionSet drop_partitions_;
bool fail_missing_delete_paths_{false};

// Cache: (spec_id, expr_index) → ManifestEvaluator
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had previously suggested in another PR that we stick to ASCII characters in comments. However, after searching the codebase, it looks like this style is already in use, so I'm fine with it now.

Comment thread src/iceberg/manifest/manifest_filter_manager.h
/// \brief Write a merged manifest from all manifests in a bin.
///
/// Entries are written snapshot-aware:
/// - ADDED from \p snapshot_id → WriteAddedEntry (preserve status)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the \p?

/// \brief Merge a group of manifests sharing the same spec_id.
///
/// \param first The overall first (newest) manifest across all groups, used to
/// apply the minCountToMerge threshold on the bin that contains it.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, min_count_to_merge is a documented param, so I think we should use that. Other places using minCountToMerge are fine because those are not in document.

Suggested change
/// apply the minCountToMerge threshold on the bin that contains it.
/// apply the min_count_to_merge threshold on the bin that contains it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants