fix(webui): improve objects list performance for branches with uncommitted deletes #10241

Closed
caoergou wants to merge 3 commits into treeverse:master from caoergou:feat-10008-webui-performance

Conversation


@caoergou caoergou commented Mar 7, 2026

Problem

When displaying objects on a branch in the WebUI, the page loads slowly when there are many uncommitted deletes. This is because both objects.list(branch) and refs.changes(branch) scan the same staging area entries.

Solution

Use the committed-only ref (branch@) when listing objects for branches. The @ modifier tells lakeFS to only return committed data, skipping the staging area scan entirely.

Changes

  • Modified objects.jsx to use reference.id + '@' for branch references when listing objects
  • Added reference.type to the dependency array
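
The change described above can be sketched as follows. This is a minimal illustration, not the actual objects.jsx hook: the names `reference.id`, `reference.type`, and `RefTypeBranch` come from the PR discussion, while the wrapper function `listRefFor` and the constant's value are assumed for the example.

```typescript
// Hypothetical helper illustrating the PR's change: for branches, append
// '@' to get the committed-only ref; commits and tags pass through as-is.
const RefTypeBranch = "branch"; // assumed constant value

interface Reference {
  id: string;
  type: string;
}

function listRefFor(reference: Reference): string {
  // '@' tells lakeFS to return only committed data, skipping the staging scan
  return reference.type === RefTypeBranch ? reference.id + "@" : reference.id;
}
```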

Performance Impact

For branches with many uncommitted changes:

  • Before: Scans staging area twice (objects.list + refs.changes)
  • After: Scans staging area once (only refs.changes)

Testing

  • ESLint passes
  • All existing tests pass (72 tests)

Fixes #10008

fix(webui): improve objects list performance for branches with uncommitted deletes

When displaying objects on a branch, use the committed-only ref (branch@)
instead of the full branch ref to list objects. This avoids scanning the
staging area twice - once for objects.list and once for refs.changes.

For branches with many uncommitted deletes, this significantly improves
page load performance.

The '@' modifier tells lakeFS to only return committed data, skipping
the staging area scan entirely.

Fixes treeverse#10008
@github-actions github-actions Bot added the area/UI Improvements or additions to UI label Mar 7, 2026
@arielshaqed arielshaqed requested review from a team and UdiBen March 7, 2026 15:35

@UdiBen UdiBen left a comment

@caoergou Thanks for the contribution!
I added one comment - LMK if it's unclear

// Use committed-only ref (branch@) for branches to avoid scanning staging area twice
// This significantly improves performance when there are many uncommitted deletes
const listRef = reference.type === RefTypeBranch ? reference.id + '@' : reference.id;
return objects.list(repo.id, listRef, path, after, config.pre_sign_support_ui);

@UdiBen UdiBen Mar 11, 2026

Using branch@ for objects.list means uncommitted additions no longer appear in the objects list.
mergeResults only injects missing removed entries and common_prefix entries into the merged view (line 53 of mergeResults.ts) — it doesn't inject added entries that are absent from the committed listing.

Repro: upload a new file to a branch without committing → it won't show in the objects tab.
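
The failure mode the reviewer describes can be reduced to a small sketch. This is a simplified stand-in for the pre-fix filter, assuming flattened shapes for `Entry` and `Change`; the real mergeResults.ts contains more logic than shown here.

```typescript
// Simplified sketch of the pre-fix injection filter in mergeResults:
// only 'removed' changes and common_prefix entries are injected, so an
// uncommitted 'added' change never reaches the merged view.
interface Entry { path: string; path_type: string; }
interface Change { path: string; type: string; path_type: string; }

function missingItems(results: Entry[], changes: Change[]): Change[] {
  const seen = new Set(results.map((r) => r.path));
  return changes
    .filter((c) => c.type === "removed" || c.path_type === "common_prefix")
    .filter((c) => !seen.has(c.path));
}

// A staged upload is an 'added' change absent from the committed listing:
const committed: Entry[] = [{ path: "lakes.parquet", path_type: "object" }];
const staged: Change[] = [
  { path: "new-upload.txt", type: "added", path_type: "object" },
];
// missingItems(committed, staged) → [] — the uploaded file disappears
```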

…listing

When using committed-only ref (branch@) for objects.list to improve
performance, uncommitted added files are not returned by the API.
This fix updates mergeResults to also include 'added' type changes
in the missingItems filter, ensuring newly uploaded files are shown
in the objects list.

Fixes the issue reported by @UdiBen: uncommitted additions no longer
appear in the objects list when using branch@ for performance.
@caoergou

Hi @UdiBen,

Thank you for the detailed feedback! I understand the issue now.

Problem: Using branch@ for objects.list means uncommitted additions are not returned, and mergeResults only injects removed entries and common_prefix entries into the merged view.

Fix: I've updated mergeResults.ts to also include added type changes in the missingItems filter. This ensures that newly uploaded (but uncommitted) files are added to the objects list when using committed-only ref.

Change:

// Before
.filter((change) => change.type === 'removed' || change.path_type === 'common_prefix')

// After
.filter((change) => change.type === 'removed' || change.type === 'added' || change.path_type === 'common_prefix')

This approach:

  1. Keeps the performance improvement (using branch@ to avoid scanning staging area twice)
  2. Ensures uncommitted added files are shown in the objects list
  3. Maintains the existing behavior for removed and changed entries

Please let me know if this addresses your concern or if there are other issues to address.
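
Applied to a sample diff, the updated filter behaves as follows. This is a self-contained sketch using assumed sample data and flattened `Change` shapes, not the real mergeResults.ts module.

```typescript
// Sketch of the post-fix filter: 'added' changes now pass alongside
// 'removed' entries and common prefixes, so staged uploads get injected.
interface Change { path: string; type: string; path_type: string; }

const changes: Change[] = [
  { path: "deleted.csv", type: "removed", path_type: "object" },
  { path: "new-upload.txt", type: "added", path_type: "object" },
  { path: "edited.json", type: "changed", path_type: "object" },
];

const injected = changes.filter(
  (c) =>
    c.type === "removed" || c.type === "added" || c.path_type === "common_prefix"
);
// injected paths: ["deleted.csv", "new-upload.txt"] — 'changed' is still skipped
```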


UdiBen commented Mar 12, 2026

@caoergou the fix looks good!
2 failing checks:

  • prettier fails (you can run npm run format from webui)
  • E2E test fails. If you need assistance with the fix or running E2E locally LMK

When using committed-only ref (branch@) for objects.list, newly uploaded
(staged) files may be alphabetically after the last committed file and
were incorrectly excluded from the objects list.

Root cause: the lastResultPath filter in mergeResults applied to 'added'
items, but staged files can sort after committed files (e.g. 'test-upload.txt'
after 'lakes.parquet'). changesData is already paginated via the 'after'
cursor, so 'added' items do not need this extra restriction.

Fix: exempt 'added' items from the lastResultPath restriction. The filter
now only applies to 'removed' entries and common_prefix entries.

Also apply prettier formatting to objects.jsx.
@caoergou

Hi @UdiBen,

I've addressed both failing checks:

1. Prettier formatting ✅

Ran npm run format from webui/ — reformatted mergeResults.ts and objects.jsx.

2. E2E test failures ✅

Root cause: The lastResultPath filter in mergeResults.ts was incorrectly applied to added items (staged files not yet committed).

When using branch@ for objects.list, staged files don't appear in the committed listing. They are supposed to be injected via mergeResults from the diff. However, staged files may sort after the last committed file — for example, test-upload.txt sorts after lakes.parquet from the sample data, so the filter:

.filter((change) => lastResultPath && change.path <= lastResultPath)

was excluding them entirely, causing the E2E tests to fail.

Fix: Exempt added items from the lastResultPath restriction. The changesData is already paginated using the same after cursor as objects.list, so added items won't overflow page boundaries.

// Before
.filter((change) => lastResultPath && change.path <= lastResultPath)

// After
.filter((change) => change.type === 'added' || (lastResultPath && change.path <= lastResultPath))

All 23 existing unit tests pass.
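
The exemption can be sketched in isolation. This is a simplified stand-in for the real filter in mergeResults.ts, with an assumed helper name (`withinPage`) and sample data; the example paths mirror the ones from the E2E failure.

```typescript
// Sketch of the lastResultPath exemption: staged files can sort after the
// last committed path on the page, so 'added' items bypass the cutoff
// while 'removed' entries remain bounded by it.
interface Change { path: string; type: string; path_type: string; }

function withinPage(changes: Change[], lastResultPath: string): Change[] {
  return changes.filter(
    (c) => c.type === "added" || (Boolean(lastResultPath) && c.path <= lastResultPath)
  );
}

const changes: Change[] = [
  { path: "a-removed.txt", type: "removed", path_type: "object" },
  // sorts after the cutoff, previously dropped:
  { path: "test-upload.txt", type: "added", path_type: "object" },
];
// With cutoff "lakes.parquet": the removed entry stays (path <= cutoff)
// and the added entry is exempt instead of being excluded.
```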

@caoergou caoergou requested a review from UdiBen March 20, 2026 03:13

UdiBen commented Mar 25, 2026

@caoergou
Thanks for tackling this, I really appreciate the effort - the performance issue with many uncommitted deletes is real and worth solving.

My concern is with the approach of splitting the listing into two independently-paginated sources (committed objects via branch@ + changes via diff) and merging them client-side. These two streams have different page boundaries, and mergeResults only sees one page at a time with no cross-page state. This makes it fundamentally hard to get right.

For example, here's a test case that is failing with the fix:

  describe('pagination: added entries must not duplicate across pages', () => {
      it('does not inject the same added entry on two consecutive pages', () => {
          // Simulates two consecutive pages of committed-only listing (branch@).
          // An added file "data/file-49999-added" appears in changesData for BOTH pages
          // because the changes cursor overlaps at page boundaries.
          // mergeResults must not return it on both pages — that causes duplicates in the UI.

          // --- Page N: committed objects end at "data/file-49958" ---
          const pageNResults: Entry[] = [
              { path: 'data/file-49869', path_type: 'object' },
              { path: 'data/file-49900', path_type: 'object' },
              { path: 'data/file-49958', path_type: 'object' },
          ];
          const pageNChanges: ChangesData = {
              results: [
                  { path: 'data/file-49869', type: 'removed', path_type: 'object' },
                  { path: 'data/file-49900', type: 'removed', path_type: 'object' },
                  { path: 'data/file-49958', type: 'removed', path_type: 'object' },
                  { path: 'data/file-49999-added', type: 'added', path_type: 'object' },
              ],
          };

          const mergedPageN = mergeResults(pageNResults, pageNChanges, false);

          // --- Page N+1: committed objects start after "data/file-49958" ---
          const pageN1Results: Entry[] = [
              { path: 'data/file-49959', path_type: 'object' },
              { path: 'data/file-50000', path_type: 'object' },
              { path: 'data/file-5047', path_type: 'object' },
          ];
          const pageN1Changes: ChangesData = {
              results: [
                  { path: 'data/file-49959', type: 'removed', path_type: 'object' },
                  { path: 'data/file-49999-added', type: 'added', path_type: 'object' },
                  { path: 'data/file-50000', type: 'removed', path_type: 'object' },
                  { path: 'data/file-5047', type: 'removed', path_type: 'object' },
              ],
          };

          const mergedPageN1 = mergeResults(pageN1Results, pageN1Changes, false);

          // Count how many times the added file appears across both pages
          const pageNCount = mergedPageN.filter((r) => r.path === 'data/file-49999-added').length;
          const pageN1Count = mergedPageN1.filter((r) => r.path === 'data/file-49999-added').length;
          const totalAppearances = pageNCount + pageN1Count;

          // The file must appear AT MOST once across both pages (0 is acceptable if the
          // implementation doesn't inject added entries; >1 means duplication bug)
          expect(totalAppearances).toBeLessThanOrEqual(1);
      });
  });

My suggestion is to add a few tests that cover the altered functionality:

  • mergeResults with added entry injection (added file between two committed files on same page / added file on last page after all committed objects etc.)
  • Pagination edge cases (Single committed file + many added files / No committed objects but added files exist etc.)

Once those are covered we'll have better confidence in merging.

@caoergou

@UdiBen After deeper investigation, I need to revise my earlier proposal.

My "skip tombstones" suggestion was incorrect. Tombstones (removed entries) are essential for FilterTombstoneIterator to correctly exclude deleted files from the merged listing. Without them, committed-but-deleted files would incorrectly appear.

The real bottleneck is in listStagingAreaWithoutCompaction():

  • It scans the entire staging area regardless of pagination
  • Even if the API only returns 100 results, all staging entries are processed
  • Pagination is applied after the merge, not during staging scan

Why client-side merge doesn't work well:

  • Even if we only inject added entries, pagination boundaries between objects.list and refs.changes are misaligned
  • This causes duplicates or missing entries across pages

Potential backend optimization directions (need further investigation):

  1. Push down pagination cursor to staging scan (only scan path > after)
  2. Lazy merging similar to merge-sort iteration
  3. Optimize staging storage for range queries

However, these are significant architectural changes that require discussion.
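
Direction 2 above ("lazy merging") can be illustrated conceptually. The lakeFS backend is Go, so this TypeScript generator is only a sketch of the idea, not lakeFS code: merge two sorted listings and stop as soon as a page fills, so the staging side is never consumed past the page boundary.

```typescript
// Conceptual sketch of lazy merge-sort iteration over two sorted listings.
// In a real backend the 'staged' side would be a cursor that stops being
// read once the page is full, avoiding a full staging-area scan.
function* lazyMerge(
  committed: string[],
  staged: string[],
  pageSize: number
): Generator<string> {
  let i = 0, j = 0, emitted = 0;
  while (emitted < pageSize && (i < committed.length || j < staged.length)) {
    if (j >= staged.length || (i < committed.length && committed[i] <= staged[j])) {
      const p = committed[i++];
      if (j < staged.length && staged[j] === p) j++; // same path on both sides: emit once
      yield p;
    } else {
      yield staged[j++]; // staged-only path (e.g. an uncommitted add)
    }
    emitted++;
  }
}
```

The early exit on `emitted < pageSize` is the key property: pagination is enforced during the merge rather than after it.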

My recommendation:

  • This PR addresses a real pain point, but the client-side merge approach has fundamental limitations
  • A proper solution likely requires backend optimization
  • I suggest we pause this PR and open a discussion issue to gather more input from maintainers

I'm happy to help investigate the backend optimization options if that's the desired direction.

What do you think?


Labels

area/UI Improvements or additions to UI contributor


Development

Successfully merging this pull request may close these issues.

webui can cause lakeFS to go over uncommitted deletes twice

3 participants