[index] Reuse incremental global index planning #8407

Merged: JingsongLi merged 4 commits into apache:master from JingsongLi:codex/global-index-incremental-planner on Jul 1, 2026

Conversation

@JingsongLi

Contributor

Summary

This PR makes global index build incremental by default across Java/Flink/Spark/Python and extracts shared planning logic for reusable row-range/shard handling.

Changes

  • Add reusable core planning utilities to compute indexed/unindexed row ranges and shard indexed splits.
  • Reuse the planner in Flink and Spark generic index topologies, while sorted index builders use the shared incremental scan.
  • Add a Python globalindex.build_plan module and route Python index builders through the same planning model.
  • Extend core, Flink, Spark, and Python tests for incremental/unindexed range behavior.
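The planning model described in the list above can be sketched roughly as follows. This is only an illustration, not the code added by this PR: the function names `unindexed_ranges` and `shard_ranges`, the inclusive `(start, end)` tuple representation, and the fixed rows-per-shard splitting are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: an inclusive (start, end) pair of row ids.
RowRange = Tuple[int, int]


def unindexed_ranges(data: List[RowRange], indexed: List[RowRange]) -> List[RowRange]:
    """Return the parts of `data` that no range in `indexed` covers."""
    covered = sorted(indexed)
    result: List[RowRange] = []
    for start, end in sorted(data):
        cursor = start  # first row id in this data range not yet accounted for
        for i_start, i_end in covered:
            if i_end < cursor:
                continue  # indexed range lies entirely before the cursor
            if i_start > end:
                break  # indexed range lies entirely after this data range
            if i_start > cursor:
                result.append((cursor, i_start - 1))  # gap before the indexed range
            cursor = max(cursor, i_end + 1)
            if cursor > end:
                break
        if cursor <= end:
            result.append((cursor, end))  # tail after the last indexed range
    return result


def shard_ranges(ranges: List[RowRange], rows_per_shard: int) -> List[RowRange]:
    """Split each range into consecutive shards of at most `rows_per_shard` rows."""
    if rows_per_shard <= 0:
        raise ValueError("rows_per_shard must be positive")
    shards: List[RowRange] = []
    for start, end in ranges:
        cursor = start
        while cursor <= end:
            shard_end = min(cursor + rows_per_shard - 1, end)
            shards.append((cursor, shard_end))
            cursor = shard_end + 1
    return shards
```

With 300 rows of data and existing index coverage for rows 0 to 99 and 150 to 199, `unindexed_ranges([(0, 299)], [(0, 99), (150, 199)])` yields the two gaps that an incremental build would still need to process, and `shard_ranges` then cuts those gaps into independent build tasks.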

Testing

  • python -m pytest paimon-python/pypaimon/tests/global_index_build_test.py
  • mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=GlobalIndexBuilderUtilsTest,SortedGlobalIndexBuilderTest test
  • mvn -pl paimon-flink/paimon-flink-common -am -Pfast-build -DfailIfNoTests=false -Dtest=SortedIndexTopoBuilderTest,GenericIndexTopoBuilderTest test
  • mvn -pl paimon-spark/paimon-spark-common -am -Pfast-build -DfailIfNoTests=false -Dtest=CreateGlobalIndexProcedureTest test

Notes

The default/procedure build path now skips already indexed data and only builds unindexed row ranges. Explicit max-indexed-row-id entry points keep their explicit semantics.

@leaves12138 left a comment

Contributor

Thanks for the PR. The global-index planning changes look directionally good, but Python lint currently fails and blocks this PR.

paimon-python/pypaimon/globalindex/create_global_index.py imports three helpers that are no longer used:

  • _calc_row_range
  • _indexed_split_for_row_range
  • _split_one_by_contiguous_row_range

This reproduces locally as flake8 F401:

pypaimon/globalindex/create_global_index.py:36:1: F401 'pypaimon.globalindex.build_plan.calc_row_range as _calc_row_range' imported but unused
pypaimon/globalindex/create_global_index.py:36:1: F401 'pypaimon.globalindex.build_plan.indexed_split_for_row_range as _indexed_split_for_row_range' imported but unused
pypaimon/globalindex/create_global_index.py:36:1: F401 'pypaimon.globalindex.build_plan.split_one_by_contiguous_row_range as _split_one_by_contiguous_row_range' imported but unused

Please remove the unused imports and rerun Python lint.
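For readers unfamiliar with F401, the condition it reports (a name bound by an import and never referenced) can be approximated with the standard-library `ast` module. The sketch below is a deliberately simplified stand-in, not how flake8 or pyflakes is implemented: it ignores scopes, `__all__`, and string references, and the helper name `unused_imports` and the imported name `plan_shards` in the usage note are hypothetical.

```python
import ast
from typing import List


def unused_imports(source: str) -> List[str]:
    """Return imported names that are never referenced in `source`.

    Simplified approximation of the F401 condition; the source is only
    parsed, never executed or imported.
    """
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds `a`; `import x as y` binds `y`.
                imported.add(alias.asname or alias.name.split(".")[0])
    # Every bare identifier that appears anywhere in the module.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)
```

Running it on a module that aliases a helper without using it, for example `unused_imports("from pypaimon.globalindex.build_plan import calc_row_range as _calc_row_range, plan_shards\nplan = plan_shards()\n")`, flags only the alias `_calc_row_range`, which mirrors the situation reported above.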

Local verification performed:

  • git diff --cached --check passed
  • python -m py_compile paimon-python/pypaimon/globalindex/build_plan.py paimon-python/pypaimon/globalindex/create_global_index.py passed
  • PYTHONPATH=$PWD/paimon-python python -m pytest paimon-python/pypaimon/tests/global_index_build_test.py -q passed: 24 tests
  • mvn -pl paimon-flink/paimon-flink-common -am -Pfast-build -DfailIfNoTests=false -Dtest=GenericIndexTopoBuilderTest,SortedIndexTopoBuilderTest test passed: 33 tests
  • mvn -pl paimon-spark/paimon-spark-common -am -Pfast-build -DfailIfNoTests=false -Dtest=CreateGlobalIndexProcedureTest test passed: 11 JUnit tests plus Spark ScalaTest

I also tried the core target; GlobalIndexBuilderUtilsTest passed, but the existing SortedGlobalIndexBuilderTest path failed in my local environment due to CodeGenerator plugin discovery (Found 0 classes implementing org.apache.paimon.codegen.CodeGenerator), so I did not treat that as a PR logic failure.

@JingsongLi

Contributor Author

Addressed the Python lint review comments.

Changes:

  • Removed the unused planner helper imports from create_global_index.py.
  • Updated planner tests to import those helpers directly from build_plan.py.

Verification:

  • python -m py_compile paimon-python/pypaimon/globalindex/build_plan.py paimon-python/pypaimon/globalindex/create_global_index.py paimon-python/pypaimon/tests/global_index_build_test.py
  • PYTHONPATH=$PWD/paimon-python python -m pytest paimon-python/pypaimon/tests/global_index_build_test.py -q
  • /tmp/paimon-flake8-venv/bin/python -m flake8 --config=paimon-python/dev/cfg.ini paimon-python/pypaimon/globalindex/create_global_index.py paimon-python/pypaimon/globalindex/build_plan.py paimon-python/pypaimon/tests/global_index_build_test.py
  • git diff --check

@leaves12138 left a comment

Contributor

LGTM. Thanks for the update.

I rechecked the latest commit and the previous Python lint blocker is fixed. The unused imports were removed from create_global_index.py, and the tests now import the planner helpers from build_plan.py directly.

Latest local verification:

  • git diff --cached --check
  • python -m py_compile paimon-python/pypaimon/globalindex/build_plan.py paimon-python/pypaimon/globalindex/create_global_index.py
  • python -m flake8 --config=./dev/cfg.ini pypaimon/globalindex/build_plan.py pypaimon/globalindex/create_global_index.py pypaimon/tests/global_index_build_test.py
  • PYTHONPATH=$PWD/paimon-python python -m pytest paimon-python/pypaimon/tests/global_index_build_test.py -q (24 passed)

The Java/Flink/Spark planning code is unchanged from my previous pass, where the focused Flink and Spark tests passed locally.

@leaves12138 left a comment

Contributor

I found one remaining blocker from CI after approving: the Java spotless check is failing.

The failing job reports formatting violations in paimon-core/src/test/java/org/apache/paimon/globalindex/GlobalIndexBuilderUtilsTest.java:

Failed to execute goal com.diffplug.spotless:spotless-maven-plugin:2.13.0:check ...
The following files had format violations:
    src/test/java/org/apache/paimon/globalindex/GlobalIndexBuilderUtilsTest.java
Run 'mvn spotless:apply' to fix these violations.

The suggested changes are only formatting, for example collapsing these wrapped statements:

List<ManifestEntry> entries = Arrays.asList(createEntry(0L, 100), createEntry(100L, 100));
assertThat(splits.get(0).dataSplit().dataFiles()).containsExactly(entries.get(0).file());

The Python lint issue from my previous review is fixed locally:

  • changed-file flake8 passed
  • global_index_build_test.py passed: 24 tests

Please run spotless or apply the formatting fix, then I can re-approve.

@JingsongLi

Contributor Author

Fixed the CI failure caused by Spotless formatting in GlobalIndexBuilderUtilsTest.

Root cause:

  • CI runs Java builds without -Pfast-build, so Spotless checked formatting and failed before Python lint/tests could run.

Verification:

  • mvn -pl paimon-core -am -DfailIfNoTests=false -Dtest=GlobalIndexBuilderUtilsTest,SortedGlobalIndexBuilderTest test
  • /tmp/paimon-flake8-venv/bin/python -m flake8 --config=paimon-python/dev/cfg.ini paimon-python/pypaimon/globalindex/create_global_index.py paimon-python/pypaimon/globalindex/build_plan.py paimon-python/pypaimon/tests/global_index_build_test.py
  • PYTHONPATH=$PWD/paimon-python python -m pytest paimon-python/pypaimon/tests/global_index_build_test.py -q
  • git diff --check

@leaves12138 left a comment

Contributor

Thanks for fixing the core test formatting. I rechecked the latest head and found one remaining formatting blocker from CI.

A Flink build_test job is now failing spotless in paimon-flink/paimon-flink-common/src/test/java/org/apache/paimon/flink/globalindex/GenericIndexTopoBuilderTest.java. The reported violations are formatting-only, mostly wrapped computeShardTasks assignments that spotless wants on one line, for example:

List<IndexedSplit> tasks = GenericIndexTopoBuilder.computeShardTasks(table, entries, 100);

CI reports multiple similar locations and ends with:

The following files had format violations:
    src/test/java/org/apache/paimon/flink/globalindex/GenericIndexTopoBuilderTest.java
Run 'mvn spotless:apply' to fix these violations.

What I verified locally on the latest head:

  • git diff --cached --check passed
  • mvn -pl paimon-core -am -DskipTests -DfailIfNoTests=false spotless:check passed, so the previous core spotless issue is fixed
  • changed-file Python flake8 passed

Please run/apply spotless for the Flink test file too, then I can re-approve.

@leaves12138 left a comment

Contributor

LGTM. Thanks for the update.

I rechecked the latest head and the remaining formatting blockers are fixed. The core and Flink spotless issues from CI are addressed, and the Python lint fix remains clean.

Latest local verification:

  • git diff --cached --check
  • mvn -pl paimon-core,paimon-flink/paimon-flink-common -am -DskipTests -DfailIfNoTests=false spotless:check
  • python -m flake8 --config=./dev/cfg.ini pypaimon/globalindex/build_plan.py pypaimon/globalindex/create_global_index.py pypaimon/tests/global_index_build_test.py

Earlier focused validation on the same implementation:

  • PYTHONPATH=$PWD/paimon-python python -m pytest paimon-python/pypaimon/tests/global_index_build_test.py -q (24 passed)
  • mvn -pl paimon-flink/paimon-flink-common -am -Pfast-build -DfailIfNoTests=false -Dtest=GenericIndexTopoBuilderTest,SortedIndexTopoBuilderTest test (33 passed)
  • mvn -pl paimon-spark/paimon-spark-common -am -Pfast-build -DfailIfNoTests=false -Dtest=CreateGlobalIndexProcedureTest test

@JingsongLi merged commit ec79aac into apache:master on Jul 1, 2026
17 checks passed