[index] Reuse incremental global index planning#8407
Conversation
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the PR. The global-index planning changes look directionally good, but Python lint currently fails and blocks this PR.
paimon-python/pypaimon/globalindex/create_global_index.py imports three helpers that are no longer used:
_calc_row_range_indexed_split_for_row_range_split_one_by_contiguous_row_range
This reproduces locally as flake8 F401:
pypaimon/globalindex/create_global_index.py:36:1: F401 'pypaimon.globalindex.build_plan.calc_row_range as _calc_row_range' imported but unused
pypaimon/globalindex/create_global_index.py:36:1: F401 'pypaimon.globalindex.build_plan.indexed_split_for_row_range as _indexed_split_for_row_range' imported but unused
pypaimon/globalindex/create_global_index.py:36:1: F401 'pypaimon.globalindex.build_plan.split_one_by_contiguous_row_range as _split_one_by_contiguous_row_range' imported but unused
Please remove the unused imports and rerun Python lint.
Local verification performed:
git diff --cached --checkpassedpython -m py_compile paimon-python/pypaimon/globalindex/build_plan.py paimon-python/pypaimon/globalindex/create_global_index.pypassedPYTHONPATH=$PWD/paimon-python python -m pytest paimon-python/pypaimon/tests/global_index_build_test.py -qpassed: 24 testsmvn -pl paimon-flink/paimon-flink-common -am -Pfast-build -DfailIfNoTests=false -Dtest=GenericIndexTopoBuilderTest,SortedIndexTopoBuilderTest testpassed: 33 testsmvn -pl paimon-spark/paimon-spark-common -am -Pfast-build -DfailIfNoTests=false -Dtest=CreateGlobalIndexProcedureTest testpassed: 11 JUnit tests plus Spark ScalaTest
I also tried the core target; GlobalIndexBuilderUtilsTest passed, but the existing SortedGlobalIndexBuilderTest path failed in my local environment due to CodeGenerator plugin discovery (Found 0 classes implementing org.apache.paimon.codegen.CodeGenerator), so I did not treat that as a PR logic failure.
|
Addressed the Python lint review comments. Changes:
Verification:
|
leaves12138
left a comment
There was a problem hiding this comment.
LGTM. Thanks for the update.
I rechecked the latest commit and the previous Python lint blocker is fixed. The unused imports were removed from create_global_index.py, and the tests now import the planner helpers from build_plan.py directly.
Latest local verification:
git diff --cached --checkpython -m py_compile paimon-python/pypaimon/globalindex/build_plan.py paimon-python/pypaimon/globalindex/create_global_index.pypython -m flake8 --config=./dev/cfg.ini pypaimon/globalindex/build_plan.py pypaimon/globalindex/create_global_index.py pypaimon/tests/global_index_build_test.pyPYTHONPATH=$PWD/paimon-python python -m pytest paimon-python/pypaimon/tests/global_index_build_test.py -q(24 passed)
The Java/Flink/Spark planning code is unchanged from my previous pass, where the focused Flink and Spark tests passed locally.
leaves12138
left a comment
There was a problem hiding this comment.
I found one remaining blocker from CI after approving: the Java spotless check is failing.
The failing job reports formatting violations in paimon-core/src/test/java/org/apache/paimon/globalindex/GlobalIndexBuilderUtilsTest.java:
Failed to execute goal com.diffplug.spotless:spotless-maven-plugin:2.13.0:check ...
The following files had format violations:
src/test/java/org/apache/paimon/globalindex/GlobalIndexBuilderUtilsTest.java
Run 'mvn spotless:apply' to fix these violations.
The suggested changes are only formatting, for example collapsing these wrapped statements:
List<ManifestEntry> entries = Arrays.asList(createEntry(0L, 100), createEntry(100L, 100));
assertThat(splits.get(0).dataSplit().dataFiles()).containsExactly(entries.get(0).file());
The Python lint issue from my previous review is fixed locally:
- changed-file flake8 passed
global_index_build_test.pypassed: 24 tests
Please run spotless or apply the formatting fix, then I can re-approve.
|
Fixed the CI failure caused by Spotless formatting in Root cause:
Verification:
|
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for fixing the core test formatting. I rechecked the latest head and found one remaining formatting blocker from CI.
A Flink build_test job is now failing spotless in paimon-flink/paimon-flink-common/src/test/java/org/apache/paimon/flink/globalindex/GenericIndexTopoBuilderTest.java. The reported violations are formatting-only, mostly wrapped computeShardTasks assignments that spotless wants on one line, for example:
List<IndexedSplit> tasks = GenericIndexTopoBuilder.computeShardTasks(table, entries, 100);
CI reports multiple similar locations and ends with:
The following files had format violations:
src/test/java/org/apache/paimon/flink/globalindex/GenericIndexTopoBuilderTest.java
Run 'mvn spotless:apply' to fix these violations.
What I verified locally on the latest head:
git diff --cached --checkpassedmvn -pl paimon-core -am -DskipTests -DfailIfNoTests=false spotless:checkpassed, so the previous core spotless issue is fixed- changed-file Python flake8 passed
Please run/apply spotless for the Flink test file too, then I can re-approve.
leaves12138
left a comment
There was a problem hiding this comment.
LGTM. Thanks for the update.
I rechecked the latest head and the remaining formatting blockers are fixed. The core and Flink spotless issues from CI are addressed, and the Python lint fix remains clean.
Latest local verification:
git diff --cached --checkmvn -pl paimon-core,paimon-flink/paimon-flink-common -am -DskipTests -DfailIfNoTests=false spotless:checkpython -m flake8 --config=./dev/cfg.ini pypaimon/globalindex/build_plan.py pypaimon/globalindex/create_global_index.py pypaimon/tests/global_index_build_test.py
Earlier focused validation on the same implementation:
PYTHONPATH=$PWD/paimon-python python -m pytest paimon-python/pypaimon/tests/global_index_build_test.py -q(24 passed)mvn -pl paimon-flink/paimon-flink-common -am -Pfast-build -DfailIfNoTests=false -Dtest=GenericIndexTopoBuilderTest,SortedIndexTopoBuilderTest test(33 passed)mvn -pl paimon-spark/paimon-spark-common -am -Pfast-build -DfailIfNoTests=false -Dtest=CreateGlobalIndexProcedureTest test
Summary
This PR makes global index build incremental by default across Java/Flink/Spark/Python and extracts shared planning logic for reusable row-range/shard handling.
Changes
globalindex.build_planmodule and route Python index builders through the same planning model.Testing
python -m pytest paimon-python/pypaimon/tests/global_index_build_test.pymvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=GlobalIndexBuilderUtilsTest,SortedGlobalIndexBuilderTest testmvn -pl paimon-flink/paimon-flink-common -am -Pfast-build -DfailIfNoTests=false -Dtest=SortedIndexTopoBuilderTest,GenericIndexTopoBuilderTest testmvn -pl paimon-spark/paimon-spark-common -am -Pfast-build -DfailIfNoTests=false -Dtest=CreateGlobalIndexProcedureTest testNotes
The default/procedure build path now skips already indexed data and only builds unindexed row ranges. Explicit max-indexed-row-id entry points keep their explicit semantics.