[fix](variant) normalize legacy single-part dot-key subcolumn paths on read#62409
[fix](variant) normalize legacy single-part dot-key subcolumn paths on read#62409csun5285 wants to merge 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
1d0ffa8 to
b649c8f
Compare
|
run buildall |
b649c8f to
06bb9c6
Compare
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
1 similar comment
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
No blocking issues found in the scoped PR files. The change is small and targeted, and the new unit tests cover both the direct normalization helper and the legacy-footer reader-init regression path.\n\nCritical checkpoints:\n- Goal of current task: Met. The PR normalizes legacy single-part dot-key subcolumn paths during , which matches the described compaction/read-path mismatch, and the added BE tests exercise the intended fix.\n- Modification size/focus: Yes. The actual PR on GitHub is limited to , , and the associated BE test.\n- Concurrency: No new concurrency concern identified. The normalization happens during reader initialization before the path is inserted into the in-memory subcolumn meta tree.\n- Lifecycle/static initialization: No special lifecycle or static-init risk identified.\n- Configuration items: None added in this PR.\n- Incompatible changes / compatibility: The change is compatibility-oriented on read and does not change on-disk format.\n- Parallel code paths: The relevant reader-init path is updated where the legacy protobuf path parts are consumed. I did not find another matching path in this PR that also required the same normalization.\n- Special conditional checks: Reasonable. The helper explicitly bails out when nested/array metadata is present so it does not silently discard non-default part metadata.\n- Test coverage: Good for the scope of the fix. There is direct unit coverage for normalization shapes plus a footer-mutation regression test covering the legacy serialized form.\n- Observability: No extra observability appears necessary for this narrowly scoped compatibility fix.\n- Transaction/persistence: Not applicable.\n- Data writes/modifications: The change reduces an existing data-loss risk during compaction by making legacy read metadata match current query/compaction path expectations.\n- FE/BE variable passing: Not applicable.\n- Performance: Negligible impact; the normalization is a small init-time operation on affected legacy paths only.\n- Other issues: None blocking found in the PR files under review.\n\nResidual risk: this intentionally interprets legacy flat dot-keys using the modern dot-split path shape, so any corner case not represented by the added tests would most likely be around unusual path metadata combinations. The current bail-out coverage makes that risk look acceptable.
There was a problem hiding this comment.
No blocking issues found in the scoped PR files. The change is small and targeted, and the new unit tests cover both the direct PathInData normalization helper and the legacy-footer reader-init regression path.
Critical checkpoints:
- Goal of current task: Met. The PR normalizes legacy single-part dot-key subcolumn paths during
VariantColumnReader::init(), which matches the described compaction/read-path mismatch, and the added BE tests exercise the intended fix. - Modification size/focus: Yes. The actual PR on GitHub is limited to
variant_column_reader,PathInData, and the associated BE test. - Concurrency: No new concurrency concern identified. The normalization happens during reader initialization before the path is inserted into the in-memory subcolumn meta tree.
- Lifecycle/static initialization: No special lifecycle or static-init risk identified.
- Configuration items: None added in this PR.
- Incompatible changes / compatibility: The change is compatibility-oriented on read and does not change on-disk format.
- Parallel code paths: The relevant reader-init path is updated where the legacy protobuf path parts are consumed. I did not find another matching path in this PR that also required the same normalization.
- Special conditional checks: Reasonable. The helper explicitly bails out when nested/array metadata is present so it does not silently discard non-default part metadata.
- Test coverage: Good for the scope of the fix. There is direct unit coverage for normalization shapes plus a footer-mutation regression test covering the legacy serialized form.
- Observability: No extra observability appears necessary for this narrowly scoped compatibility fix.
- Transaction/persistence: Not applicable.
- Data writes/modifications: The change reduces an existing data-loss risk during compaction by making legacy read metadata match current query/compaction path expectations.
- FE/BE variable passing: Not applicable.
- Performance: Negligible impact; the normalization is a small init-time operation on affected legacy paths only.
- Other issues: None blocking found in the PR files under review.
Residual risk: this intentionally interprets legacy flat dot-keys using the modern dot-split path shape, so any corner case not represented by the added tests would most likely be around unusual path metadata combinations. The current bail-out coverage makes that risk look acceptable.
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
06bb9c6 to
15f042f
Compare
|
run buildall |
1 similar comment
|
run buildall |
|
/review |
There was a problem hiding this comment.
Requesting changes because the new read-path normalization is not semantics-preserving.
Critical checkpoints:
- Goal of the task: Partially met. The PR tries to make legacy flat dot-key metadata readable again, but the current fix also rewrites legitimate flat-key paths into nested paths. The added test proves the single-path lookup case, not the conflicting/coexisting case that matters for correctness.
- Is the modification as small, clear, and focused as possible?: Yes, the code change is small and focused, but the normalization is too broad for the compatibility problem it is trying to solve.
- Concurrency: Not involved in this change.
- Lifecycle / static initialization: Not involved in this change.
- Configuration: No new configuration added.
- Incompatible changes / compatibility: Blocking issue. Rebuilding every non-typed, non-nested path from its dotted string changes persisted path semantics for legacy rowsets and can merge a flat key
"a.b"with a real nested path"a" -> "b". - Functionally parallel code paths: Yes.
get_subcolumn_meta_by_path()already had a localized dotted-string fallback for leaf lookup; moving this rewrite intoinit()changes the canonical_subcolumns_meta_infotree used by all downstream readers and compaction logic. - Special conditional checks: The
!get_is_typed() && !has_nested_part()guard is not sufficient to identify only malformed legacy metadata. Legitimate flat keys also satisfy it. - Test coverage: Insufficient for this fix. The new unit test covers one rewritten legacy path, but it does not cover coexistence of
{"a.b": ...}and{"a": {"b": ...}}, which Doris currently treats as distinct structures. - Observability: No additional observability concerns for this patch.
- Transaction / persistence / data writes: High risk. Once the reader tree conflates those two paths, compaction can read or rewrite the wrong subcolumn values.
- FE/BE variable passing: Not applicable.
- Performance: Neutral.
- Other issues: None beyond the blocking correctness issue above.
I did not find evidence that the current implementation preserves the existing distinction between flat dot-keys and nested keys, so I cannot approve it in its current form.
…n read
Old versions (cloud-4.1.2 with variant_max_subcolumns_count=0) store a
flat JSON key like {"a.b": 1} as a single PathInData part "a.b" in the
segment's ColumnPathInfo protobuf. New master compaction builds query
paths via PathInData(string) which splits on dots, so the 3-part path
does not match the 1-part tree node, find_leaf returns nullptr, and the
read plan silently falls back to DEFAULT_FILL. Compaction then rewrites
the rowset with empty values, losing the original data.
Normalize the legacy relative path in VariantColumnReader::init() by
re-constructing PathInData from its string form. This splits the
single-part dot key into multiple parts so the meta tree shape matches
what both compaction schema and query path expect.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
15f042f to
c6d0982
Compare
|
run buildall |
|
/review |
|
run buildall |
There was a problem hiding this comment.
Overall opinion: looks good to merge. I did not find any blocking issues.
Critical checkpoint conclusions:
- Goal of the task: The PR aims to preserve legacy variant data with flat dot-keys during read/compaction by normalizing malformed legacy
PathInDatametadata inVariantColumnReader::init(). The code change accomplishes that goal, and the added BE unit test demonstrates the malformed-footer case is normalized into the expected in-memory tree shape. - Scope / minimality: The change is small and focused: one read-side normalization in
be/src/storage/segment/variant/variant_column_reader.cppplus one targeted unit test inbe/test/storage/segment/variant_column_writer_reader_test.cpp. - Concurrency: No new concurrency is introduced. The change runs during reader initialization before the tree is published for concurrent lookup, so I did not find new lock-order or thread-safety risks.
- Lifecycle / static initialization: No special lifecycle or static-initialization concerns were introduced.
- Configuration: No new config items were added.
- Compatibility: This is a backward-compatibility fix for legacy on-disk metadata. It changes only the in-memory tree shape used by the reader and does not change persisted formats, so mixed-version storage compatibility risk appears low.
- Parallel code paths: The existing query-side fallback for dotted paths is still present, and this PR closes the missing reader-init / compaction-side path-shape gap. I did not find another obviously missed parallel path in the modified flow.
- Special conditional checks: The new condition is limited to untyped, non-nested paths, which is consistent with the legacy malformed shape being fixed and avoids touching typed / nested metadata.
- Test coverage: The new BE unit test is relevant and targeted. Residual risk: it validates normalized metadata lookup rather than running an end-to-end compaction round-trip, so the exact compaction rewrite path is still covered indirectly rather than directly.
- Observability: No additional logs or metrics look necessary for this localized metadata normalization.
- Transaction / persistence: Not applicable; this is a read-side metadata interpretation fix and does not add EditLog or persistence changes.
- Data writes / modifications: No new write-path behavior is introduced. The intent is to prevent silent data loss during compaction; I did not find a new atomicity or crash-consistency issue in the patch itself.
- FE/BE variable passing: Not applicable.
- Performance: The extra
PathInData(relative_path.get_path())rebuild happens once per subcolumn during init and is negligible. - Other issues: No additional blocking issues found.
Residual risk / note:
- I would still like broader regression coverage for the full compaction round-trip scenario over a legacy-produced segment, but I do not consider that gap a blocker for this targeted fix.
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
Problem
Legacy variant data from 3.1.x stores a flat JSON key like
{"a.b": 1}as a singlePathInDatapart"a.b". Master's compaction rebuilds the schema viaPathInData(string), which splits on dots and produces a multi-part shape that no longer matches the legacy single-part tree.Fix
In
VariantColumnReader::init(), when adding a subcolumn to the meta tree, re-split its relative path via the string constructor if it has no typed / nested metadata. This aligns the tree shape with the new compaction schema, while the query-side fallback already handles the reverse direction.What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)