[fix](variant) normalize legacy single-part dot-key subcolumn paths on read by csun5285 · Pull Request #62409 · apache/doris

csun5285 · 2026-04-12T16:18:49Z

Problem

Legacy variant data from 3.1.x stores a flat JSON key like {"a.b": 1} as a single PathInData part "a.b". Master's compaction rebuilds the schema via PathInData(string), which splits on dots and produces a multi-part shape that no longer matches the legacy single-part tree.

Fix

In VariantColumnReader::init(), when adding a subcolumn to the meta tree, re-split its relative path via the string constructor if it has no typed / nested metadata. This aligns the tree shape with the new compaction schema, while the query-side fallback already handles the reverse direction.

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Thearas · 2026-04-12T16:18:55Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

csun5285 · 2026-04-12T16:18:57Z

run buildall

csun5285 · 2026-04-13T06:34:09Z

run buildall

hello-stephen · 2026-04-13T12:47:05Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (25/25) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	63.22% (23512/37190)
Line Coverage	46.76% (241351/516160)
Region Coverage	43.72% (197581/451879)
Branch Coverage	45.03% (85592/190058)

hello-stephen · 2026-04-13T14:49:38Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (25/25) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	63.22% (23512/37190)
Line Coverage	46.76% (241351/516160)
Region Coverage	43.72% (197581/451879)
Branch Coverage	45.03% (85592/190058)

hello-stephen · 2026-04-13T15:08:48Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (25/25) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	63.23% (23517/37190)
Line Coverage	46.77% (241404/516160)
Region Coverage	43.69% (197444/451879)
Branch Coverage	45.04% (85601/190058)

csun5285 · 2026-04-14T03:52:30Z

/review

github-actions

No blocking issues found in the scoped PR files. The change is small and targeted, and the new unit tests cover both the direct normalization helper and the legacy-footer reader-init regression path.\n\nCritical checkpoints:\n- Goal of current task: Met. The PR normalizes legacy single-part dot-key subcolumn paths during , which matches the described compaction/read-path mismatch, and the added BE tests exercise the intended fix.\n- Modification size/focus: Yes. The actual PR on GitHub is limited to , , and the associated BE test.\n- Concurrency: No new concurrency concern identified. The normalization happens during reader initialization before the path is inserted into the in-memory subcolumn meta tree.\n- Lifecycle/static initialization: No special lifecycle or static-init risk identified.\n- Configuration items: None added in this PR.\n- Incompatible changes / compatibility: The change is compatibility-oriented on read and does not change on-disk format.\n- Parallel code paths: The relevant reader-init path is updated where the legacy protobuf path parts are consumed. I did not find another matching path in this PR that also required the same normalization.\n- Special conditional checks: Reasonable. The helper explicitly bails out when nested/array metadata is present so it does not silently discard non-default part metadata.\n- Test coverage: Good for the scope of the fix. There is direct unit coverage for normalization shapes plus a footer-mutation regression test covering the legacy serialized form.\n- Observability: No extra observability appears necessary for this narrowly scoped compatibility fix.\n- Transaction/persistence: Not applicable.\n- Data writes/modifications: The change reduces an existing data-loss risk during compaction by making legacy read metadata match current query/compaction path expectations.\n- FE/BE variable passing: Not applicable.\n- Performance: Negligible impact; the normalization is a small init-time operation on affected legacy paths only.\n- Other issues: None blocking found in the PR files under review.\n\nResidual risk: this intentionally interprets legacy flat dot-keys using the modern dot-split path shape, so any corner case not represented by the added tests would most likely be around unusual path metadata combinations. The current bail-out coverage makes that risk look acceptable.

github-actions

No blocking issues found in the scoped PR files. The change is small and targeted, and the new unit tests cover both the direct PathInData normalization helper and the legacy-footer reader-init regression path.

Critical checkpoints:

Goal of current task: Met. The PR normalizes legacy single-part dot-key subcolumn paths during VariantColumnReader::init(), which matches the described compaction/read-path mismatch, and the added BE tests exercise the intended fix.
Modification size/focus: Yes. The actual PR on GitHub is limited to variant_column_reader, PathInData, and the associated BE test.
Concurrency: No new concurrency concern identified. The normalization happens during reader initialization before the path is inserted into the in-memory subcolumn meta tree.
Lifecycle/static initialization: No special lifecycle or static-init risk identified.
Configuration items: None added in this PR.
Incompatible changes / compatibility: The change is compatibility-oriented on read and does not change on-disk format.
Parallel code paths: The relevant reader-init path is updated where the legacy protobuf path parts are consumed. I did not find another matching path in this PR that also required the same normalization.
Special conditional checks: Reasonable. The helper explicitly bails out when nested/array metadata is present so it does not silently discard non-default part metadata.
Test coverage: Good for the scope of the fix. There is direct unit coverage for normalization shapes plus a footer-mutation regression test covering the legacy serialized form.
Observability: No extra observability appears necessary for this narrowly scoped compatibility fix.
Transaction/persistence: Not applicable.
Data writes/modifications: The change reduces an existing data-loss risk during compaction by making legacy read metadata match current query/compaction path expectations.
FE/BE variable passing: Not applicable.
Performance: Negligible impact; the normalization is a small init-time operation on affected legacy paths only.
Other issues: None blocking found in the PR files under review.

Residual risk: this intentionally interprets legacy flat dot-keys using the modern dot-split path shape, so any corner case not represented by the added tests would most likely be around unusual path metadata combinations. The current bail-out coverage makes that risk look acceptable.

hello-stephen · 2026-04-14T03:58:58Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (25/25) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	63.19% (23499/37190)
Line Coverage	46.74% (241273/516160)
Region Coverage	43.67% (197353/451879)
Branch Coverage	45.02% (85561/190058)

csun5285 · 2026-04-14T12:24:39Z

run buildall

csun5285 · 2026-04-14T12:29:30Z

run buildall

csun5285 · 2026-04-14T12:29:36Z

/review

github-actions

Requesting changes because the new read-path normalization is not semantics-preserving.

Critical checkpoints:

Goal of the task: Partially met. The PR tries to make legacy flat dot-key metadata readable again, but the current fix also rewrites legitimate flat-key paths into nested paths. The added test proves the single-path lookup case, not the conflicting/coexisting case that matters for correctness.
Is the modification as small, clear, and focused as possible?: Yes, the code change is small and focused, but the normalization is too broad for the compatibility problem it is trying to solve.
Concurrency: Not involved in this change.
Lifecycle / static initialization: Not involved in this change.
Configuration: No new configuration added.
Incompatible changes / compatibility: Blocking issue. Rebuilding every non-typed, non-nested path from its dotted string changes persisted path semantics for legacy rowsets and can merge a flat key "a.b" with a real nested path "a" -> "b".
Functionally parallel code paths: Yes. get_subcolumn_meta_by_path() already had a localized dotted-string fallback for leaf lookup; moving this rewrite into init() changes the canonical _subcolumns_meta_info tree used by all downstream readers and compaction logic.
Special conditional checks: The !get_is_typed() && !has_nested_part() guard is not sufficient to identify only malformed legacy metadata. Legitimate flat keys also satisfy it.
Test coverage: Insufficient for this fix. The new unit test covers one rewritten legacy path, but it does not cover coexistence of {"a.b": ...} and {"a": {"b": ...}}, which Doris currently treats as distinct structures.
Observability: No additional observability concerns for this patch.
Transaction / persistence / data writes: High risk. Once the reader tree conflates those two paths, compaction can read or rewrite the wrong subcolumn values.
FE/BE variable passing: Not applicable.
Performance: Neutral.
Other issues: None beyond the blocking correctness issue above.

I did not find evidence that the current implementation preserves the existing distinction between flat dot-keys and nested keys, so I cannot approve it in its current form.

…n read Old versions (cloud-4.1.2 with variant_max_subcolumns_count=0) store a flat JSON key like {"a.b": 1} as a single PathInData part "a.b" in the segment's ColumnPathInfo protobuf. New master compaction builds query paths via PathInData(string) which splits on dots, so the 3-part path does not match the 1-part tree node, find_leaf returns nullptr, and the read plan silently falls back to DEFAULT_FILL. Compaction then rewrites the rowset with empty values, losing the original data. Normalize the legacy relative path in VariantColumnReader::init() by re-constructing PathInData from its string form. This splits the single-part dot key into multiple parts so the meta tree shape matches what both compaction schema and query path expect. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

csun5285 · 2026-04-14T12:44:35Z

run buildall

csun5285 · 2026-04-14T12:44:46Z

/review

csun5285 · 2026-04-14T12:53:12Z

run buildall

github-actions

Overall opinion: looks good to merge. I did not find any blocking issues.

Critical checkpoint conclusions:

Goal of the task: The PR aims to preserve legacy variant data with flat dot-keys during read/compaction by normalizing malformed legacy PathInData metadata in VariantColumnReader::init(). The code change accomplishes that goal, and the added BE unit test demonstrates the malformed-footer case is normalized into the expected in-memory tree shape.
Scope / minimality: The change is small and focused: one read-side normalization in be/src/storage/segment/variant/variant_column_reader.cpp plus one targeted unit test in be/test/storage/segment/variant_column_writer_reader_test.cpp.
Concurrency: No new concurrency is introduced. The change runs during reader initialization before the tree is published for concurrent lookup, so I did not find new lock-order or thread-safety risks.
Lifecycle / static initialization: No special lifecycle or static-initialization concerns were introduced.
Configuration: No new config items were added.
Compatibility: This is a backward-compatibility fix for legacy on-disk metadata. It changes only the in-memory tree shape used by the reader and does not change persisted formats, so mixed-version storage compatibility risk appears low.
Parallel code paths: The existing query-side fallback for dotted paths is still present, and this PR closes the missing reader-init / compaction-side path-shape gap. I did not find another obviously missed parallel path in the modified flow.
Special conditional checks: The new condition is limited to untyped, non-nested paths, which is consistent with the legacy malformed shape being fixed and avoids touching typed / nested metadata.
Test coverage: The new BE unit test is relevant and targeted. Residual risk: it validates normalized metadata lookup rather than running an end-to-end compaction round-trip, so the exact compaction rewrite path is still covered indirectly rather than directly.
Observability: No additional logs or metrics look necessary for this localized metadata normalization.
Transaction / persistence: Not applicable; this is a read-side metadata interpretation fix and does not add EditLog or persistence changes.
Data writes / modifications: No new write-path behavior is introduced. The intent is to prevent silent data loss during compaction; I did not find a new atomicity or crash-consistency issue in the patch itself.
FE/BE variable passing: Not applicable.
Performance: The extra PathInData(relative_path.get_path()) rebuild happens once per subcolumn during init and is negligible.
Other issues: No additional blocking issues found.

Residual risk / note:

I would still like broader regression coverage for the full compaction round-trip scenario over a legacy-produced segment, but I do not consider that gap a blocker for this targeted fix.

hello-stephen · 2026-04-14T15:19:55Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.62% (27394/37208)
Line Coverage	57.28% (295863/516491)
Region Coverage	54.58% (246816/452210)
Branch Coverage	56.18% (106845/190198)

csun5285 force-pushed the fix/variant-dot-key-access branch 5 times, most recently from 1d0ffa8 to b649c8f Compare April 13, 2026 06:33

csun5285 force-pushed the fix/variant-dot-key-access branch from b649c8f to 06bb9c6 Compare April 13, 2026 06:35

eldenmoon reviewed Apr 13, 2026

View reviewed changes

Comment thread be/src/util/json/path_in_data.cpp Outdated

eldenmoon reviewed Apr 13, 2026

View reviewed changes

Comment thread be/src/storage/segment/variant/variant_column_reader.cpp Outdated

github-actions bot reviewed Apr 14, 2026

View reviewed changes

csun5285 force-pushed the fix/variant-dot-key-access branch from 06bb9c6 to 15f042f Compare April 14, 2026 12:12

github-actions bot requested changes Apr 14, 2026

View reviewed changes

Comment thread be/src/storage/segment/variant/variant_column_reader.cpp

csun5285 force-pushed the fix/variant-dot-key-access branch from 15f042f to c6d0982 Compare April 14, 2026 12:42

csun5285 closed this Apr 14, 2026

csun5285 reopened this Apr 14, 2026

github-actions bot reviewed Apr 14, 2026

View reviewed changes

Conversation

csun5285 commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

What problem does this PR solve?

Release note

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

Thearas commented Apr 12, 2026

Uh oh!

csun5285 commented Apr 12, 2026

Uh oh!

csun5285 commented Apr 13, 2026

Uh oh!

Uh oh!

Uh oh!

hello-stephen commented Apr 13, 2026

BE Regression && UT Coverage Report

Uh oh!

hello-stephen commented Apr 13, 2026

BE Regression && UT Coverage Report

Uh oh!

hello-stephen commented Apr 13, 2026

BE Regression && UT Coverage Report

Uh oh!

csun5285 commented Apr 14, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

hello-stephen commented Apr 14, 2026

BE Regression && UT Coverage Report

Uh oh!

csun5285 commented Apr 14, 2026

Uh oh!

csun5285 commented Apr 14, 2026

Uh oh!

csun5285 commented Apr 14, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

csun5285 commented Apr 14, 2026

Uh oh!

csun5285 commented Apr 14, 2026

Uh oh!

csun5285 commented Apr 14, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

hello-stephen commented Apr 14, 2026

BE Regression && UT Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

csun5285 commented Apr 12, 2026 •

edited

Loading