Skip to content

[fix](fe) Fix VSlotRef invalid slot id in INTERSECT/EXCEPT with ExprId reuse#62296

Open
yujun777 wants to merge 1 commit intoapache:masterfrom
yujun777:fix/intersect-exprid-reuse
Open

[fix](fe) Fix VSlotRef invalid slot id in INTERSECT/EXCEPT with ExprId reuse#62296
yujun777 wants to merge 1 commit intoapache:masterfrom
yujun777:fix/intersect-exprid-reuse

Conversation

@yujun777
Copy link
Copy Markdown
Contributor

@yujun777 yujun777 commented Apr 9, 2026

What problem does this PR solve?

Issue Number: close #CIR-19889

Problem Summary:
When the Nereids optimizer produces set operations (INTERSECT/EXCEPT/UNION DISTINCT) where ExprIds are reused between:

  1. A set operation's output and its child's output (parent-child sharing)
  2. Multiple children's outputs (sibling sharing)

The plan translator in visitPhysicalSetOperation() would translate children's result expressions and distribute expressions AFTER all children had been visited. Since each child visit calls generateTupleDesc() which overwrites the ExprId→SlotRef mapping via HashMap.put(), later children overwrite earlier children's mappings. This caused the translated result expressions to reference wrong slots, leading to BE errors: VSlotRef have invalid slot id.

The fix interleaves child visits with result/distribute expression translation — translating each child's expressions immediately after visiting that child, before the next child's visit can overwrite the ExprId→SlotRef mapping.

Reproducing query:

WITH
  tbl0(`n`) AS (SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT NULL),
  tbl1(`n`) AS (SELECT 2 UNION ALL SELECT NULL UNION ALL SELECT 1)
(
  SELECT (`n` * 2) AS `n` FROM tbl0
  INTERSECT
  SELECT (`n` * 2) AS `n` FROM tbl1
);
-- Error: [INTERNAL_ERROR]VSlotRef have invalid slot id: 3
-- Expected: 2

Release note

Fixed a bug where INTERSECT/EXCEPT queries with CTEs and expressions could fail with "VSlotRef have invalid slot id" error when the optimizer reuses ExprIds across set operation children.

Check List (For Author)

  • Test: Regression test added (set_operation_exprid_reuse.groovy)
  • Behavior changed: No
  • Does this need documentation: No

@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 9, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

run buildall

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  1. High - The fix is incomplete because the same ExprId->SlotRef overwrite pattern still exists in the functionally parallel recursive CTE translator path (visitPhysicalRecursiveUnion()), so recursive unions can still bind child result expressions to the wrong slots when ExprIds are reused.

Critical checkpoints

  • Goal of current task: Partially achieved. visitPhysicalSetOperation() is fixed and covered by regression tests, but the same translator invariant is still violated in visitPhysicalRecursiveUnion().
  • Modification size/focus: Yes, the patch is small and focused.
  • Concurrency: Not involved.
  • Special lifecycle/static init: No special lifecycle concerns in the changed code.
  • Configuration: None added.
  • Compatibility: No FE/BE protocol or storage compatibility change.
  • Parallel code paths: Not fully covered. visitPhysicalRecursiveUnion() is a parallel path that still uses the old translation order.
  • Special conditional checks: The new ordering comment is clear; no extra conditional-risk issue found.
  • Test coverage: Added good regression coverage for non-recursive set operations, but there is no recursive CTE regression for the same ExprId-reuse invariant.
  • Observability: No additional observability needed for this translator-only fix.
  • Transaction/persistence: Not involved.
  • Data writes/modifications: Not involved.
  • FE/BE variable passing: Not involved.
  • Performance: Acceptable; the change is still linear and does not add meaningful overhead.
  • Other issues: None beyond the missed recursive CTE path.

@Override
public PlanFragment visitPhysicalSetOperation(
PhysicalSetOperation setOperation, PlanTranslatorContext context) {
// Translate each child's result exprs and distribute exprs immediately after
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High: this only fixes visitPhysicalSetOperation(), but visitPhysicalRecursiveUnion() still follows the old order: visit both children, call generateTupleDesc(recursiveCte.getOutput(), ...), then translate recursiveCte.getRegularChildrenOutputs() (see lines 2302-2328 in the current file). That path uses the same global PlanTranslatorContext.exprIdToSlotRef map, so later child visits and the parent tuple creation still overwrite earlier mappings. Because LogicalRecursiveUnion.buildNewOutputs() can preserve an existing output ExprId, a recursive CTE with reused ExprIds can still translate to the wrong SlotRef and hit the same invalid-slot failure. Please apply the same interleaving strategy here and add a regression under recursive_cte.

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (12/12) 🎉
Increment coverage report
Complete coverage report

…d reuse

Issue Number: close #CIR-19889

Problem Summary:
When the Nereids optimizer produces set operations (INTERSECT/EXCEPT/UNION DISTINCT)
where ExprIds are reused between:
1. A set operation output and its child output (parent-child sharing)
2. Multiple children outputs (sibling sharing)

The plan translator in visitPhysicalSetOperation() would translate children result
expressions and distribute expressions AFTER all children had been visited. Since
each child visit calls generateTupleDesc() which overwrites the ExprId→SlotRef
mapping via HashMap.put(), later children overwrite earlier children mappings.
This caused the translated result expressions to reference wrong slots, leading to
BE errors: "VSlotRef have invalid slot id".

The fix interleaves child visits with result/distribute expression translation,
translating each child expressions immediately after visiting that child, before
the next child visit can overwrite the ExprId→SlotRef mapping.

Fixed a bug where INTERSECT/EXCEPT queries with CTEs and expressions could fail
with "VSlotRef have invalid slot id" error when the optimizer reuses ExprIds
across set operation children.

- Test: Regression test added (set_operation_exprid_reuse.groovy)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yujun777 yujun777 force-pushed the fix/intersect-exprid-reuse branch from bf4213e to 8aa64f0 Compare April 9, 2026 14:48
@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

run buildall

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

/review

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

Re: review comment on visitPhysicalRecursiveUnion() — Applied the same interleaving fix. Result exprs and distribute exprs are now translated immediately after each child visit, before generateTupleDesc() can overwrite the ExprId→SlotRef mappings. Verified with recursive CTE tests.

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue found.

  1. regression-test/data/query_p0/set_operations/set_operation_exprid_reuse.out encodes the wrong expected result for intersect_nulls. Doris already treats NULL as equal for set operations: the existing query_p0/set_operations/set_with_null suite expects INTERSECT ... SELECT NULL to return \N, not to drop it. The new expected output only contains 1, so this regression test would bless a semantic regression instead of protecting against it.

Critical checkpoint conclusions:

  • Goal / correctness / test proof: The goal is to fix ExprId -> SlotRef overwrite during set-operation translation. The code change in PhysicalPlanTranslator.visitPhysicalSetOperation() does address that overwrite hazard by translating each child immediately after visiting it, and the added suite exercises the reported failing shapes. However, one expected result is incorrect, so the current tests do not yet prove the fix safely.
  • Scope / minimality: The code change is small and focused. Extending the same interleaving pattern to visitPhysicalRecursiveUnion() is also consistent with the same mutable-map mechanism.
  • Concurrency: No new concurrency or lock-safety concerns in the touched code path.
  • Lifecycle / static initialization: No special lifecycle or static-init concerns introduced here.
  • Config changes: None.
  • Compatibility: No FE/BE protocol or storage-format compatibility changes detected.
  • Parallel code paths: The analogous recursive CTE path was updated too, which is good. I did not find another remaining set-operation translation path with the same overwrite pattern in this file.
  • Special condition checks: The new comments explain the ExprId reuse hazard clearly enough.
  • Test coverage: Coverage improved for INTERSECT / EXCEPT / UNION DISTINCT and nested shapes, but the intersect_nulls oracle is inconsistent with existing Doris semantics and should be corrected. No recursive CTE regression was added for the newly touched parallel path.
  • Observability: No additional observability seems necessary for this FE translation fix.
  • Transaction / persistence: Not applicable.
  • Data writes / modifications: Not applicable beyond regression-test DDL/DML setup.
  • FE/BE variable passing: No new variables added.
  • Performance: The interleaving avoids extra retranslation hazards without adding meaningful overhead.
  • Other issues: None beyond the incorrect expected result above.

30 Z

-- !intersect_nulls --
1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INTERSECT in Doris treats NULL as equal in set semantics. The existing query_p0/set_operations/set_with_null suite already expects SELECT k1 FROM d_table INTERSECT SELECT NULL to return \N. Recording only 1 here would bless a semantic regression in the new test rather than verify the translator fix. This case should include the NULL row as well.

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (25/25) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants