Skip to content

[SPARK-56406][SS] Stream-stream join v4: skip writing secondary index if the operator will not evict from that side#55271

Open
HeartSaVioR wants to merge 2 commits intoapache:masterfrom
HeartSaVioR:SPARK-56406
Open

[SPARK-56406][SS] Stream-stream join v4: skip writing secondary index if the operator will not evict from that side#55271
HeartSaVioR wants to merge 2 commits intoapache:masterfrom
HeartSaVioR:SPARK-56406

Conversation

@HeartSaVioR
Copy link
Copy Markdown
Contributor

@HeartSaVioR HeartSaVioR commented Apr 9, 2026

What changes were proposed in this pull request?

This PR proposes to skip writing secondary index if the operator will not evict from that side. For simplicity, we keep creating column family and just skip writing to secondary index.

The way to understand whether the operator won't evict from that side is following:

It's very obvious for the case where both sides do not have event time column - both sides do not evict any state at all.

The tricky case is when one side has event time column and another side can deduce from it to evict the state row.

  • equality join (event time column is in join key): non-watermarked side can actually evict the state.
  • time-interval join (event time column is in value side): "neither" side can actually evict the state since one side is unbound and other side has to be relative with it.

For the former, the logic is able to detect the ability for non-watermarked side and write secondary index for it. (See how joinKeyOrdinalForWatermark is constructed.)
For the latter, technically, we can skip "both" sides to skip writing secondary index, but that's fairly minor case and the logic only enables non-watermarked side to skip writing secondary index. (The PR left the potential optimization as TODO code comment, but we are not going to file a JIRA ticket since we don't know whether we ever demand it.)

The main coverage of this optimization is a regular join where both sides do not have event time column; the coverage of watermark on only one side is a sort of bonus.

For safety net, evict*** methods will raise an exception if the operator has skipped writing secondary index, since it is NOT expected for these methods to be called.

Why are the changes needed?

There are several cases where the operator never leverages the secondary index on one (or both) side(s), so there is no value to write to the secondary index.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude 4.6 Opus

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant