Optimise phrasescorer matches #15861

Open
iprithv wants to merge 1 commit into apache:main from iprithv:optimize-phrasescorer-matches

@iprithv iprithv commented Mar 21, 2026

Description

Optimise PhraseScorer by short-circuiting non-competitive documents in TOP_SCORES mode.

In PhraseScorer.TwoPhaseIterator.matches(), the matcher.reset() call is moved after the competitive score check. This allows skipping the expensive reset() work for documents whose maximum possible score falls below minCompetitiveScore.
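
The reordering can be illustrated with a minimal, self-contained sketch. This is not the Lucene source: `ToyMatcher`, `CountingMatcher`, `ToyPhraseMatches`, and the toy `score` function are hypothetical stand-ins, and the real similarity scorer is assumed only to be non-decreasing in frequency.

```java
/** Hypothetical stand-in for a phrase matcher; not Lucene's PhraseMatcher API. */
interface ToyMatcher {
  float maxFreq(); // cheap upper bound on the phrase frequency

  void reset(); // expensive per-document setup

  boolean nextMatch(); // position verification
}

/** Counts reset() calls so the effect of the reordering is observable. */
final class CountingMatcher implements ToyMatcher {
  private final float maxFreq;
  int resets = 0;

  CountingMatcher(float maxFreq) {
    this.maxFreq = maxFreq;
  }

  @Override
  public float maxFreq() {
    return maxFreq;
  }

  @Override
  public void reset() {
    resets++;
  }

  @Override
  public boolean nextMatch() {
    return true; // the toy always finds a match once positions are checked
  }
}

final class ToyPhraseMatches {
  /** Toy score that increases with frequency; stands in for the similarity scorer. */
  static float score(float freq) {
    return freq / (freq + 1f);
  }

  static boolean matches(ToyMatcher matcher, boolean topScores, float minCompetitiveScore) {
    // Competitive check first: it only needs the frequency upper bound.
    if (topScores && minCompetitiveScore > 0 && score(matcher.maxFreq()) < minCompetitiveScore) {
      return false; // non-competitive document: reset() never runs
    }
    matcher.reset(); // paid only for documents that survive the check
    return matcher.nextMatch();
  }
}
```

In this toy, a document with `maxFreq() == 1` scores 0.5 and is rejected against a threshold of 0.9 without `reset()` being called, while the same document outside TOP_SCORES mode goes through `reset()` as before.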

Why this matters

reset() is expensive because it:

  • Calls PostingsEnum.freq(), which decodes a full PFOR block per term (see Lucene104PostingsReader)
  • For exact phrases: initialises position iteration state
  • For sloppy phrases: rebuilds a priority queue via initPhrasePositions()

By checking maxFreq() first and short-circuiting when the document can't be competitive, all of this work is avoided.

Implementation

To support calling maxFreq() before reset(), both ExactPhraseMatcher and SloppyPhraseMatcher now load term frequencies eagerly inside maxFreq() and track this via a boolean freqsLoaded flag:

  • When reset() sees freqsLoaded == true (maxFreq was called), it skips redundant freq() calls
  • When maxFreq() was not called (non-TOP_SCORES mode, or before minCompetitiveScore is established), reset() follows the original single-loop code path with zero added overhead

The boolean flag approach avoids the cost of a virtual approximation.docID() call that would be needed with a lastDocId guard.
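
A rough sketch of the flag mechanism, under stated assumptions: `ToyPostings` and `ToyExactMatcher` are hypothetical classes, not the matchers changed in this PR. The description does not say how the flag is cleared when the iterator moves to the next document, so the `onNewDocument()` hook below is purely an assumption made to keep the toy correct.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a postings cursor; counts how often freq() is called. */
final class ToyPostings {
  private final int freq;
  int freqCalls = 0;

  ToyPostings(int freq) {
    this.freq = freq;
  }

  int freq() {
    freqCalls++; // stands in for the costly decode
    return freq;
  }
}

/** Toy matcher illustrating a freqsLoaded flag; not Lucene's ExactPhraseMatcher. */
final class ToyExactMatcher {
  private final ToyPostings[] postings;
  private final int[] freqs;
  private final int[] upTo; // per-term position cursor, reset for each document
  private boolean freqsLoaded = false;

  ToyExactMatcher(ToyPostings... postings) {
    this.postings = postings;
    this.freqs = new int[postings.length];
    this.upTo = new int[postings.length];
  }

  /** Assumed hook (not from the PR): called when moving to a new document. */
  void onNewDocument() {
    freqsLoaded = false;
  }

  /** Upper bound on the phrase frequency: the smallest term frequency. */
  float maxFreq() {
    if (!freqsLoaded) {
      for (int i = 0; i < postings.length; i++) {
        freqs[i] = postings[i].freq();
      }
      freqsLoaded = true;
    }
    int min = Integer.MAX_VALUE;
    for (int f : freqs) {
      min = Math.min(min, f);
    }
    return min;
  }

  void reset() {
    if (freqsLoaded) {
      // maxFreq() already loaded the frequencies: only reset position state.
      Arrays.fill(upTo, 0);
    } else {
      // Original path: a single loop loads frequencies and resets position state.
      for (int i = 0; i < postings.length; i++) {
        freqs[i] = postings[i].freq();
        upTo[i] = 0;
      }
      freqsLoaded = true;
    }
  }
}
```

Calling `maxFreq()` and then `reset()` on the same document triggers one `freq()` call per term in this sketch, the same number as calling `reset()` alone.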

Testing

  • All 89 phrase-related tests pass
  • All org.apache.lucene.search.* tests pass
  • Added TestPhraseMatcherContract: directly tests that maxFreq() can be called before reset() for both exact and sloppy matchers
  • Added JMH benchmark (PhraseScorerBenchmark) with 3 forks, 5 warmup iterations, 5s measurement intervals

Benchmarking

JMH benchmark (1M docs, 0.1% phrase matches, 50% docs with terms but no phrase):

| Benchmark | Baseline (main) | Candidate (this PR) | Delta |
|---|---|---|---|
| benchmarkExactTopScores | 2.767 ± 0.181 ops/ms | 2.907 ± 0.261 ops/ms | +5.1% |
| benchmarkSloppyTopScores | 0.070 ± 0.002 ops/ms | 0.071 ± 0.004 ops/ms | +1.4% |
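
For reference, the Delta figures are consistent with a plain relative change of the mean throughput, (candidate - baseline) / baseline. The helper below is a hypothetical illustration of that arithmetic; it is not part of the PR or of JMH, and it ignores the reported ± error values.

```java
import java.util.Locale;

/** Hypothetical helper reproducing the Delta arithmetic from the mean throughputs. */
final class BenchmarkDelta {
  /** Relative change of candidate over baseline, in percent. */
  static double relativeDeltaPercent(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
  }

  /** Formats the delta with an explicit sign and one decimal place. */
  static String format(double baseline, double candidate) {
    return String.format(Locale.ROOT, "%+.1f%%", relativeDeltaPercent(baseline, candidate));
  }
}
```

With the means reported above, `BenchmarkDelta.format(2.767, 2.907)` gives `+5.1%` and `BenchmarkDelta.format(0.070, 0.071)` gives `+1.4%`.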

@github-actions github-actions bot added this to the 11.0.0 milestone Mar 21, 2026
@iprithv iprithv force-pushed the optimize-phrasescorer-matches branch 3 times, most recently from 8646c87 to b2b6c29 Compare March 24, 2026 19:01
…n TOP_SCORES mode

In PhraseScorer.TwoPhaseIterator.matches(), move matcher.reset() after the
competitive score check so that non-competitive documents skip the expensive
reset work entirely. reset() calls PostingsEnum.freq(), which decodes a full
PFOR block per term, plus initialises position iteration state (and for
sloppy phrases, rebuilds a priority queue). Skipping this for documents
whose maximum possible score is below minCompetitiveScore avoids significant
wasted work.

To support calling maxFreq() before reset(), both ExactPhraseMatcher and
SloppyPhraseMatcher now load term frequencies eagerly inside maxFreq() and
track this via a boolean flag. When reset() sees freqs are already loaded,
it skips redundant freq() calls. When maxFreq() was not called (non-TOP_SCORES
mode, or before minCompetitiveScore is established), reset() follows the
original single-loop code path with zero added overhead.

Also improves the JMH benchmark parameters (3 forks, 5 warmup iterations,
5s measurement intervals) for more statistically reliable results.
@iprithv iprithv force-pushed the optimize-phrasescorer-matches branch from b2b6c29 to ce9e8f5 Compare March 24, 2026 19:09

github-actions bot commented Apr 8, 2026

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Apr 8, 2026