Optimise phrasescorer matches by iprithv · Pull Request #15861 · apache/lucene

iprithv · 2026-03-21T16:03:31Z

Description

Optimise PhraseScorer by short-circuiting non-competitive documents in TOP_SCORES mode.

In PhraseScorer.TwoPhaseIterator.matches(), the matcher.reset() call is moved after the competitive score check. This allows skipping the expensive reset() work for documents whose maximum possible score falls below minCompetitiveScore.
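
As a rough illustration of the reordering, here is a minimal sketch of the control flow. `SketchMatcher`, `SketchScoreBound`, and `MatchesSketch` are hypothetical stand-ins invented for this example, not Lucene's actual classes, and the real method also checks the score mode and has access to the similarity scorer.

```java
// Hypothetical, simplified stand-ins; these are NOT Lucene's actual types.
interface SketchMatcher {
    float maxFreq();      // upper bound on the phrase frequency for the current doc
    void reset();         // expensive: prepares position iteration for the current doc
    boolean nextMatch();  // advances to the next phrase occurrence, if any
}

interface SketchScoreBound {
    float maxScore(float freq); // upper bound on the score for a given frequency
}

final class MatchesSketch {
    private MatchesSketch() {}

    /** Returns true if the current doc has a phrase match worth scoring. */
    static boolean matches(SketchMatcher matcher, SketchScoreBound bound, float minCompetitiveScore) {
        // Assumption: a positive threshold means a competitive score has been established.
        if (minCompetitiveScore > 0) {
            float maxScore = bound.maxScore(matcher.maxFreq());
            if (maxScore < minCompetitiveScore) {
                return false; // non-competitive doc: reset() is never called
            }
        }
        matcher.reset(); // only paid for docs that may still be competitive
        return matcher.nextMatch();
    }
}
```

With a matcher that counts its `reset()` calls, a document whose score bound falls below the threshold should leave that counter untouched.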

Why this matters

reset() is expensive because it:

- Calls PostingsEnum.freq() which decodes a full PFOR block per term (see Lucene104PostingsReader)
- For exact phrases: initialises position iteration state
- For sloppy phrases: rebuilds a priority queue via initPhrasePositions()

By checking maxFreq() first and short-circuiting when the document can't be competitive, all of this work is avoided.

Implementation

To support calling maxFreq() before reset(), both ExactPhraseMatcher and SloppyPhraseMatcher now load term frequencies eagerly inside maxFreq() and track this via a boolean freqsLoaded flag:

- When reset() sees freqsLoaded == true (maxFreq() was called), it skips redundant freq() calls
- When maxFreq() was not called (non-TOP_SCORES mode, or before minCompetitiveScore is established), reset() follows the original single-loop code path with zero added overhead
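
The flag handshake between maxFreq() and reset() might look roughly like the sketch below. `SketchPostings` and `FlaggedMatcherSketch` are hypothetical names for illustration only; the bound used here (the smallest term frequency) is a simplification, and the actual matchers in the PR may organise this differently.

```java
// Hypothetical stand-in for a per-term postings cursor; NOT Lucene's PostingsEnum.
interface SketchPostings {
    int freq(); // assumed to be costly in the real implementation
}

final class FlaggedMatcherSketch {
    private final SketchPostings[] postings;
    private final int[] freqs;
    private boolean freqsLoaded; // true once maxFreq() has read freqs for the current doc

    FlaggedMatcherSketch(SketchPostings[] postings) {
        this.postings = postings;
        this.freqs = new int[postings.length];
    }

    /** Loads term frequencies eagerly and returns the smallest one as an upper bound. */
    float maxFreq() {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < postings.length; i++) {
            freqs[i] = postings[i].freq();
            min = Math.min(min, freqs[i]);
        }
        freqsLoaded = true;
        return min;
    }

    void reset() {
        if (freqsLoaded) {
            // maxFreq() already read the freqs: only the position setup would remain.
            freqsLoaded = false; // the next doc must load fresh values
            return;
        }
        // Original-style single loop: read each freq, then set up positions.
        for (int i = 0; i < postings.length; i++) {
            freqs[i] = postings[i].freq();
        }
    }
}
```

This sketch assumes that once a caller starts calling maxFreq() it does so for every subsequent document; a skipped document leaves the flag set, which is harmless here only because the next maxFreq() reloads the frequencies anyway.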

The boolean flag approach avoids the cost of a virtual approximation.docID() call that would be needed with a lastDocId guard.

Testing

- All 89 phrase-related tests pass
- All org.apache.lucene.search.* tests pass
- Added TestPhraseMatcherContract: directly tests that maxFreq() can be called before reset() for both exact and sloppy matchers
- Added JMH benchmark (PhraseScorerBenchmark) with 3 forks, 5 warmup iterations, 5s measurement intervals

Benchmarking

JMH benchmark (1M docs, 0.1% phrase matches, 50% docs with terms but no phrase):

| Benchmark | Baseline (main) | Candidate (this PR) | Delta |
| --- | --- | --- | --- |
| benchmarkExactTopScores | 2.767 ± 0.181 ops/ms | 2.907 ± 0.261 ops/ms | +5.1% |
| benchmarkSloppyTopScores | 0.070 ± 0.002 ops/ms | 0.071 ± 0.004 ops/ms | +1.4% |
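
For readers checking the Delta column, the following sketch recomputes it as the relative change in mean throughput and tests whether the reported error ranges overlap. `ThroughputDelta` and its methods are hypothetical helpers written for this illustration; treating the ± values as simple symmetric intervals is an assumption, not a statement about how JMH derives them.

```java
// Hypothetical helper for sanity-checking the table; not part of the PR.
final class ThroughputDelta {
    private ThroughputDelta() {}

    /** Relative change of candidate over baseline, in percent. */
    static double percentDelta(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    /** True if [m1 - e1, m1 + e1] and [m2 - e2, m2 + e2] share at least one point. */
    static boolean intervalsOverlap(double m1, double e1, double m2, double e2) {
        return m1 - e1 <= m2 + e2 && m2 - e2 <= m1 + e1;
    }
}
```

Calling `ThroughputDelta.percentDelta(2.767, 2.907)` and `ThroughputDelta.percentDelta(0.070, 0.071)` reproduces the two deltas after rounding to one decimal place. Under the interval assumption above, the baseline and candidate ranges overlap in both rows, so the measured gains may lie within run-to-run noise.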


github-actions · 2026-04-08T00:41:19Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the module:core/search label Mar 21, 2026

github-actions bot added this to the 11.0.0 milestone Mar 21, 2026

iprithv force-pushed the optimize-phrasescorer-matches branch 3 times, most recently from 8646c87 to b2b6c29 Compare March 24, 2026 19:01

iprithv force-pushed the optimize-phrasescorer-matches branch from b2b6c29 to ce9e8f5 Compare March 24, 2026 19:09

github-actions bot added the Stale label Apr 8, 2026

iprithv wants to merge 1 commit into apache:main from iprithv:optimize-phrasescorer-matches
