Skip to content

[fix](profile) restore InvertedIndexQueryTime sub-timer invariants in SEARCH path#62301

Open
airborne12 wants to merge 2 commits intoapache:masterfrom
airborne12:search-profile-fix-timer-hierarchy
Open

[fix](profile) restore InvertedIndexQueryTime sub-timer invariants in SEARCH path#62301
airborne12 wants to merge 2 commits intoapache:masterfrom
airborne12:search-profile-fix-timer-hierarchy

Conversation

@airborne12
Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

The SEARCH path (FunctionSearch::evaluate_inverted_index_with_search_param) currently lays out its profile sub-timers in a way that breaks two invariants that already hold in the MATCH path (FullTextIndexReader::query):

(a) InvertedIndexQueryTime        ≈ InvertedIndexSearcherOpenTime
                                  + InvertedIndexSearcherSearchTime
(b) InvertedIndexSearcherSearchTime ≈ InvertedIndexSearcherSearchInitTime
                                    + InvertedIndexSearcherSearchExecTime

Observed on a production SEARCH query scanning 237 segments (cumulative):

InvertedIndexFilterTime:            1min1sec
InvertedIndexQueryTime:             8sec617ms
InvertedIndexSearcherOpenTime:      5sec698ms
InvertedIndexSearcherSearchTime:    8sec612ms
InvertedIndexSearcherSearchInitTime:5sec704ms
InvertedIndexSearcherSearchExecTime:62.42us
  • (a) fails: SearchTime + OpenTime = 14310ms, far exceeds QueryTime = 8617ms. Root cause: OpenTime is scoped inside FieldReaderResolver::resolve, which is called from build_query_recursive, which was itself wrapped by SearchInitTimeSearchTime. So OpenTime was nested inside SearchTime, not a sibling, and adding them as siblings double-counts the open phase.
  • (b) fails: SearchTime − SearchInitTime − SearchExecTime ≈ 2.9 s of completely untimed work. Root cause: weight->scorer(...) — which is where per-term posting-list iterators are actually opened — lived between the old SearchInitTime (only wrapping build_query_recursive) and SearchExecTime (only wrapping the result-collection loop), so it was not attributed to any sub-timer.

This PR restructures the execution path in evaluate_inverted_index_with_search_param to mirror the MATCH path semantics: QueryTime wraps everything; inside QueryTime, OpenTime and SearchTime are siblings; inside SearchTime, SearchInitTime and SearchExecTime are siblings.

Changes in be/src/exprs/function/function_search.cpp:

  • Move build_query_recursive(...) outside the search_timer scope. It still opens searchers via FieldReaderResolver::resolve, so OpenTime accumulates as before — it is simply no longer nested inside SearchTime.
  • Redefine SearchInitTime to wrap Query::weight(...) — the SEARCH analog of query->add(query_info) in MATCH: preparing the executable query structure.
  • Extend SearchExecTime to wrap the collect_multi_segment_* iteration (where weight->scorer(...) opens posting-list iterators) plus the lazy null-bitmap fetch that follows.
  • Move InvertedIndexResultBitmap construction, mask_out_null(), and the DSL cache insert outside SearchTime — still inside QueryTime — mirroring how FullTextIndexReader::query does cache->insert(...) only after match_index_search returns.

Complementary diagnostic in SegmentIterator::_get_row_ranges_by_column_conditions:

InvertedIndexFilterTime now has three sibling child timers so that FilterTime ≈ ApplyColPred + ApplyExpr + PostFilter holds by construction, making the FilterTime ↔ QueryTime gap always explainable from the profile alone:

  • InvertedIndexApplyColPredTime — wraps _apply_inverted_index()
  • InvertedIndexApplyExprTime — wraps _apply_index_expr() (contains InvertedIndexQueryTime)
  • InvertedIndexPostFilterTime — wraps post-filter bitmap &= + per-column _check_all_conditions_passed_inverted_index_for_column

Also in be/src/exprs/vsearch.cpp (VSearchExpr::evaluate_inverted_index):

  • Demote the per-segment LOG(INFO) / LOG(WARNING) on benign early-exit paths to VLOG_DEBUG. At production scan rates these add measurable cumulative wall-clock to FilterTime while contributing nothing useful at INFO level.
  • Pass the result bitmap to set_index_result_for_expr via std::move to avoid a per-segment deep copy of the underlying roaring bitmap.

Verification:

After the fix, small-test profile now shows:

QueryTime:             603.740us
  OpenTime:              297.914us   \_ sum = 452.68us ≈ QueryTime (drift ~fixed overhead)
  SearchTime:            154.771us   /
    SearchInitTime:       74.950us   \_ sum = 151.19us ≈ SearchTime (2.3% drift)
    SearchExecTime:       76.236us   /

Projected on the production profile: QueryTime ≈ OpenTime (5.7s) + SearchTime (2.9s) ≈ 8.6s, invariant (a) restored; SearchTime ≈ SearchInitTime (2.9s) + SearchExecTime (~4ms), invariant (b) restored; the 2.9 s of previously-untimed scorer construction is now attributed to SearchInitTime.

Release note

Fix InvertedIndexSearcher*Time profile sub-timer hierarchy in the search() function path so that QueryTime ≈ OpenTime + SearchTime and SearchTime ≈ SearchInitTime + SearchExecTime, matching MATCH-path semantics. Also add three sub-timers (InvertedIndexApplyColPredTime, InvertedIndexApplyExprTime, InvertedIndexPostFilterTime) under InvertedIndexFilterTime so that the filter block's wall time is fully attributed.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason

    Unit test added: be/test/storage/segment/segment_iterator_filter_sub_timer_test.cpp — pins both invariants by running SCOPED_RAW_TIMER on the OlapReaderStatistics fields in the same nesting shape as the production code and asserting parent ≥ sum(children) with a sub-ms drift bound. Any future regression that re-nests OpenTime inside SearchTime or drops a sub-timer around the scorer-construction phase will flip these assertions.

    Manual test: rebuilt BE, deployed to a local cloud-sim cluster, ran SELECT id FROM t_search_metrics WHERE search('content:doris') ORDER BY id with enable_profile=true profile_level=2, verified the new counters appear and that QueryTime ≈ OpenTime + SearchTime and SearchTime ≈ SearchInitTime + SearchExecTime.

  • Behavior changed:

    • No.
    • Yes.

    This PR only changes profile-timer scope placement and two non-functional micro-optimizations (LOG → VLOG_DEBUG, one std::move). The query result is unchanged.

  • Does this need documentation?

    • No.
    • Yes.

… SEARCH path

The SEARCH path (`FunctionSearch::evaluate_inverted_index_with_search_param`)
recently grew `inverted_index_searcher_{open,search,search_init,search_exec}_timer`
instrumentation, but with a scope layout that breaks two profile invariants
that already hold in the MATCH path:

  (a) InvertedIndexQueryTime        ≈ InvertedIndexSearcherOpenTime
                                    + InvertedIndexSearcherSearchTime
  (b) InvertedIndexSearcherSearchTime ≈ InvertedIndexSearcherSearchInitTime
                                      + InvertedIndexSearcherSearchExecTime

Observed on a production SEARCH query scanning 237 segments:

  InvertedIndexFilterTime:        1min1sec
  InvertedIndexQueryTime:         8sec617ms
  InvertedIndexSearcherOpenTime:  5sec698ms
  InvertedIndexSearcherSearchTime:8sec612ms
  InvertedIndexSearcherSearchInitTime: 5sec704ms
  InvertedIndexSearcherSearchExecTime: 62.42us

  (a) fails: SearchTime + OpenTime = 14310ms, far exceeds QueryTime 8617ms
      (because OpenTime is nested inside SearchInitTime inside SearchTime,
       so adding them as siblings double-counts the open phase).
  (b) fails: SearchTime - SearchInitTime - SearchExecTime = 2908ms of
      completely untimed work (the `weight->scorer(...)` call, which is
      where the per-term posting-list iterators are actually opened, was
      outside any sub-timer).

Fix: restructure the execution path in
`evaluate_inverted_index_with_search_param` to mirror MATCH path semantics,
i.e. `FullTextIndexReader::query`:

  QueryTime wraps everything; inside QueryTime, OpenTime and SearchTime are
  siblings; inside SearchTime, SearchInitTime and SearchExecTime are
  siblings.

- Move `build_query_recursive(...)` OUTSIDE the `search_timer` scope. The
  call still opens searchers via `FieldReaderResolver::resolve`, so OpenTime
  accumulates as before — it's just no longer nested inside SearchTime.
- Define SearchInitTime to wrap `Query::weight(...)` (preparing the
  executable query structure — the SEARCH analog of `query->add(query_info)`
  in MATCH).
- Define SearchExecTime to wrap the `collect_multi_segment_*` iteration
  (which is where `weight->scorer(...)` opens posting-list iterators) plus
  the lazy null-bitmap fetch that follows.
- Move `InvertedIndexResultBitmap` construction, `mask_out_null()`, and the
  DSL cache insert OUTSIDE SearchTime — still inside QueryTime — matching
  how `FullTextIndexReader::query` does `cache->insert(...)` only after
  `match_index_search` returns.

As a complementary diagnostic, `InvertedIndexFilterTime` now has three
sibling child timers in `SegmentIterator::_get_row_ranges_by_column_conditions`:

  InvertedIndexApplyColPredTime  — `_apply_inverted_index()`
  InvertedIndexApplyExprTime     — `_apply_index_expr()`
                                   (contains InvertedIndexQueryTime)
  InvertedIndexPostFilterTime    — post-filter bitmap `&=` + per-column
                                   `_check_all_conditions_passed_...`

so that FilterTime ≈ ApplyColPred + ApplyExpr + PostFilter holds by
construction, making the FilterTime ↔ QueryTime gap always explainable
from the profile alone.

Also in `VSearchExpr::evaluate_inverted_index`:

- Demote the per-segment `LOG(INFO)` / `LOG(WARNING)` on benign early-exit
  paths to `VLOG_DEBUG`. At production scan rates these add measurable
  cumulative wall-clock to FilterTime while contributing nothing useful
  at INFO level.
- Pass the result bitmap to `set_index_result_for_expr` via `std::move`
  to avoid a per-segment deep copy of the underlying roaring bitmap.

New unit test `segment_iterator_filter_sub_timer_test.cpp` pins both
invariants by running `SCOPED_RAW_TIMER` on the `OlapReaderStatistics`
fields in the same nesting shape as the production code and asserting
`parent ≥ sum(children)` with a sub-ms drift bound. Any future regression
that re-nests OpenTime inside SearchTime or drops a sub-timer around the
scorer-construction phase will flip these assertions.
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@airborne12
Copy link
Copy Markdown
Member Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 86.60% (84/97) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.87% (26725/37186)
Line Coverage 54.86% (283131/516082)
Region Coverage 51.79% (234008/451828)
Branch Coverage 53.28% (101266/190049)

…uption via op_not aliasing

The previous commit included a "micro-optimization" in
`VSearchExpr::evaluate_inverted_index`:

    index_context->set_index_result_for_expr(this, std::move(result_bitmap));

intending to skip a per-segment deep copy of the underlying roaring bitmap.

This is wrong: at the point of that call, the DSL result cache
(`InvertedIndexQueryCache`) already holds a `shared_ptr` to the same
underlying `roaring::Roaring` as `result_bitmap._data_bitmap`
(`function_search.cpp::evaluate_inverted_index_with_search_param` inserts
into the cache *before* the function returns). The previous lvalue
`set_index_result_for_expr(this, result_bitmap)` worked precisely because
the parameter is `InvertedIndexResultBitmap` by value, so its copy
constructor deep-copies the Roaring and isolates the per-segment expr-
context map entry from the cache.

With `std::move`, the per-segment map entry instead aliases the cached
Roaring. `VCompoundPred::COMPOUND_NOT` then calls
`InvertedIndexResultBitmap::op_not()`, which is declared `const` but
mutates `*_data_bitmap` in place (the const-ness only applies to the
shared_ptr member, not the pointee). That mutation corrupts the DSL
cache entry, so when the *next* segment hits the cache it observes the
*negated* roaring instead of the original SEARCH match. The same hazard
exists for `op_and` / `op_or` (both also mutate through the shared_ptr).

Symptoms — apache#62301 CI, both
`Doris_DorisRegression_P0Regression` and `Doris_DorisRegression_CloudP0`,
4 new test failures for the queries that go through compound NOT/AND
on top of `search()`:

  - search/test_search_null_regression           (line 115:
    `WHERE not search('content:Round', '{"mode":"standard"}')`)
  - search/test_search_dsl_syntax                (line 305: NOT search
    + IS NOT NULL)
  - search/test_search_vs_match_consistency      (line 200: AND with NULL)
  - search/test_search_not_null_bitmap           (line 79: NOT search
    + count agg)

For each, the test got rows that the SEARCH side clearly matches (e.g.
row 3 of `search_null_regression_test`, whose `content` is "A story about
Ronald McDonald at the Round table"), proving the cached search result
had been negated by an earlier compound NOT.

Reverting to the lvalue form restores deep-copy isolation between the
DSL cache and the per-segment expr-context bitmap. Verified on a local
cluster: after reverting, `WHERE not search('content:Round', ...)`
returns the expected rows {1, 2, 5, 11, 13} and stays stable across
repeated invocations (previously it drifted as the cache got corrupted
on segment-2+).

Note: this `op_not`/`op_and`/`op_or` mutation pattern through a `const`
shared_ptr accessor is itself fragile. A separate, larger refactor to
make those operators non-const (or to deep-copy in the operator) would
be the right long-term fix; this patch is the minimal revert that
restores correctness without changing the operator API.
@airborne12
Copy link
Copy Markdown
Member Author

run buildall

The previous CI run flagged 4 "new" failures in Doris_DorisRegression_P0Regression and Doris_DorisRegression_CloudP0, all in search tests that exercise compound NOT/AND on top of search():

  • search/test_search_null_regression (line 115)
  • search/test_search_dsl_syntax (line 305)
  • search/test_search_vs_match_consistency (line 200)
  • search/test_search_not_null_bitmap (line 79)

Root cause: the previous commit included a std::move(result_bitmap) micro-optimization in VSearchExpr::evaluate_inverted_index. At that point the DSL result cache (InvertedIndexQueryCache) already holds a shared_ptr to the same underlying roaring::Roaring as result_bitmap._data_bitmap. Moving (instead of lvalue-copying) into set_index_result_for_expr makes the per-segment expr-context map entry alias the cached Roaring, and VCompoundPred::COMPOUND_NOT then calls InvertedIndexResultBitmap::op_not() — which is declared const but writes through the shared_ptr (*_data_bitmap = *universe - *_data_bitmap - *_null_bitmap). That mutation corrupts the cached bitmap, so the next segment that hits the cache observes the negated SEARCH result.

Fixed in b53104b: revert to the lvalue form, so the value-by-copy parameter of set_index_result_for_expr deep-copies the Roaring and isolates the cache from per-segment mutations.

Verified locally: WHERE not search('content:Round', '{"mode":"standard"}') returns the expected rows {1, 2, 5, 11, 13} and stays stable across 5 repeated invocations (previously drifted as the cache got corrupted on segment-2+).

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 33.33% (32/96) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.03% (20136/37971)
Line Coverage 36.57% (189296/517664)
Region Coverage 32.82% (146947/447695)
Branch Coverage 33.94% (64309/189467)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 45.83% (44/96) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.52% (21388/37186)
Line Coverage 40.55% (209248/516082)
Region Coverage 36.76% (166098/451828)
Branch Coverage 37.64% (71538/190049)

1 similar comment
@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 45.83% (44/96) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.52% (21388/37186)
Line Coverage 40.55% (209248/516082)
Region Coverage 36.76% (166098/451828)
Branch Coverage 37.64% (71538/190049)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 45.83% (44/96) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.78% (21487/37186)
Line Coverage 40.86% (210849/516082)
Region Coverage 37.20% (168069/451828)
Branch Coverage 38.05% (72305/190049)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 64.58% (62/96) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 63.71% (23693/37186)
Line Coverage 47.23% (243739/516082)
Region Coverage 44.10% (199265/451828)
Branch Coverage 45.44% (86362/190049)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants