[improvement](statistics) Collect hot value during full statistics analysis by yujun777 · Pull Request #62295 · apache/doris

yujun777 · 2026-04-09T11:55:11Z

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary: Previously, full statistics collection (ANALYZE TABLE ... WITH SYNC)
hardcoded null as hot_value in FULL_ANALYZE_TEMPLATE, meaning hot values (frequent
values) were only collected during sample-based analysis. This made full analysis
produce less complete statistics than sample analysis.

This change rewrites FULL_ANALYZE_TEMPLATE to use a CTE-based structure (matching
LINEAR_ANALYZE_TEMPLATE and DUJ1_ANALYZE_TEMPLATE patterns) that computes hot values
via GROUP BY + TOP-N aggregation. Both OlapAnalysisTask and ExternalAnalysisTask are
updated to pass the required parameters (hotValueCollectCount, subStringColName,
rowCount2) to the template.

Release note

Full statistics collection now also collects hot value (frequent value) information,
matching the behavior of sample-based collection.

Check List (For Author)

Test: Regression test (test_hot_value, test_full_analyze_hot_value) / Unit Test (OlapAnalysisTaskTest, HMSAnalysisTaskTest)
Behavior changed: Yes - full analyze now produces hot_value instead of null
Does this need documentation: No

Thearas · 2026-04-09T11:55:17Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

yujun777 · 2026-04-09T11:55:23Z

run buildall

yujun777 · 2026-04-09T11:55:24Z

/review

github-actions

Found 1 issue.

fe/fe-core/src/main/java/org/apache/doris/statistics/OlapAnalysisTask.java: the new full-analyze hot-value logic is only applied in the non-partition branch. When enable_partition_analyze is enabled for a partitioned OLAP table, doFull() still delegates to doPartitionTable(), and that flow eventually merges partition stats with MERGE_PARTITION_TEMPLATE, which still hardcodes null as hot_value in BaseAnalysisTask. So the advertised behavior change is incomplete for a supported code path and full analyze still returns no hot values there. The new tests all force enablePartitionAnalyze() to false, so they do not cover this case.

Critical checkpoint conclusions:

Goal of the task: Partially met. Full analyze hot values are implemented for the direct full-scan path and external SQL path, but not for the partition-analyze path used by partitioned OLAP tables when enable_partition_analyze is on. Existing tests do not prove end-to-end coverage of that branch.
Minimality and clarity: The change is focused, but it misses a functionally parallel path (doPartitionTable() / MERGE_PARTITION_TEMPLATE).
Concurrency: No new concurrency or locking risk identified in the modified code.
Lifecycle/static initialization: No special lifecycle or static-init issue identified.
Configuration: No new config is added, but behavior now depends on the existing enable_partition_analyze switch; the change is not applied consistently across its enabled path.
Compatibility: No incompatible FE/BE protocol or storage-format change identified.
Parallel code paths: Not fully handled; the partition-analyze full path still behaves differently from the non-partition full path.
Special conditional checks: The partition/full branch condition is straightforward; the issue is the missing equivalent update behind the partition-enabled branch.
Test coverage: Added unit and regression tests cover only the non-partition full path. Missing a regression/unit test with partition analyze enabled.
Observability: No additional observability gap beyond the functional bug above.
Transaction/persistence: No transaction or persistence issue identified in the modified areas.
Data writes/modifications: The bug affects correctness of collected statistics (hot_value) for partitioned full analyze, but I did not find atomicity or crash-safety issues in the changed code itself.
FE/BE variable passing: Not applicable here.
Performance: The new query shape is heavier than the old one, but no obvious new hot-path regression beyond the intended extra aggregation.
Other issues: None beyond the missing partition-path update above.

I did not run builds or tests in this review runner.

github-actions · 2026-04-09T12:09:16Z

fe/fe-core/src/main/java/org/apache/doris/statistics/OlapAnalysisTask.java

            doPartitionTable();
        } else {
-            StringSubstitutor stringSubstitutor = new StringSubstitutor(buildSqlParams());
+            Map<String, String> params = buildSqlParams();


This only fixes the non-partition full-analyze branch. When enable_partition_analyze is true and the table is partitioned, doFull() still goes through doPartitionTable(), which eventually merges with MERGE_PARTITION_TEMPLATE in BaseAnalysisTask and that template still hardcodes null as hot_value. In that configuration, ANALYZE TABLE ... WITH SYNC will continue to produce no hot values for partitioned OLAP tables, so the behavior change described in the PR is still incomplete. The new tests all stub enablePartitionAnalyze() to false, so they won't catch this path.

we no handle with case enable_partition_analyze = true

yujun777 · 2026-04-09T14:55:31Z

run buildall

…alysis ### What problem does this PR solve? Issue Number: close #xxx Problem Summary: Previously, full statistics collection (ANALYZE TABLE ... WITH SYNC) hardcoded `null as hot_value` in FULL_ANALYZE_TEMPLATE, meaning hot values (frequent values) were only collected during sample-based analysis. This made full analysis produce less complete statistics than sample analysis. This change rewrites FULL_ANALYZE_TEMPLATE to use a CTE-based structure (matching LINEAR_ANALYZE_TEMPLATE and DUJ1_ANALYZE_TEMPLATE patterns) that computes hot values via GROUP BY + TOP-N aggregation. Both OlapAnalysisTask and ExternalAnalysisTask are updated to pass the required parameters (hotValueCollectCount, subStringColName, rowCount2) to the template. ### Release note Full statistics collection now also collects hot value (frequent value) information, matching the behavior of sample-based collection. ### Check List (For Author) - Test: Regression test (test_hot_value, test_full_analyze_hot_value) / Unit Test (OlapAnalysisTaskTest, HMSAnalysisTaskTest) - Behavior changed: Yes - full analyze now produces hot_value instead of null - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions bot reviewed Apr 9, 2026

View reviewed changes

yujun777 force-pushed the master branch from 28cfc9f to e380000 Compare April 9, 2026 12:42

yujun777 force-pushed the master branch from e380000 to 4b9481b Compare April 10, 2026 01:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[improvement](statistics) Collect hot value during full statistics analysis#62295

[improvement](statistics) Collect hot value during full statistics analysis#62295
yujun777 wants to merge 1 commit intoapache:masterfrom
yujun777:master

yujun777 commented Apr 9, 2026 •

edited by morrySnow

Loading

Uh oh!

Thearas commented Apr 9, 2026

Uh oh!

yujun777 commented Apr 9, 2026

Uh oh!

yujun777 commented Apr 9, 2026

Uh oh!

github-actions bot left a comment

Uh oh!

github-actions bot Apr 9, 2026

Uh oh!

yujun777 Apr 9, 2026

Uh oh!

yujun777 commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yujun777 commented Apr 9, 2026 • edited by morrySnow Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Release note

Check List (For Author)

Uh oh!

Thearas commented Apr 9, 2026

Uh oh!

yujun777 commented Apr 9, 2026

Uh oh!

yujun777 commented Apr 9, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

yujun777 Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

yujun777 commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yujun777 commented Apr 9, 2026 •

edited by morrySnow

Loading