Skip to content

[improvement](statistics) Collect hot value during full statistics analysis#62295

Open
yujun777 wants to merge 1 commit intoapache:masterfrom
yujun777:master
Open

[improvement](statistics) Collect hot value during full statistics analysis#62295
yujun777 wants to merge 1 commit intoapache:masterfrom
yujun777:master

Conversation

@yujun777
Copy link
Copy Markdown
Contributor

@yujun777 yujun777 commented Apr 9, 2026

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary: Previously, full statistics collection (ANALYZE TABLE ... WITH SYNC)
hardcoded null as hot_value in FULL_ANALYZE_TEMPLATE, meaning hot values (frequent
values) were only collected during sample-based analysis. This made full analysis
produce less complete statistics than sample analysis.

This change rewrites FULL_ANALYZE_TEMPLATE to use a CTE-based structure (matching
LINEAR_ANALYZE_TEMPLATE and DUJ1_ANALYZE_TEMPLATE patterns) that computes hot values
via GROUP BY + TOP-N aggregation. Both OlapAnalysisTask and ExternalAnalysisTask are
updated to pass the required parameters (hotValueCollectCount, subStringColName,
rowCount2) to the template.

Release note

Full statistics collection now also collects hot value (frequent value) information,
matching the behavior of sample-based collection.

Check List (For Author)

  • Test: Regression test (test_hot_value, test_full_analyze_hot_value) / Unit Test (OlapAnalysisTaskTest, HMSAnalysisTaskTest)
  • Behavior changed: Yes - full analyze now produces hot_value instead of null
  • Does this need documentation: No

@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 9, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

run buildall

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 issue.

  1. fe/fe-core/src/main/java/org/apache/doris/statistics/OlapAnalysisTask.java: the new full-analyze hot-value logic is only applied in the non-partition branch. When enable_partition_analyze is enabled for a partitioned OLAP table, doFull() still delegates to doPartitionTable(), and that flow eventually merges partition stats with MERGE_PARTITION_TEMPLATE, which still hardcodes null as hot_value in BaseAnalysisTask. So the advertised behavior change is incomplete for a supported code path and full analyze still returns no hot values there. The new tests all force enablePartitionAnalyze() to false, so they do not cover this case.

Critical checkpoint conclusions:

  • Goal of the task: Partially met. Full analyze hot values are implemented for the direct full-scan path and external SQL path, but not for the partition-analyze path used by partitioned OLAP tables when enable_partition_analyze is on. Existing tests do not prove end-to-end coverage of that branch.
  • Minimality and clarity: The change is focused, but it misses a functionally parallel path (doPartitionTable() / MERGE_PARTITION_TEMPLATE).
  • Concurrency: No new concurrency or locking risk identified in the modified code.
  • Lifecycle/static initialization: No special lifecycle or static-init issue identified.
  • Configuration: No new config is added, but behavior now depends on the existing enable_partition_analyze switch; the change is not applied consistently across its enabled path.
  • Compatibility: No incompatible FE/BE protocol or storage-format change identified.
  • Parallel code paths: Not fully handled; the partition-analyze full path still behaves differently from the non-partition full path.
  • Special conditional checks: The partition/full branch condition is straightforward; the issue is the missing equivalent update behind the partition-enabled branch.
  • Test coverage: Added unit and regression tests cover only the non-partition full path. Missing a regression/unit test with partition analyze enabled.
  • Observability: No additional observability gap beyond the functional bug above.
  • Transaction/persistence: No transaction or persistence issue identified in the modified areas.
  • Data writes/modifications: The bug affects correctness of collected statistics (hot_value) for partitioned full analyze, but I did not find atomicity or crash-safety issues in the changed code itself.
  • FE/BE variable passing: Not applicable here.
  • Performance: The new query shape is heavier than the old one, but no obvious new hot-path regression beyond the intended extra aggregation.
  • Other issues: None beyond the missing partition-path update above.

I did not run builds or tests in this review runner.

doPartitionTable();
} else {
StringSubstitutor stringSubstitutor = new StringSubstitutor(buildSqlParams());
Map<String, String> params = buildSqlParams();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only fixes the non-partition full-analyze branch. When enable_partition_analyze is true and the table is partitioned, doFull() still goes through doPartitionTable(), which eventually merges with MERGE_PARTITION_TEMPLATE in BaseAnalysisTask and that template still hardcodes null as hot_value. In that configuration, ANALYZE TABLE ... WITH SYNC will continue to produce no hot values for partitioned OLAP tables, so the behavior change described in the PR is still incomplete. The new tests all stub enablePartitionAnalyze() to false, so they won't catch this path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we no handle with case enable_partition_analyze = true

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

run buildall

…alysis

### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary: Previously, full statistics collection (ANALYZE TABLE ... WITH SYNC)
hardcoded `null as hot_value` in FULL_ANALYZE_TEMPLATE, meaning hot values (frequent
values) were only collected during sample-based analysis. This made full analysis
produce less complete statistics than sample analysis.

This change rewrites FULL_ANALYZE_TEMPLATE to use a CTE-based structure (matching
LINEAR_ANALYZE_TEMPLATE and DUJ1_ANALYZE_TEMPLATE patterns) that computes hot values
via GROUP BY + TOP-N aggregation. Both OlapAnalysisTask and ExternalAnalysisTask are
updated to pass the required parameters (hotValueCollectCount, subStringColName,
rowCount2) to the template.

### Release note

Full statistics collection now also collects hot value (frequent value) information,
matching the behavior of sample-based collection.

### Check List (For Author)

- Test: Regression test (test_hot_value, test_full_analyze_hot_value) / Unit Test (OlapAnalysisTaskTest, HMSAnalysisTaskTest)
- Behavior changed: Yes - full analyze now produces hot_value instead of null
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants