[improvement](statistics) Collect hot value during full statistics analysis#62295
[improvement](statistics) Collect hot value during full statistics analysis#62295yujun777 wants to merge 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
Found 1 issue.
fe/fe-core/src/main/java/org/apache/doris/statistics/OlapAnalysisTask.java: the new full-analyze hot-value logic is only applied in the non-partition branch. Whenenable_partition_analyzeis enabled for a partitioned OLAP table,doFull()still delegates todoPartitionTable(), and that flow eventually merges partition stats withMERGE_PARTITION_TEMPLATE, which still hardcodesnull as hot_valueinBaseAnalysisTask. So the advertised behavior change is incomplete for a supported code path and full analyze still returns no hot values there. The new tests all forceenablePartitionAnalyze()tofalse, so they do not cover this case.
Critical checkpoint conclusions:
- Goal of the task: Partially met. Full analyze hot values are implemented for the direct full-scan path and external SQL path, but not for the partition-analyze path used by partitioned OLAP tables when
enable_partition_analyzeis on. Existing tests do not prove end-to-end coverage of that branch. - Minimality and clarity: The change is focused, but it misses a functionally parallel path (
doPartitionTable()/MERGE_PARTITION_TEMPLATE). - Concurrency: No new concurrency or locking risk identified in the modified code.
- Lifecycle/static initialization: No special lifecycle or static-init issue identified.
- Configuration: No new config is added, but behavior now depends on the existing
enable_partition_analyzeswitch; the change is not applied consistently across its enabled path. - Compatibility: No incompatible FE/BE protocol or storage-format change identified.
- Parallel code paths: Not fully handled; the partition-analyze full path still behaves differently from the non-partition full path.
- Special conditional checks: The partition/full branch condition is straightforward; the issue is the missing equivalent update behind the partition-enabled branch.
- Test coverage: Added unit and regression tests cover only the non-partition full path. Missing a regression/unit test with partition analyze enabled.
- Observability: No additional observability gap beyond the functional bug above.
- Transaction/persistence: No transaction or persistence issue identified in the modified areas.
- Data writes/modifications: The bug affects correctness of collected statistics (
hot_value) for partitioned full analyze, but I did not find atomicity or crash-safety issues in the changed code itself. - FE/BE variable passing: Not applicable here.
- Performance: The new query shape is heavier than the old one, but no obvious new hot-path regression beyond the intended extra aggregation.
- Other issues: None beyond the missing partition-path update above.
I did not run builds or tests in this review runner.
| doPartitionTable(); | ||
| } else { | ||
| StringSubstitutor stringSubstitutor = new StringSubstitutor(buildSqlParams()); | ||
| Map<String, String> params = buildSqlParams(); |
There was a problem hiding this comment.
This only fixes the non-partition full-analyze branch. When enable_partition_analyze is true and the table is partitioned, doFull() still goes through doPartitionTable(), which eventually merges with MERGE_PARTITION_TEMPLATE in BaseAnalysisTask and that template still hardcodes null as hot_value. In that configuration, ANALYZE TABLE ... WITH SYNC will continue to produce no hot values for partitioned OLAP tables, so the behavior change described in the PR is still incomplete. The new tests all stub enablePartitionAnalyze() to false, so they won't catch this path.
There was a problem hiding this comment.
we no handle with case enable_partition_analyze = true
|
run buildall |
…alysis ### What problem does this PR solve? Issue Number: close #xxx Problem Summary: Previously, full statistics collection (ANALYZE TABLE ... WITH SYNC) hardcoded `null as hot_value` in FULL_ANALYZE_TEMPLATE, meaning hot values (frequent values) were only collected during sample-based analysis. This made full analysis produce less complete statistics than sample analysis. This change rewrites FULL_ANALYZE_TEMPLATE to use a CTE-based structure (matching LINEAR_ANALYZE_TEMPLATE and DUJ1_ANALYZE_TEMPLATE patterns) that computes hot values via GROUP BY + TOP-N aggregation. Both OlapAnalysisTask and ExternalAnalysisTask are updated to pass the required parameters (hotValueCollectCount, subStringColName, rowCount2) to the template. ### Release note Full statistics collection now also collects hot value (frequent value) information, matching the behavior of sample-based collection. ### Check List (For Author) - Test: Regression test (test_hot_value, test_full_analyze_hot_value) / Unit Test (OlapAnalysisTaskTest, HMSAnalysisTaskTest) - Behavior changed: Yes - full analyze now produces hot_value instead of null - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What problem does this PR solve?
Issue Number: close #xxx
Problem Summary: Previously, full statistics collection (
ANALYZE TABLE ... WITH SYNC)hardcoded
null as hot_valueinFULL_ANALYZE_TEMPLATE, meaning hot values (frequentvalues) were only collected during sample-based analysis. This made full analysis
produce less complete statistics than sample analysis.
This change rewrites
FULL_ANALYZE_TEMPLATEto use a CTE-based structure (matchingLINEAR_ANALYZE_TEMPLATEandDUJ1_ANALYZE_TEMPLATEpatterns) that computes hot valuesvia GROUP BY + TOP-N aggregation. Both
OlapAnalysisTaskandExternalAnalysisTaskareupdated to pass the required parameters (
hotValueCollectCount,subStringColName,rowCount2) to the template.Release note
Full statistics collection now also collects hot value (frequent value) information,
matching the behavior of sample-based collection.
Check List (For Author)