Implement ANALYZE TABLE support #20852

@jonathanc-n

Description

Is your feature request related to a problem or challenge?

The ANALYZE TABLE command is used to calculate table and column statistics. Spark's documentation can be seen here

This would allow us to do things like:

  • Accurately calculate distinct values (with the choice of counting everything exactly or using a HyperLogLog sketch)
  • Let the user recalculate statistics upfront, instead of lazily, when the existing statistics are suspected to be stale
  • Calculate statistics for MemTables
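On the first point, a minimal HyperLogLog sketch illustrates the approximate option: each value is hashed, the top bits pick a register, and each register remembers the longest run of leading zeros it has seen. All names here are illustrative; this is not DataFusion code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const B: u32 = 12;       // bits used to pick a register
const M: usize = 1 << B; // 4096 registers, ~1.6% standard error

struct HyperLogLog {
    registers: [u8; M],
}

impl HyperLogLog {
    fn new() -> Self {
        Self { registers: [0; M] }
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let x = h.finish();
        let idx = (x >> (64 - B)) as usize;        // top B bits choose a register
        let rest = x << B;                         // remaining bits
        let rank = rest.leading_zeros() as u8 + 1; // position of first 1-bit
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    fn estimate(&self) -> f64 {
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}

fn main() {
    let mut hll = HyperLogLog::new();
    for i in 0..50_000u64 {
        hll.insert(&i);
    }
    println!("estimated distinct: {:.0}", hll.estimate());
}
```

The sketch uses constant memory (one byte per register), which is what makes it attractive over exact counting for high-cardinality columns.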

Describe the solution you'd like

Something similar to Spark's implementation makes sense.

Syntax:

ANALYZE TABLE table_identifier 
  COMPUTE STATISTICS [ NOSCAN | FOR COLUMNS col [, ...] | FOR ALL COLUMNS ]

COMPUTE STATISTICS (with nothing after it): calculate only table-level stats (num_rows + total_byte_size)
NOSCAN: don't probe or scan any values; just return
FOR COLUMNS col [, ...]: table-level stats plus stats for the specified columns (min, max, null count, etc.)
FOR ALL COLUMNS: stats are calculated for every column (min, max, null count, etc.)
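A rough sketch of how these variants could be modeled in the AST, with a helper deriving which columns each variant touches. The names (`AnalyzeTable`, `ColumnSelection`, `requires_scan`) are hypothetical, not DataFusion's actual types.

```rust
// Hypothetical AST sketch for the proposed statement variants.
#[derive(Debug, Clone, PartialEq)]
enum ColumnSelection {
    None,                 // COMPUTE STATISTICS with nothing after it
    NoScan,               // COMPUTE STATISTICS NOSCAN
    Columns(Vec<String>), // FOR COLUMNS col [, ...]
    All,                  // FOR ALL COLUMNS
}

#[derive(Debug, Clone, PartialEq)]
struct AnalyzeTable {
    table: String,
    selection: ColumnSelection,
}

/// NOSCAN is the only variant that must not read the data.
fn requires_scan(stmt: &AnalyzeTable) -> bool {
    !matches!(stmt.selection, ColumnSelection::NoScan)
}

/// Which columns need per-column statistics, given the table schema.
fn columns_to_analyze<'a>(stmt: &'a AnalyzeTable, schema: &'a [String]) -> Vec<&'a str> {
    match &stmt.selection {
        ColumnSelection::None | ColumnSelection::NoScan => vec![],
        ColumnSelection::Columns(cols) => cols.iter().map(String::as_str).collect(),
        ColumnSelection::All => schema.iter().map(String::as_str).collect(),
    }
}

fn main() {
    let stmt = AnalyzeTable {
        table: "t".into(),
        selection: ColumnSelection::Columns(vec!["a".into(), "b".into()]),
    };
    println!("scan needed: {}", requires_scan(&stmt));
}
```

Keeping the column selection as a single enum makes the four syntax forms mutually exclusive by construction, which matches the grammar above.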

  • When statistics are computed for specific columns and the column count and row count are below a certain threshold, an exact distinct count can be computed directly.

  • Possibly histogram support; this will need to be discussed.

  • Possibly a way to persist statistics.
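The exact-vs-approximate choice from the first bullet could look something like this: count exactly with a hash set when the work is small, otherwise fall back to a sketch. The threshold value and function names are illustrative assumptions, not part of the proposal.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Exact distinct count via a hash set; memory grows with cardinality,
/// so it is only practical for small inputs.
fn exact_distinct<T: Eq + Hash>(values: &[T]) -> usize {
    values.iter().collect::<HashSet<_>>().len()
}

/// Hypothetical cutoff: switch to an approximate sketch (e.g. HyperLogLog)
/// once the number of scanned cells (rows x analyzed columns) gets large.
fn use_exact_distinct(num_rows: usize, num_analyzed_cols: usize) -> bool {
    const MAX_CELLS: usize = 10_000_000; // illustrative threshold
    num_rows.saturating_mul(num_analyzed_cols) <= MAX_CELLS
}

fn main() {
    let vals = [1i64, 2, 2, 3, 3, 3];
    println!("distinct: {}", exact_distinct(&vals));
    println!("exact ok for 1k x 10: {}", use_exact_distinct(1_000, 10));
}
```

Gating on rows times analyzed columns (rather than rows alone) reflects that `FOR COLUMNS` with a short column list is cheaper than `FOR ALL COLUMNS` on a wide table.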

Describe alternatives you've considered

No response

Additional context

No response

Labels

enhancement (New feature or request)