Description
Is your feature request related to a problem or challenge?
The ANALYZE TABLE command is used to calculate statistics. See Spark's documentation for reference.
This would allow us to do things like:
- Accurately calculate distinct values (with the choice of either counting everything or using HyperLogLog)
- Let the user recalculate statistics, and compute them upfront instead of lazily, when the existing statistics are suspected to be stale
- Calculate statistics for Memtable
Describe the solution you'd like
Something similar to Spark's implementation makes sense.
Syntax:
ANALYZE TABLE table_identifier
COMPUTE STATISTICS [ NOSCAN | FOR COLUMNS col [, ...] | FOR ALL COLUMNS ]
- COMPUTE STATISTICS (nothing else after it): Only calculate table stats (num_rows + total_byte_size)
- NOSCAN: Don't probe/scan any values - just return
- FOR COLUMNS col [, ...]: table stats + calculating specified columns stats (min, max, null, etc.)
- FOR ALL COLUMNS: All column stats are calculated (min, max, null, etc.)
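To make the four modes concrete, here is a minimal Python sketch over an in-memory table. It is an illustration only, not a proposed implementation: `analyze_table`, `column_stats`, the mode strings, and the use of `sys.getsizeof` as a stand-in for total_byte_size are all assumptions made for this example, and the NOSCAN branch assumes "just return" means no statistics are computed.

```python
import sys
from typing import Any, Dict, List, Optional, Sequence

# Hypothetical in-memory table: column name -> values, with None marking a null.
Table = Dict[str, List[Any]]


def column_stats(values: List[Any]) -> Dict[str, Any]:
    """Min, max, and null count for one column."""
    non_null = [v for v in values if v is not None]
    return {
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "null_count": len(values) - len(non_null),
    }


def analyze_table(
    table: Table,
    mode: str = "TABLE",
    columns: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """mode is one of "TABLE", "NOSCAN", "FOR_COLUMNS", "FOR_ALL_COLUMNS"."""
    if mode == "NOSCAN":
        # Assumption: no values are read, so nothing is computed.
        return {"num_rows": None, "total_byte_size": None, "columns": {}}

    num_rows = len(next(iter(table.values()), []))
    # Rough proxy for byte size; a real engine would use its own accounting.
    total_bytes = sum(sys.getsizeof(v) for vals in table.values() for v in vals)
    stats: Dict[str, Any] = {
        "num_rows": num_rows,
        "total_byte_size": total_bytes,
        "columns": {},
    }

    if mode == "FOR_ALL_COLUMNS":
        selected = list(table)
    elif mode == "FOR_COLUMNS":
        selected = list(columns or [])
    else:
        selected = []  # plain COMPUTE STATISTICS: table stats only

    for name in selected:
        stats["columns"][name] = column_stats(table[name])
    return stats


example = {"id": [1, 2, 3, 4], "score": [9.5, None, 7.0, None]}
result = analyze_table(example, "FOR_COLUMNS", ["score"])
```

Here `result["columns"]` holds stats for `score` only, while the table-level row count is always included unless NOSCAN is used.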
- When statistics are requested for specific columns, and the number of columns and the row count are both under some threshold, the distinct count can be computed directly.
- Possibly histogram support; this would need to be discussed.
- Possibly a way to persist statistics.
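The choice between a direct distinct count and a HyperLogLog estimate could look roughly like the sketch below. The thresholds `MAX_EXACT_COLUMNS` and `MAX_EXACT_ROWS` and all function names are hypothetical (the request only says "a certain amount"), and the HyperLogLog here is a simplified textbook version that hashes each value's `repr`, not the estimator any particular engine uses.

```python
import hashlib
import math
from typing import Any, Dict, Iterable, List, Sequence

# Hypothetical thresholds, chosen only for illustration.
MAX_EXACT_COLUMNS = 8
MAX_EXACT_ROWS = 100_000


def exact_distinct(values: Iterable[Any]) -> int:
    """Direct computation: count unique non-null values."""
    return len({v for v in values if v is not None})


def hll_distinct(values: Iterable[Any], p: int = 12) -> int:
    """Simplified HyperLogLog estimate with 2**p registers (requires p >= 7)."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        if v is None:
            continue
        digest = hashlib.blake2b(repr(v).encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")          # 64-bit hash
        idx = h >> (64 - p)                        # top p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1    # leading zeros + 1
        if rank > registers[idx]:
            registers[idx] = rank
    alpha = 0.7213 / (1 + 1.079 / m)               # bias constant for m >= 128
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:
        # Small-range correction (linear counting).
        estimate = m * math.log(m / zeros)
    return round(estimate)


def distinct_counts(table: Dict[str, List[Any]], columns: Sequence[str]) -> Dict[str, int]:
    """Use the exact count for small requests, HyperLogLog otherwise."""
    num_rows = len(next(iter(table.values()), []))
    use_exact = len(columns) <= MAX_EXACT_COLUMNS and num_rows <= MAX_EXACT_ROWS
    count = exact_distinct if use_exact else hll_distinct
    return {name: count(table[name]) for name in columns}


small = {"a": [1, 1, 2, None, 3]}
small_counts = distinct_counts(small, ["a"])
approx = hll_distinct(range(10_000))
```

With 4096 registers the estimate typically lands within a few percent of the true count, which is the accuracy trade-off made in exchange for fixed memory use.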
Describe alternatives you've considered
No response
Additional context
No response