TopN statistics are collected unnecessarily for non-skewed values on small tables

Question

terry1purcell opened this issue a month ago

Enhancement

By default, ANALYZE currently collects the top 500 values for each interesting column (an indexed column or a predicate column). For tables that are sampled, pruning logic removes TopN entries that are not skewed. For tables that are NOT sampled (smaller tables), the pruning logic is not invoked.

The example below shows values collected with a count of 1, which are not skewed.

tidb> show stats_topn;
+---------+------------+----------------+-------------+----------+------------+-------+
| Db_name | Table_name | Partition_name | Column_name | Is_index | Value      | Count |
+---------+------------+----------------+-------------+----------+------------+-------+
| test    | t2         |                | a           |        0 | 73         |    16 |
| test    | t2         |                | a           |        0 | 74         |    16 |
| test    | t2         |                | a           |        0 | 75         |    16 |
| test    | t2         |                | a           |        0 | 76         |    16 |
......
| test    | t2         |                | a           |        0 | 101        |     1 |
| test    | t2         |                | a           |        0 | 102        |     1 |
| test    | t2         |                | a           |        0 | 103        |     1 |
| test    | t2         |                | a           |        0 | 104        |     1 |
......
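
To make the requested behavior concrete, here is a minimal sketch of what pruning non-skewed TopN entries could look like. It is not TiDB's actual implementation: the function names `collect_topn` and `prune_non_skewed` are hypothetical, and the threshold (keep a value only if its count exceeds the average number of rows per distinct value) is an assumption chosen for illustration. The sample data is only loosely shaped like the output above.

```python
from collections import Counter


def collect_topn(values, n=500):
    """Return up to n (value, count) pairs, most frequent first."""
    return Counter(values).most_common(n)


def prune_non_skewed(topn, total_rows, ndv):
    """Drop TopN entries that are not skewed.

    Assumption for this sketch: a value is skewed only if its count is
    greater than the average rows per distinct value (total_rows / ndv).
    """
    if ndv == 0:
        return []
    avg_rows_per_value = total_rows / ndv
    return [(value, count) for value, count in topn if count > avg_rows_per_value]


# Hypothetical small table: 73..76 appear 16 times each, 101..104 once each.
rows = [v for v in range(73, 77) for _ in range(16)] + list(range(101, 105))

topn = collect_topn(rows)
pruned = prune_non_skewed(topn, total_rows=len(rows), ndv=len(set(rows)))

for value, count in pruned:
    print(value, count)
```

With this data the average is 68 rows over 8 distinct values, so the entries with a count of 1 are dropped and only the values with a count of 16 remain in the pruned list.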

Elsa · Answer 1 · Fri May 31 2024 11:10:42 GMT+0800 (China Standard Time)

Do we already support the TopN pruning logic for small tables (sampled tables)?