PostgreSQL 索引推荐 - HypoPG , pg_qualstats

postgres=# create table st1(id int);    
CREATE TABLE    
postgres=# create table st2(id int);    
CREATE TABLE    
postgres=# insert into st1 select generate_series(1,1000000);    
INSERT 0 1000000    
postgres=# insert into st2 select generate_series(1,1000000);    
INSERT 0 1000000    
postgres=# select * from st1 where id=1;    
 id     
----    
  1    
(1 row)    
    
postgres=# select * from st2 limit 1;    
 id     
----    
  1    
(1 row)

st1 全表扫描一次，扫描了所有记录来过滤where id=1

st2 全表扫描一次，但是只扫描了1条记录，因为条件是limit 1

postgres=# select * from pg_stat_all_tables where relname='st1';    
-[ RECORD 1 ]-------+------------------------------    
relid               | 21423    
schemaname          | public    
relname             | st1    
seq_scan            | 1    
seq_tup_read        | 1000000    
idx_scan            |     
idx_tup_fetch       |     
n_tup_ins           | 1000000    
n_tup_upd           | 0    
n_tup_del           | 0    
n_tup_hot_upd       | 0    
n_live_tup          | 1000000    
n_dead_tup          | 0    
n_mod_since_analyze | 0    
last_vacuum         |     
last_autovacuum     |     
last_analyze        |     
last_autoanalyze    | 2019-08-04 10:19:45.443907+08    
vacuum_count        | 0    
autovacuum_count    | 0    
analyze_count       | 0    
autoanalyze_count   | 1    
    
postgres=# select * from pg_stat_all_tables where relname='st2';    
-[ RECORD 1 ]-------+------------------------------    
relid               | 21426    
schemaname          | public    
relname             | st2    
seq_scan            | 1    
seq_tup_read        | 1    
idx_scan            |     
idx_tup_fetch       |     
n_tup_ins           | 1000000    
n_tup_upd           | 0    
n_tup_del           | 0    
n_tup_hot_upd       | 0    
n_live_tup          | 1000000    
n_dead_tup          | 0    
n_mod_since_analyze | 0    
last_vacuum         |     
last_autovacuum     |     
last_analyze        |     
last_autoanalyze    | 2019-08-04 10:19:45.522465+08    
vacuum_count        | 0    
autovacuum_count    | 0    
analyze_count       | 0    
autoanalyze_count   | 1

所以st1有问题，st2没有问题。

2、索引扫描，平均每次扫描的记录多，占所有记录的比例高，说明用错了索引、或者random_page_cost参数不对（太低）、或者禁止了全表扫描、或者使用hint强制使用了索引、或者使用了绑定变量数据倾斜导致使用了错误的plan。等。

postgres=# create index idx_st1 on st1(id);    
CREATE INDEX    
postgres=# create index idx_st2 on st2(id);    
CREATE INDEX

st1使用全表扫描，过滤100万记录。

postgres=# explain select count(*) from st1 where id<1000000;    
-[ RECORD 1 ]----------------------------------------------------------------    
QUERY PLAN | Aggregate  (cost=19425.00..19425.01 rows=1 width=8)    
-[ RECORD 2 ]----------------------------------------------------------------    
QUERY PLAN |   ->  Seq Scan on st1  (cost=0.00..16925.00 rows=999999 width=0)    
-[ RECORD 3 ]----------------------------------------------------------------    
QUERY PLAN |         Filter: (id < 1000000)    
    
postgres=# select count(*) from st1 where id<1000000;    
-[ RECORD 1 ]-    
count | 999999

st2使用索引扫描，过滤100万记录，但是它是离散扫描。

postgres=# set enable_seqscan=off;    
SET    
postgres=# explain select count(*) from st2 where id<1000000;    
-[ RECORD 1 ]-------------------------------------------------------------------------------------    
QUERY PLAN | Aggregate  (cost=27445.01..27445.01 rows=1 width=8)    
-[ RECORD 2 ]-------------------------------------------------------------------------------------    
QUERY PLAN |   ->  Index Only Scan using idx_st2 on st2  (cost=0.42..24945.01 rows=999999 width=0)    
-[ RECORD 3 ]-------------------------------------------------------------------------------------    
QUERY PLAN |         Index Cond: (id < 1000000)    
    
postgres=# select count(*) from st2 where id<1000000;    
-[ RECORD 1 ]-    
count | 999999

从统计信息可以看到st2的索引扫描，扫描的记录太多，说明索引过滤下不好，这个索引创建或使用的有问题。

postgres=# select * from pg_stat_all_tables where relname='st1';    
-[ RECORD 1 ]-------+------------------------------    
relid               | 21423    
schemaname          | public    
relname             | st1    
seq_scan            | 5    
seq_tup_read        | 3000000    
idx_scan            | 2    
idx_tup_fetch       | 2    
n_tup_ins           | 1000000    
n_tup_upd           | 0    
n_tup_del           | 0    
n_tup_hot_upd       | 0    
n_live_tup          | 1000000    
n_dead_tup          | 0    
n_mod_since_analyze | 0    
last_vacuum         |     
last_autovacuum     |     
last_analyze        |     
last_autoanalyze    | 2019-08-04 10:19:45.443907+08    
vacuum_count        | 0    
autovacuum_count    | 0    
analyze_count       | 0    
autoanalyze_count   | 1    
    
postgres=# select * from pg_stat_all_tables where relname='st2';    
-[ RECORD 1 ]-------+------------------------------    
relid               | 21426    
schemaname          | public    
relname             | st2    
seq_scan            | 4    
seq_tup_read        | 1000001    
idx_scan            | 3    
idx_tup_fetch       | 1000001    
n_tup_ins           | 1000000    
n_tup_upd           | 0    
n_tup_del           | 0    
n_tup_hot_upd       | 0    
n_live_tup          | 1000000    
n_dead_tup          | 0    
n_mod_since_analyze | 0    
last_vacuum         |     
last_autovacuum     |     
last_analyze        |     
last_autoanalyze    | 2019-08-04 10:19:45.522465+08    
vacuum_count        | 0    
autovacuum_count    | 0    
analyze_count       | 0    
autoanalyze_count   | 1

实现

了解原理之后，我们会发现还少了点东西：

1、我们并不知道数据库的哪些SQL是TOP SQL ，因为解决TOP SQL的话，性能提速最明显。

2、我们只知道表有性能问题，并不知道是哪些SQL，哪些过滤条件引起的。

3、即使创建了索引，我们还需要知道性能提升有多少。

在PostgreSQL 生态中，有很多插件来支撑：

1、TOP SQL：pg_stat_statements

2、pg_qualstats，SQL的真实过滤性、选择性统计(where, join)，用于判断是否需要索引

3、HypoPG，虚拟索引（用于索引推荐），通过虚拟执行计划，对比后可以知道性能提升有多少。

有了这几个东西就好办了。

参考如下文档

https://www.percona.com/blog/2019/07/22/automatic-index-recommendations-in-postgresql-using-pg_qualstats-and-hypopg/

In my previous blog post, we have seen how the extension hypopg can be helpful in creating hypothetical indexes in PostgreSQL. If you have read it already, you know that it is very useful in verifying whether an Index can improve the performance of an SQL without having to create it in reality. Considering this, is there also a way to automatically suggest which indexes can improve the performance of some of your SQL’s? The answer is… YES!

In this blog post, we will see how we can get a direct recommendation to improve a specific query, as seen in the following snippet.

                       query                       |                 recmnded_index                 | percent_improvd     
---------------------------------------------------+------------------------------------------------+-----------------    
 select * from foo.bar where id2 = $1 and id4 = $2 | CREATE INDEX ON foo.bar USING btree (id2, id4) |           99.96    
 select * from foo.bar where id3 = $1              | CREATE INDEX ON foo.bar USING btree (id3)      |           99.93    
(2 rows)

Without any further delay, let’s discuss the extension pg_qualstats which enables us to achieve this requirement for PostgreSQL versions 9.4 or later. Following that, we will take a look at a logic which could automatically suggest what indexes would be helpful for query optimizations – without much manual work.

All of the logic discussed in this blog post is reproducible, so please feel free to do so using the commands and try tuning your custom queries in test environments.

pg_qualstats

pg_qualstats is an extension developed by the POWA Team to uncover the need for storing predicates (quals). It stores the predicates found in WHERE clauses and JOIN conditions. This helps us analyze a query execution and opens up the possibility of automatic query optimizations.

When we query pg_stat_statements like extensions, we only see the prepared SQL or the SQL without any bind variable values. As you cannot perform an EXPLAIN (to see the execution plan of a Query) without the quals, the only option available is to see if that query was logged in the PostgreSQL log file and then identify the parameters passed to it. Or you can maybe use some arbitrary values, but this requires some manual intervention and time. But, when you create this extension: pg_qualstats, it stores queries along with the actual quals based on the sample rate ( pg_qualstats.sample_rate ) specified.

Creating the extension: pg_qualstats

For RedHat/CentOS, we can install it using the packages available in the PGDG repository. Once you have added the PGDG repo, simply run the following command:

# yum install pg_qualstats11

Similarly for Ubuntu/Debian:

# apt install postgresql-11-pg-qualstats

Once installed, you must add pg_qualstats to shared_preload_libraries . This requires a restart. As I am also using pg_stat_statements to get the queryid associated with each query recorded by pg_qualstats, I have the following setting in my postgresql.conf file:

shared_preload_libraries = 'pg_stat_statements, pg_qualstats'

Modifying the above parameter requires a restart of your PostgreSQL instance.

Some GUCs you should know

In order to start taking advantage of this extension, you may have to set some of the GUCs (Grand Unified Configuration) in your PostgreSQL server. These can be set using ALTER SYSTEM or by manually adding the associated entries in your postgresql.conf or postgresql.auto.conf files.

pg_qualstats.enabled : true or false (to enable or to disable pg_qualstats ). ON by default.

pg_qualstats.track_constants : true or false (to enable tracking of each constant. False would reduce the number of entries to track predicates.)

pg_qualstats.max : The number of queries tracked. Defaults to 1000.

pg_qualstats.resolve_oids : Just store the oids or resolve them and store at query time. This takes additional space.

pg_qualstats.track_pg_catalog : Defaults to false. Whether or not the predicates of the objects in pg_catalog schema should be computed.

pg_qualstats.sample_rate : Default is -1. The fraction of queries to be sampled. -1 defaults to (1/max_connections). When set to 1, everything is sampled. Similarly when set to 0.1, one out of 10 queries are sampled.

Quals and Query Examples

In order to see this in action, let us use sysbench-tpcc to generate some SQL traffic, and then see some of the details captured.

Before running sysbench-tpcc , I have created all the required extensions as seen in the following log. In order to see the queryid (same as the queryid column of pg_stat_statements ) associated with each qual captured, it is important to have the extension: pg_stat_statements created. Similarly, to create hypothetical indexes, we need to have the extension: hypopg created.

percona=# CREATE EXTENSION hypopg;    
CREATE EXTENSION    
percona=# CREATE EXTENSION pg_stat_statements ;    
CREATE EXTENSION    
percona=# CREATE EXTENSION pg_qualstats;    
CREATE EXTENSION    
percona=# \dx    
                                     List of installed extensions    
        Name        | Version |   Schema   |                        Description                            
--------------------+---------+------------+-----------------------------------------------------------    
 hypopg             | 1.1.3   | public     | Hypothetical indexes for PostgreSQL    
 pg_qualstats       | 1.0.8   | public     | An extension collecting statistics about quals    
 pg_stat_statements | 1.6     | public     | track execution statistics of all SQL statements executed    
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language    
(4 rows)    
    
percona=# show shared_preload_libraries ;    
     shared_preload_libraries         
----------------------------------    
 pg_stat_statements, pg_qualstats    
(1 row)

For this test, I have set pg_qualstats.sample_rate to 1. Setting this to 1 captured every qual. We can, of course, reduce the sampling to ensure not everything is captured.

$ psql -d percona -c "ALTER SYSTEM SET pg_qualstats.sample_rate TO 1"     
ALTER SYSTEM    
    
$ psql -c "select pg_reload_conf()"    
 pg_reload_conf     
----------------    
 t    
(1 row)

Installing and running sysbench-tpcc on RedHat/CentOS

$ sudo yum install https://repo.percona.com/yum/percona-release-latest.noarch.rpm    
$ sudo yum install git sysbench    
$ git clone https://github.com/Percona-Lab/sysbench-tpcc.git    
$ cd sysbench-tpcc    
$ ./tpcc.lua --pgsql-user=postgres --pgsql-port=5432 --pgsql-db=percona --time=10 --threads=2 --report-interval=1 --tables=2 --scale=2 --use_fk=0  --trx_level=RC --db-driver=pgsql prepare    
$ ./tpcc.lua --pgsql-user=postgres --pgsql-port=5432 --pgsql-db=percona --time=10 --threads=2 --report-interval=1 --tables=2 --scale=2 --use_fk=0  --trx_level=RC --db-driver=pgsql run

After running the benchmark for 10 seconds, we are now ready to query the view: pg_qualstats_indexes that gives us a nice view of the columns on which there are no indexes.

percona=# select * from pg_qualstats_indexes;    
    relid    |   attnames   |      possible_types      | execution_count     
-------------+--------------+--------------------------+-----------------    
 customer2   | {c_id}       | {brin,btree,hash}        |               4    
 customer2   | {c_last}     | {brin,btree,hash,spgist} |              33    
 customer1   | {c_id}       | {brin,btree,hash}        |               7    
 customer1   | {c_last}     | {brin,btree,hash,spgist} |               8    
 orders2     | {o_c_id}     | {brin,btree,hash}        |               2    
 order_line1 | {ol_o_id}    | {brin,btree}             |             208    
 order_line2 | {ol_o_id}    | {brin,btree}             |             202    
 order_line2 | {ol_o_id}    | {brin,btree,hash}        |              20    
 stock1      | {s_quantity} | {brin,btree}             |             208    
 stock2      | {s_quantity} | {brin,btree}             |             202    
(10 rows)

Automatic Index Recommendations

In the previous section, we have seen the columns (of tables used in some of the SQLs in WHERE condition) which do not have indexes on them. Indexes on these columns can be considered as recommended indexes. To make this more meaningful, I have written the following function to store the queries that are using these columns as predicates, along with their execution plans before and after creating the recommended index. Using this data, we can understand whether the recommended index is really helpful.

CREATE OR REPLACE FUNCTION find_usable_indexes()    
RETURNS VOID AS    
$$    
DECLARE    
    l_queries     record;    
    l_querytext     text;    
    l_idx_def       text;    
    l_bef_exp       text;    
    l_after_exp     text;    
    hypo_idx      record;    
    l_attr        record;    
    /* l_err       int; */    
BEGIN    
    CREATE TABLE IF NOT EXISTS public.idx_recommendations (queryid bigint,     
    query text, current_plan jsonb, recmnded_index text, hypo_plan jsonb);    
    FOR l_queries IN    
    SELECT t.relid, t.relname, t.queryid, t.attnames, t.attnums,     
    pg_qualstats_example_query(t.queryid) as query    
      FROM     
        (     
         SELECT qs.relid::regclass AS relname, qs.relid AS relid, qs.queryid,     
         string_agg(DISTINCT attnames.attnames,',') AS attnames, qs.attnums    
         FROM pg_qualstats_all qs    
         JOIN pg_qualstats q ON q.queryid = qs.queryid    
         JOIN pg_stat_statements ps ON q.queryid = ps.queryid    
         JOIN pg_amop amop ON amop.amopopr = qs.opno    
         JOIN pg_am ON amop.amopmethod = pg_am.oid,    
         LATERAL     
              (     
               SELECT pg_attribute.attname AS attnames    
               FROM pg_attribute    
               JOIN unnest(qs.attnums) a(a) ON a.a = pg_attribute.attnum     
               AND pg_attribute.attrelid = qs.relid    
               ORDER BY pg_attribute.attnum) attnames,         
         LATERAL unnest(qs.attnums) attnum(attnum)    
               WHERE NOT     
               (    
                EXISTS     
                      (     
                       SELECT 1    
                       FROM pg_index i    
                       WHERE i.indrelid = qs.relid AND     
                       (arraycontains((i.indkey::integer[])[0:array_length(qs.attnums, 1) - 1],     
                        qs.attnums::integer[]) OR arraycontains(qs.attnums::integer[],     
                        (i.indkey::integer[])[0:array_length(i.indkey, 1) + 1]) AND i.indisunique)))    
                       GROUP BY qs.relid, qs.queryid, qs.qualnodeid, qs.attnums) t    
                       GROUP BY t.relid, t.relname, t.queryid, t.attnames, t.attnums                       
    LOOP    
        /* RAISE NOTICE '% : is queryid',l_queries.queryid; */    
        execute 'explain (FORMAT JSON) '||l_queries.query INTO l_bef_exp;    
        execute 'select hypopg_reset()';    
        execute 'SELECT indexrelid,indexname FROM hypopg_create_index(''CREATE INDEX on '||l_queries.relname||'('||l_queries.attnames||')'')' INTO hypo_idx;          
        execute 'explain (FORMAT JSON) '||l_queries.query INTO l_after_exp;    
        execute 'select hypopg_get_indexdef('||hypo_idx.indexrelid||')' INTO l_idx_def;    
        INSERT INTO public.idx_recommendations (queryid,query,current_plan,recmnded_index,hypo_plan)     
        VALUES (l_queries.queryid,l_querytext,l_bef_exp::jsonb,l_idx_def,l_after_exp::jsonb);            
    END LOOP;        
        execute 'select hypopg_reset()';    
END;    
$$ LANGUAGE plpgsql;

The above function uses the following logic:

Create a Table with name: public.idx_recommendations where the results are stored. It stores the queries on which the table and column names mentioned in the output of pg_qualstats_indexes are used as predicates, along with their execution plan before and after creating the hypothethical indexes .
Get the list of Queries (candidates for query tuning) along with their queryid and the attributes on which an index is recommended for each query. The SQL in the above FOR LOOP is built using a slight modification to the existing view: pg_qualstats_indexes

 relid |   relname   |  queryid   |  attnames  | attnums |                   query                        
-------+-------------+------------+------------+---------+--------------------------------------------    
 17725 | customer2   |  297872607 | c_id       | {1}     | UPDATE customer2 ....    
 17725 | customer2   |  702831032 | c_id       | {1}     | UPDATE customer2 ....    
 17725 | customer2   | 1509701064 | c_last     | {6}     | SELECT count(c_id) namecnt ....    
 17725 | customer2   | 1539164311 | c_id       | {1}     | SELECT c_discount, c_last, c_credit, w_t    
 17725 | customer2   | 1976730265 | c_last     | {6}     | SELECT c_id FROM customer2 ......    
 17725 | customer2   | 2041891134 | c_id       | {1}     | SELECT c_first, c_middle, c_last, c_stre..    
 17728 | customer1   |  850567043 | c_id       | {1}     | SELECT c_first, c_middle, c_last, c_stre..    
 17728 | customer1   |  977223112 | c_last     | {6}     | SELECT count(c_id) namecnt ....    
 17728 | customer1   | 2618115206 | c_id       | {1}     | SELECT c_discount, c_last, c_credit, w_t..    
 17728 | customer1   | 2965820579 | c_last     | {6}     | SELECT c_id FROM customer1 ....    
 17728 | customer1   | 3738483437 | c_id       | {1}     | UPDATE customer1 ....    
 17752 | orders2     | 2217779140 | o_id       | {1}     | SELECT o_c_id ....    
 17752 | orders2     | 2709941400 | o_id       | {1}     | UPDATE orders2 ....    
 17762 | new_orders2 | 3012757930 | no_o_id    | {1}     | DELETE FROM new_orders2 ....    
 17771 | order_line2 |  192474146 | ol_o_id    | {1}     | SELECT COUNT(DISTINCT (s_i_id)) ....    
 17771 | order_line2 |  313117619 | ol_o_id    | {1}     | UPDATE order_line2 ....    
 17771 | order_line2 | 2921207531 | ol_o_id    | {1}     | SELECT SUM(ol_amount) sm ....    
 17782 | stock2      |  192474146 | s_quantity | {3}     | SELECT COUNT(DISTINCT (s_i_id)) ....    
(18 rows)

An example query with predicates can be obtained using the function : pg_qualstats_example_query() provided by pg_qualstats.

percona=# select pg_qualstats_example_query(297872607);    
                          pg_qualstats_example_query                               
-------------------------------------------------------------------------------    
 UPDATE customer2                                                             +    
                         SET c_balance=-1576.000000, c_ytd_payment=1576.000000+    
                       WHERE c_w_id = 2                                       +    
                         AND c_d_id=9                                         +    
                         AND c_id=900    
(1 row)

Run EXPLAIN on the example query to store it in the table: public.idx_recommendations.
Use hypopg to create a hypothetical index on the attributes mentioned as columns without an index on them.
Run EXPLAIN on the query again and use hypopg_reset() to drop the hypothetical index created.
When I have validated if any of the queries generated by sysbench-tpcc need some tuning using indexing, it was not a surprise that none of those need any further indexing. Hence, for the purpose of a demo, I have created a table and ran a few queries as following.

percona=# CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5 int,id6 int,id7 int,details text, zipcode int);    
CREATE TABLE    
percona=# INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() * 1000000)::int, (random() * 1000000)::int,    
(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,    
md5(g::text), floor(random()* (20000-9999 + 1) + 9999)     
FROM generate_series(1,1*1e6) g;    
INSERT 0 1000000    
percona=# CREATE INDEX idx_btree_bar ON foo.bar (id, dept, id2,id3,id4,id5,id6,zipcode);    
CREATE INDEX    
percona=# select * from foo.bar where id2 = 1 and id4 = 3;    
 id | dept | id2 | id3 | id4 | id5 | id6 | id7 | details | zipcode     
----+------+-----+-----+-----+-----+-----+-----+---------+---------    
(0 rows)    
percona=# select * from foo.bar where id3 = 3;    
   id   |  dept  |  id2   | id3 |  id4   | id5  |  id6   |  id7   |             details              | zipcode     
--------+--------+--------+-----+--------+------+--------+--------+----------------------------------+---------    
 876920 | 723557 | 670210 |   3 | 321924 | 9512 | 183549 | 756279 | 3e780bae1aacbebc10b1e06ca49d226e |   16364    
(1 row)

Now, let us check if we can find the indexes that could improve the SQL COST of execution in reality. As I have mentioned earlier, none of the queries run by sysbench-tpcc needed any further improvement through indexing, so we see the improvement only for the 2 select statements I ran above.

percona=# select find_usable_indexes();    
 find_usable_indexes     
---------------------    
     
(1 row)    
    
percona=# select queryid, current_plan->0->'Plan'->>'Total Cost' as "cost_without_index",     
hypo_plan->0->'Plan'->>'Total Cost' as "cost_with_index",     
round((((current_plan->0->'Plan'->>'Total Cost')::numeric-(hypo_plan->0->'Plan'->>'Total Cost')::numeric)*100/(current_plan->0->'Plan'->>'Total Cost')::numeric),2) as percent_improvd     
FROM idx_recommendations order by 4 desc;    
  queryid   | cost_without_index | cost_with_index | percent_improvd     
------------+--------------------+-----------------+-----------------    
  612880084 | 27346.00           | 8.07            |           99.97    
 1669974955 | 24846.00           | 12.08           |           99.95    
 1539164311 | 9.35               | 9.35            |            0.00    
 1976730265 | 15.37              | 15.37           |            0.00    
 2041891134 | 8.32               | 8.32            |            0.00    
 2750481779 | 8.31               | 8.31            |            0.00    
  850567043 | 8.32               | 8.32            |            0.00    
  946492786 | 8.32               | 8.32            |            0.00    
 2618115206 | 9.35               | 9.35            |            0.00    
  297872607 | 8.31               | 8.31            |            0.00    
 1170842645 | 8.31               | 8.31            |            0.00    
 1210386388 | 8.31               | 8.31            |            0.00    
 1101690903 | 39.52              | 39.52           |            0.00    
 4203200359 | 39.51              | 39.51           |            0.00    
  192474146 | 2289.00            | 2289.00         |            0.00    
  192474146 | 2289.00            | 2289.00         |            0.00    
 3738483437 | 8.31               | 8.31            |            0.00    
 1509701064 | 15.39              | 15.39           |            0.00    
(18 rows)    
    
percona=# select b.query, a.recmnded_index,round((((a.current_plan->0->'Plan'->>'Total Cost')::numeric-(hypo_plan->0->'Plan'->>'Total Cost')::numeric)*100/(a.current_plan->0->'Plan'->>'Total Cost')::numeric),2) as percent_improvd FROM idx_recommendations a JOIN pg_stat_statements b ON a.queryid = b.queryid WHERE round((((current_plan->0->'Plan'->>'Total Cost')::numeric-(hypo_plan->0->'Plan'->>'Total Cost')::numeric)*100/(current_plan->0->'Plan'->>'Total Cost')::numeric),2) > 0 order by 3 desc ;    
                       query                       |                 recmnded_index                 | percent_improvd     
---------------------------------------------------+------------------------------------------------+-----------------    
 select * from foo.bar where id2 = $1 and id4 = $2 | CREATE INDEX ON foo.bar USING btree (id2, id4) |           99.96    
 select * from foo.bar where id3 = $1              | CREATE INDEX ON foo.bar USING btree (id3)      |           99.93    
(2 rows)

As this function is storing the results into a table: public.idx_recommendations , we can query that and see the hypothetical index that has improved the total cost of that query.

percona=# select * from idx_recommendations WHERE queryid IN (612880084,1669974955);    
-[ RECORD 1 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------    
queryid        | 1669974955    
query          |     
current_plan   | [{"Plan": {"Alias": "bar", "Filter": "(id3 = 3)", "Node Type": "Seq Scan", "Plan Rows": 2, "Plan Width": 69, "Total Cost": 24846.00, "Startup Cost": 0.00, "Relation Name": "bar", "Parallel Aware": false}}]    
recmnded_index | CREATE INDEX ON foo.bar USING btree (id3)    
hypo_plan      | [{"Plan": {"Alias": "bar", "Node Type": "Index Scan", "Plan Rows": 2, "Index Cond": "(id3 = 3)", "Index Name": "<18208>btree_foo_bar_id3", "Plan Width": 69, "Total Cost": 12.08, "Startup Cost": 0.05, "Relation Name": "bar", "Parallel Aware": false, "Scan Direction": "Forward"}}]    
-[ RECORD 2 ]--+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------    
queryid        | 612880084    
query          |     
current_plan   | [{"Plan": {"Alias": "bar", "Filter": "((id2 = 1) AND (id4 = 3))", "Node Type": "Seq Scan", "Plan Rows": 1, "Plan Width": 69, "Total Cost": 27346.00, "Startup Cost": 0.00, "Relation Name": "bar", "Parallel Aware": false}}]    
recmnded_index | CREATE INDEX ON foo.bar USING btree (id2, id4)    
hypo_plan      | [{"Plan": {"Alias": "bar", "Node Type": "Index Scan", "Plan Rows": 1, "Index Cond": "((id2 = 1) AND (id4 = 3))", "Index Name": "<18207>btree_foo_bar_id2_id4", "Plan Width": 69, "Total Cost": 8.07, "Startup Cost": 0.05, "Relation Name": "bar", "Parallel Aware": false, "Scan Direction": "Forward"}}]

Conclusion

With this experiment, we see that we can use hypopg and pg_qualstats to automate index recommendations. The automation logic is currently limited to B-Tree Indexes only. Though it has a very negligible impact on performance through some minimalistic resource consumption, it can be considered by developers while coding an application logic. Developers could easily enable sampling for each query and see what indexes can be used to improve which query, and then implement the changes in Production. The function logic I have created above is just an experiment in automatic index recommendations and you may re-use the same upon additional testing. Special thanks again to POWA Team who have contributed to the very useful extension pg_qualstats in PostgreSQL.

参考

《PostgreSQL SQL自动优化案例 - 极简，自动推荐索引》

《PostgreSQL 虚拟|虚假索引(hypothetical index) - HypoPG》

《powa4 PostreSQL Workload Analyzer - PostgreSQL监控工具、带WEB展示》

《PostgreSQL 商用版本EPAS(阿里云ppas(Oracle 兼容版)) 索引推荐功能使用》

《PostgreSQL index include - 类聚簇表与应用(append only, IoT时空轨迹, 离散多行扫描与返回)》

《PostgreSQL 空间st_contains，st_within空间包含搜索优化 - 降IO和降CPU(bound box) (多边形GiST优化)》

《HTAP数据库 PostgreSQL 场景与性能测试之 47 - (OLTP) 空间应用 - 高并发空间位置更新、多属性KNN搜索并测（含空间索引）末端配送、新零售类项目》

《PostgreSQL Oracle 兼容性之 - performance insight - AWS performance insight 理念与实现解读 - 珍藏级》

https://www.percona.com/blog/2019/07/22/automatic-index-recommendations-in-postgresql-using-pg_qualstats-and-hypopg/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

20190804_02.md

20190804_02.md

PostgreSQL 索引推荐 - HypoPG , pg_qualstats

作者

日期

标签

背景

例子

原理

实现

参考如下文档

pg_qualstats

Creating the extension: pg_qualstats

Some GUCs you should know

Quals and Query Examples

Installing and running sysbench-tpcc on RedHat/CentOS

Automatic Index Recommendations

The above function uses the following logic:

Conclusion

参考

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

PolarDB 学习图谱: 训练营、培训认证、在线互动实验、解决方案、生态合作、写心得拿奖品

购买PolarDB云服务折扣活动进行中, 55元起

About 德哥

Files

20190804_02.md

Latest commit

History

20190804_02.md

File metadata and controls

PostgreSQL 索引推荐 - HypoPG , pg_qualstats

作者

日期

标签

背景

例子

原理

实现

参考如下文档

pg_qualstats

Creating the extension: pg_qualstats

Some GUCs you should know

Quals and Query Examples

Installing and running sysbench-tpcc on RedHat/CentOS

Automatic Index Recommendations

The above function uses the following logic:

Conclusion

参考

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

PolarDB 学习图谱: 训练营、培训认证、在线互动实验、解决方案、生态合作、写心得拿奖品

购买PolarDB云服务折扣活动进行中, 55元起

About 德哥