Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DBT-3 optimization #74

Open
greenlion opened this issue May 26, 2021 · 2 comments
Open

DBT-3 optimization #74

greenlion opened this issue May 26, 2021 · 2 comments
Labels
BETA2 Target for beta #2 performance Issue is related to storage engine performance

Comments

@greenlion
Copy link
Owner

Start working on the DBT-3 at scale factor 1. Problems with each query will be added to this list as comments.

This is a meta bug of sorts.

@greenlion
Copy link
Owner Author

greenlion commented May 26, 2021

Query 1

FIXED
Note: Most recent check in adds support for AVG so that this query works with parallel query!

where l_shipdate <= '1998-08-26' - interval 2 day

| -> Sort: lineitem.l_returnflag, lineitem.l_linestatus (actual time=13486.608..13486.609 rows=4 loops=1)
-> Table scan on (actual time=0.001..0.001 rows=4 loops=1)
-> Aggregate using temporary table (actual time=13486.587..13486.587 rows=4 loops=1)
-> Table scan on lineitem (cost=600121.50 rows=6001215) (actual time=101.027..10571.454 rows=5898833 loops=1)
|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (13.50 sec)

select
	l_returnflag,
	l_linestatus,
	sum(l_quantity) as sum_qty,
	sum(l_extendedprice) as sum_base_price,
	sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
	sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
	avg(l_quantity) as avg_qty,
	avg(l_extendedprice) as avg_price,
	avg(l_discount) as avg_disc,
	count(*) as count_order
from
	lineitem
where
	l_shipdate <= date_sub('1998-12-01', interval 97 day)
group by
	l_returnflag,
	l_linestatus
order by
	l_returnflag,
	l_linestatus;

Problem: ECP isn't filtering rows
-> Sort: lineitem.l_returnflag, lineitem.l_linestatus (actual time=15966.831..15966.832 rows=4 loops=1)
-> Table scan on (actual time=0.001..0.001 rows=4 loops=1)
-> Aggregate using temporary table (actual time=15966.807..15966.808 rows=4 loops=1)
-> Filter: (lineitem.l_shipDATE <= (('1998-12-01' - interval 97 day))) (cost=600121.50 rows=2000205) (actual time=87.119..12979.178 rows=5902892 loops=1)
-> Table scan on lineitem (cost=600121.50 rows=6001215) (actual time=87.116..12404.760 rows=6001215 loops=1)

Optimized version:

select
	l_returnflag,
	l_linestatus,
	sum(l_quantity) as sum_qty,
	sum(l_extendedprice) as sum_base_price,
	sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
	sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
	avg(l_quantity) as avg_qty,
	avg(l_extendedprice) as avg_price,
	avg(l_discount) as avg_disc,
	count(*) as count_order
from
	lineitem
where
	l_shipdate <= '1998-08-26'
group by
	l_returnflag,
	l_linestatus
order by
	l_returnflag,
	l_linestatus;

EXPLAIN: -> Sort: lineitem.l_returnflag, lineitem.l_linestatus (actual time=15451.847..15451.848 rows=4 loops=1)
-> Table scan on (actual time=0.001..0.001 rows=4 loops=1)
-> Aggregate using temporary table (actual time=15451.820..15451.821 rows=4 loops=1)
-> Table scan on lineitem (cost=600121.50 rows=6001215) (actual time=104.259..12324.562 rows=5902892 loops=1)

Note that the filtered number of rows is lower on the second optimized query. ECP isn't working with date_sub. Note cached evaluation in the first explain that is pushed down (not noted) in the second explain.

Also note that it is almost a FTS and so the index doesn't make any difference but it will likely in other DBT-3 queries, unless they all examine the whole lineitem table!

Turning off row visibility and transaction visibility checks shaves one second off the query. I don't think that is worth optimizing at this time. Turning off row visibility decreases count(*) query time by 2 tenths of a second (.92 with .76 without)

@greenlion
Copy link
Owner Author

Query 2

select
	s_acctbal,
	s_name,
	n_name,
	p_partkey,
	p_mfgr,
	s_address,
	s_phone,
	s_comment
from
	part,
	supplier,
	partsupp,
	nation,
	region
where
	p_partkey = ps_partkey
	and s_suppkey = ps_suppkey
	and p_size = 14
	and p_type like '%TIN'
	and s_nationkey = n_nationkey
	and n_regionkey = r_regionkey
	and r_name = 'AMERICA'
	and ps_supplycost = (
		select
			min(ps_supplycost)
		from
			partsupp,
			supplier,
			nation,
			region
		where
			p_partkey = ps_partkey
			and s_suppkey = ps_suppkey
			and s_nationkey = n_nationkey
			and n_regionkey = r_regionkey
			and r_name = 'AMERICA'
	)
order by
	s_acctbal desc,
	n_name,
	s_name,
	p_partkey
LIMIT 100;```
-> Limit: 100 row(s)  (actual time=46414.522..46414.544 rows=100 loops=1)
    -> Sort: supplier.s_acctbal DESC, nation.n_name, supplier.s_name, part.p_partkey, limit input to 100 row(s) per chunk  (actual time=46414.521..46414.540 rows=100 loops=1)
        -> Stream results  (cost=100372.45 rows=617) (actual time=161.047..46413.755 rows=155 loops=1)
            -> Inner hash join (part.p_partkey = partsupp.ps_partkey), (partsupp.ps_supplycost = (select #2))  (cost=100372.45 rows=617) (actual time=161.043..46413.163 rows=155 loops=1)
                -> Filter: ((part.p_size = 14) and (part.p_type like '%TIN'))  (cost=39.64 rows=222) (actual time=0.040..19.192 rows=806 loops=1)
                    -> Table scan on part  (cost=39.64 rows=200000) (actual time=0.017..13.420 rows=4153 loops=1)
                -> Hash
                    -> Inner hash join (supplier.s_nationkey = nation.n_nationkey), (supplier.s_suppkey = partsupp.ps_suppkey)  (cost=25026.69 rows=2500) (actual time=78.946..105.967 rows=3432 loops=1)
                        -> Table scan on supplier  (cost=2.00 rows=10000) (actual time=4.565..29.905 rows=10000 loops=1)
                        -> Hash
                            -> Inner hash join (nation.n_regionkey = region.r_regionkey)  (cost=26.69 rows=25) (actual time=53.164..58.589 rows=83060 loops=1)
                                -> Table scan on nation  (cost=0.13 rows=25) (actual time=0.392..0.473 rows=25 loops=1)
                                -> Hash
                                    -> Inner hash join (no condition)  (cost=1.68 rows=10) (actual time=41.800..51.059 rows=16612 loops=1)
                                        -> Table scan on partsupp  (cost=0.33 rows=2) (actual time=41.117..48.776 rows=16612 loops=1)
                                        -> Hash
                                            -> Table scan on region  (cost=0.55 rows=5) (actual time=0.651..0.665 rows=1 loops=1)

Runs in 44 seconds if I turn on statistics munging and make partsupp have a row count of 2.
Query does not complete without this optimization.
Query requires bitmap index join optimization.

@greenlion greenlion added BETA2 Target for beta #2 performance Issue is related to storage engine performance labels Nov 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BETA2 Target for beta #2 performance Issue is related to storage engine performance
Projects
None yet
Development

No branches or pull requests

1 participant