Skip to content
This repository has been archived by the owner on Jan 29, 2020. It is now read-only.

PostgresqlMetadata works very-very slowly #199

Open
Sohorev opened this issue Dec 7, 2016 · 12 comments · May be fixed by #321
Open

PostgresqlMetadata works very-very slowly #199

Sohorev opened this issue Dec 7, 2016 · 12 comments · May be fixed by #321

Comments

@Sohorev
Copy link

Sohorev commented Dec 7, 2016

in the class Zend\Db\Metadata\Source\PostgresqlMetadata method loadTableNameData works very slowly.
It generate query:

SELECT "t"."table_name", "t"."table_type", "v"."view_definition", "v"."check_option", "v"."is_updatable" 
FROM "information_schema"."tables" t 
LEFT JOIN "information_schema"."views" v ON "t"."table_schema" = "v"."table_schema" AND "t"."table_name" = "v"."table_name" 
WHERE "t"."table_type" IN ('BASE TABLE', 'VIEW') AND "t"."table_schema" = 'public'

query result: 438 rows 2160 ms!
Why make a selection from the database all the tables when we need all the data on a single table?
Perhaps you need to cache?

@alextech
Copy link
Contributor

alextech commented Dec 7, 2016

Mine took 5.8 seconds on 239 tables :D (used jira database as test)

Indeed, this is a problem if used to find details for a single table. But for that, general getTables() is not justified. If you use

$table = new TableGateway('tableName', new MetadataFeature());

Then resultant queries have WHERE "t"."table_name" = 'tableName' injected which cuts down on execution time pretty well to 20-40ms (although still has large execution plan). Not ideal if you want to make all table gateways dynamic.

Or do you want to use this for a different purpose?

I would be very reluctant to add caching at framework level for this. Some uses, such as db migrations or admin tools in general need to have most up to date database info. It should be up to me in my application logic to dictate what is safe to assume will not change. Improved API that narrow's down query size would be better.

So I will ask if the forthmentioned usage through table gateway, which minimized the query down to a specific table and allows you to get info about it from table gateway object, is enough for you. Or if you need to get metadata specifically through $meta = new Metadata($adapter); $meta->getTable('tablename'); can think about improving current API to let user ask same questions as table gateway can (which are currently hidden in protected methods).

@Sohorev
Copy link
Author

Sohorev commented Dec 8, 2016

Because in my project does not use db view I solved the problem simply (custom PostgresqlMetadata):

class PostgresqlMetadata extends \Zend\Db\Metadata\Source\PostgresqlMetadata
{
    /**
     * {@inheritdoc}
     */
    public function getTable($tableName, $schema = null)
    {
        $table = new \Zend\Db\Metadata\Object\TableObject($tableName);
        $table->setColumns($this->getColumns($tableName, $schema));
        $table->setConstraints($this->getConstraints($tableName, $schema));
        return $table;
    }
}

@alextech
Copy link
Contributor

alextech commented Dec 8, 2016

That is a clever solution to one off usage like this (with no tablegateway) . Thank you for sharing!

@Sohorev
Copy link
Author

Sohorev commented Dec 20, 2016

PostgresqlMetadata#loadConstraintData generate query:

SELECT "t"."table_name", "tc"."constraint_name", "tc"."constraint_type", "kcu"."column_name", "cc"."check_clause", "rc"."match_option", "rc"."update_rule", "rc"."delete_rule", "kcu2"."table_schema" "referenced_table_schema", "kcu2"."table_name" "referenced_table_name", "kcu2"."column_name" "referenced_column_name" 
FROM "information_schema"."tables" t 
INNER JOIN "information_schema"."table_constraints" tc ON "t"."table_schema" = "tc"."table_schema" AND "t"."table_name" = "tc"."table_name" 
LEFT JOIN "information_schema"."key_column_usage" kcu ON "tc"."table_schema" = "kcu"."table_schema" AND "tc"."table_name" = "kcu"."table_name" AND "tc"."constraint_name" = "kcu"."constraint_name" 
LEFT JOIN "information_schema"."check_constraints" cc ON "tc"."constraint_schema" = "cc"."constraint_schema" AND "tc"."constraint_name" = "cc"."constraint_name" 
LEFT JOIN "information_schema"."referential_constraints" rc ON "tc"."constraint_schema" = "rc"."constraint_schema" AND "tc"."constraint_name" = "rc"."constraint_name" LEFT JOIN "information_schema"."key_column_usage" kcu2 ON "rc"."unique_constraint_schema" = "kcu2"."constraint_schema" AND "rc"."unique_constraint_name" = "kcu2"."constraint_name" AND "kcu"."position_in_unique_constraint" = "kcu2"."ordinal_position" 
WHERE "t"."table_name" = 'users' AND "t"."table_type" IN ('BASE TABLE', 'VIEW') AND "t"."table_schema" = 'public' 
ORDER BY CASE "tc"."constraint_type" WHEN 'PRIMARY KEY' 
THEN 1 WHEN 'UNIQUE' THEN 2 WHEN 'FOREIGN KEY' THEN 3 WHEN 'CHECK' 
THEN 4 ELSE 5 END, "tc"."constraint_name", "kcu"."ordinal_position"

It works very slowly: 43 rows1656 ms
Please fix it

@alextech
Copy link
Contributor

alextech commented Jan 3, 2017

This problem is starting to affect me too in my migrations project having these queries run on every page request. With some research turns out information_schema implementation for sake of compatibility with ANSI SQL has too many internal joins causing massive execution plan even on smallest of databases. Its basically a view over actual catalogue tables. Since Zend DB has a platform specific override for metadata, usage of this very slow abstraction may not be justified. Abstraction on top of abstraction. However, if we query against catalogue directly there is a possibility postgresql may change them between versions and we have to chase those. I do not know how much, if at all, it changes.

Anyway, for now, can I please ask you for a favor and instead of first query you posted run

SELECT 
c.relname as table_name,
CASE 
    WHEN c.oid = pg_my_temp_schema() THEN 'LOCAL TEMPORARY'
    WHEN c.relkind = 'r' THEN 'BASE TABLE'
    WHEN c.relkind = 'v' THEN 'VIEW'
    WHEN c.relkind = 'f' THEN 'FOREIGN TABLE'
END AS table_kind,
CASE
    WHEN c.relkind = 'v' THEN (select pg_get_viewdef(c.relname, true))
END as view_definition
FROM pg_class c
    JOIN pg_namespace n
        ON n.oid = c.relnamespace 
WHERE n.nspname = 'public'
AND (c.relkind = 'r' OR c.relkind = 'v')

And see (1) you do not lose any data and (2) it executes fast enough to be considered as a viable replacement?

(This does not yet include last 2 columns used by ZF Metadata, check_option, and is_updatable, I really want to see if this would help at all before researching what those are ).

That query has less joins and scans to go through without going through ANSI abstraction (screenshots included for comparison).
table names plan comparison

If this is worth it and maintainers believe doing own implementation of queries without ANSI is safe for us to do, I could consider to try converting other monsters too.

@alextech
Copy link
Contributor

alextech commented Jan 3, 2017

P.S. Maybe also consider lazy loading some of table properties, like constraints and columns. I feel full metadata detail queried at once will still cause huge execution plans, even if avoid information_schema. In my current usecase I just need table name, yet I have column and constraints queried too. Maybe those class properties can be replaced with lazy loading proxies instead of populated ahead of time. Then if anyone does hit those getters, multiple SQL quries will still be more performant than those huge things.

Can also rework queries to be more specific to framework's purpose. For example, ZF does not actually need a large CASE statement for table type, because Metadata just does

switch ($data['table_type']) {
            case 'BASE TABLE':
                $table = new Object\TableObject($tableName);
                break;
            case 'VIEW':
                $table = new Object\ViewObject($tableName);
                break;
            default:
                throw new \Exception('Table "' . $tableName . '" is of an unsupported type "' . $data['table_type'] . '"');
        }

which might as well be 'r', 'v', respectively.

Anyway, looking forward to your result, if you can.

@alextech
Copy link
Contributor

alextech commented Jan 4, 2017

OO OO this one is even faster!

SELECT 
c.relname as table_name,
c.relkind as table_type
FROM pg_class c
WHERE c.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'public')
AND (c.relkind = 'r' OR c.relkind = 'v') 

table names plan comparison subquery

Without a join, there are only 38 rows to work with. Annoyingly, fields needed from pg_* tables are not indexed.

@Ocramius
Copy link
Member

Ocramius commented Jan 4, 2017 via email

@alextech
Copy link
Contributor

alextech commented Jan 4, 2017

That's why I dismissed it at first too. DDL is either one time thing or a database admin utility (if someone is writing a custom one for whatever reason), both of which usecases have no need to be that performant. I've seen metadata used as part of page load to check whether database version is "safe" to use against what application expects, but that too was mostly for convenience during dev and is trivial to do manually.

What got me worried was other features that use metadata without users realizing inadvertently running extra queries. MetadataFeature for TableGateway, example I can think of right now, allows to populate table gateway class instance with details for operating it, such as what column is primary key, is auto_incremented so inserts/updates do not cause conflicts. So if instead of manually describing to table gateway the structure of the table in constructor as one typically should, they just throw new MetadataFeature() as the framework encourages to do, given the existence of this thing, and if this is done for every table in application, can get explosive. I wouldn't do it, but I can see others do it. I too am curious about original poster's usecase, just how important is this?

@alextech
Copy link
Contributor

alextech commented Jan 4, 2017

Sorry, never mind about auto_increment thing. I mixed it up with sequence feature that does that. I do not THINK table gateway does that on its own, in which case purpose of automatically populating its column fields seems less important, but it still gives option to do so for whatever reason.

mat128 added a commit to mat128/zend-db that referenced this issue Jun 13, 2018
@mat128 mat128 linked a pull request Jun 13, 2018 that will close this issue
@mat128
Copy link

mat128 commented Jun 13, 2018

While I agree that a more through analysis of query performance should be done, the query you provided worked perfectly for me.

I opened #321 so others could benefit from the performance improvement.

@michalbundyra
Copy link
Member

This repository has been closed and moved to laminas/laminas-db; a new issue has been opened at laminas/laminas-db#86.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants