Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moving to Elastic Search was a mistake #1

Open
ctrlcctrlv opened this issue Nov 28, 2022 · 4 comments
Open

Moving to Elastic Search was a mistake #1

ctrlcctrlv opened this issue Nov 28, 2022 · 4 comments

Comments

@ctrlcctrlv
Copy link

image

…in general, but at least for pagination/non-full text lookups (and even for these there are GIN and GiST indexes).

In Discord I was told it was moved to due to problems with slow lookups.

I have solved this.

While originally I tried a bit of an insane solution with hash indices, real world testing showed that a fuckload of btree indices narrowly beats that.

posts_post_fk_view

CREATE MATERIALIZED VIEW posts_post_fk_view AS
 SELECT pp.*,
    pl.name::text AS platform_name,
    bo.name::text AS board_name,
    pp.thread_id::text AS thread_id_t,
    pp.post_id::text AS post_id_t
   FROM posts_post pp
     JOIN posts_platform pl ON pl.id = pp.platform_id
     JOIN posts_board bo ON bo.id = pp.board_id;

Indices

CREATE INDEX posts_post_fk_view_pkey ON posts_post_fk_view USING BTREE(id);
CREATE INDEX posts_post_view_platform_board_thread_sort_idx ON posts_post_view USING BTREE(platform_name, board_name, thread_id);
CREATE INDEX posts_post_view_platform_board_sort_idx ON posts_post_view USING BTREE(platform_name, board_name);
CREATE INDEX posts_post_view_board_thread_sort_idx ON posts_post_view USING BTREE(board_name, thread_id);
CREATE INDEX posts_post_view_board_thread_sort_idx ON posts_post_view USING BTREE(board_name, thread_id);
CREATE INDEX posts_post_view_board_thread_post_sort_idx ON posts_post_view USING BTREE(board_name, thread_id, post_id);
CREATE INDEX posts_post_view_board_thread_post_sort_idx ON posts_post_view USING BTREE(board_name, thread_id, post_id);
CREATE INDEX posts_post_fk_view_board_thread_post_sort_idx ON posts_post_fk_view USING BTREE(board_name, thread_id, post_id);
CREATE INDEX posts_post_fk_view_timestamp_idx ON posts_post_fk_view USING BTREE(timestamp);
CREATE INDEX posts_post_fk_view_board_thread_sort_idx ON posts_post_fk_view USING BTREE(board_name, thread_id);
CREATE INDEX posts_post_fk_view_platform_sort_idx ON posts_post_fk_view USING BTREE(platform_name); 
CREATE INDEX posts_post_fk_view_platform_board_sort_idx ON posts_post_fk_view USING BTREE(platform_name, board_name);
CREATE INDEX posts_post_fk_view_platform_board_thread_sort_idx ON posts_post_fk_view USING BTREE(platform_name, board_name, thread_id);
CREATE INDEX posts_post_fk_view_platform_board_thread_post_sort_idx ON posts_post_fk_view USING BTREE(platform_name, board_name, thread_id, post_id);
CREATE INDEX posts_post_fk_view_thread_post_sort_idx ON posts_post_fk_view USING BTREE(thread_id, post_id);
CREATE INDEX posts_post_fk_view_platform_board_sort_idx ON posts_post_fk_view USING BTREE(platform_name, board_name);
CREATE INDEX posts_post_fk_view_board_sort_idx ON posts_post_fk_view USING BTREE(board_name);
CREATE INDEX posts_post_fk_view_thread_sort_idx ON posts_post_fk_view USING BTREE(thread_id);
CREATE INDEX posts_post_fk_view_post_sort_idx ON posts_post_fk_view USING BTREE(post_id);

Benchmark: From 4800ms to 4ms

dchan=# EXPLAIN (ANALYZE ON, VERBOSE ON, WAL ON, BUFFERS ON, SETTINGS ON, TIMING ON, COSTS ON, SUMMARY ON, FORMAT TEXT) SELECT * FROM posts_post pp JOIN posts_board bo ON bo.id = pp.board_id JOIN posts_platform pl ON pl.id = pp.platform_id WHERE pl.name = '8kun' AND bo.name = 'qresearch' ORDER BY TIMESTAMP DESC LIMIT 10000;
                                                                                                                                                                    QUERY PLAN                                                                                                                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=154466.46..154466.47 rows=6 width=884) (actual time=4659.433..4663.587 rows=10000 loops=1)
   Output: pp.id, pp.thread_id, pp.post_id, pp.author, pp.poster_hash, pp.subject, pp.body, pp."timestamp", pp.tripcode, pp.is_op, pp.links, pp.search_vector, pp.created_at, pp.last_modified, pp.board_id, pp.platform_id, pp.replies, pp.body_html, pp.source, bo.id, bo.name, bo.platform_id, bo.migrated_to_8kun, pl.id, pl.name
   Buffers: shared hit=2 read=144815, temp read=273408 written=513239
   ->  Sort  (cost=154466.46..154466.47 rows=6 width=884) (actual time=4499.871..4503.326 rows=10000 loops=1)
         Output: pp.id, pp.thread_id, pp.post_id, pp.author, pp.poster_hash, pp.subject, pp.body, pp."timestamp", pp.tripcode, pp.is_op, pp.links, pp.search_vector, pp.created_at, pp.last_modified, pp.board_id, pp.platform_id, pp.replies, pp.body_html, pp.source, bo.id, bo.name, bo.platform_id, bo.migrated_to_8kun, pl.id, pl.name
         Sort Key: pp."timestamp" DESC
         Sort Method: external merge  Disk: 965248kB
         Buffers: shared hit=2 read=144815, temp read=273408 written=513239
         ->  Nested Loop  (cost=0.73..154466.38 rows=6 width=884) (actual time=0.061..1958.435 rows=1348719 loops=1)
               Output: pp.id, pp.thread_id, pp.post_id, pp.author, pp.poster_hash, pp.subject, pp.body, pp."timestamp", pp.tripcode, pp.is_op, pp.links, pp.search_vector, pp.created_at, pp.last_modified, pp.board_id, pp.platform_id, pp.replies, pp.body_html, pp.source, bo.id, bo.name, bo.platform_id, bo.migrated_to_8kun, pl.id, pl.name
               Join Filter: (pp.board_id = bo.id)
               Rows Removed by Join Filter: 48565
               Buffers: shared hit=2 read=144815, temp written=119954
               ->  Index Scan using unique_board on public.posts_board bo  (cost=0.15..19.47 rows=2 width=155) (actual time=0.027..0.028 rows=1 loops=1)
                     Output: bo.id, bo.name, bo.platform_id, bo.migrated_to_8kun
                     Index Cond: ((bo.name)::text = 'qresearch'::text)
                     Buffers: shared hit=1 read=1
               ->  Materialize  (cost=0.58..154409.29 rows=1368 width=729) (actual time=0.029..1792.513 rows=1397284 loops=1)
                     Output: pp.id, pp.thread_id, pp.post_id, pp.author, pp.poster_hash, pp.subject, pp.body, pp."timestamp", pp.tripcode, pp.is_op, pp.links, pp.search_vector, pp.created_at, pp.last_modified, pp.board_id, pp.platform_id, pp.replies, pp.body_html, pp.source, pl.id, pl.name
                     Buffers: shared hit=1 read=144814, temp written=119954
                     ->  Nested Loop  (cost=0.58..154402.45 rows=1368 width=729) (actual time=0.025..588.462 rows=1397284 loops=1)
                           Output: pp.id, pp.thread_id, pp.post_id, pp.author, pp.poster_hash, pp.subject, pp.body, pp."timestamp", pp.tripcode, pp.is_op, pp.links, pp.search_vector, pp.created_at, pp.last_modified, pp.board_id, pp.platform_id, pp.replies, pp.body_html, pp.source, pl.id, pl.name
                           Buffers: shared hit=1 read=144814
                           ->  Index Scan using posts_platform_name_d8719698_like on public.posts_platform pl  (cost=0.15..8.17 rows=1 width=50) (actual time=0.006..0.006 rows=1 loops=1)
                                 Output: pl.id, pl.name
                                 Index Cond: ((pl.name)::text = '8kun'::text)
                                 Buffers: shared hit=1 read=1
                           ->  Index Scan using posts_post_platform_id_63b50d92 on public.posts_post pp  (cost=0.43..147279.33 rows=711495 width=679) (actual time=0.014..496.234 rows=1397284 loops=1)
                                 Output: pp.id, pp.thread_id, pp.post_id, pp.author, pp.poster_hash, pp.subject, pp.body, pp."timestamp", pp.tripcode, pp.is_op, pp.links, pp.search_vector, pp.created_at, pp.last_modified, pp.board_id, pp.platform_id, pp.replies, pp.body_html, pp.source
                                 Index Cond: (pp.platform_id = pl.id)
                                 Buffers: shared read=144813
 Settings: enable_seqscan = 'off', jit_inline_above_cost = '0', jit_optimize_above_cost = '0'
 Planning Time: 0.628 ms
 JIT:
   Functions: 18
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.054 ms, Inlining 6.549 ms, Optimization 94.448 ms, Emission 58.567 ms, Total 161.618 ms
 Execution Time: 4826.601 ms
(38 rows)
dchan=# EXPLAIN (ANALYZE ON, VERBOSE ON, WAL ON, BUFFERS ON, SETTINGS ON, TIMING ON, COSTS ON, SUMMARY ON, FORMAT TEXT) SELECT id FROM posts_post_fk_view WHERE (platform_name = '8kun' AND board_name = 'qresearch') ORDER BY TIMESTAMP DESC LIMIT 10000;
                                                                                        QUERY PLAN                                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..3018.97 rows=10000 width=16) (actual time=0.013..4.163 rows=10000 loops=1)
   Output: id, "timestamp"
   Buffers: shared hit=4888
   ->  Index Scan Backward using posts_post_fk_view_timestamp_idx on public.posts_post_fk_view  (cost=0.43..399825.67 rows=1324563 width=16) (actual time=0.012..3.748 rows=10000 loops=1)
         Output: id, "timestamp"
         Filter: ((posts_post_fk_view.platform_name = '8kun'::text) AND (posts_post_fk_view.board_name = 'qresearch'::text))
         Rows Removed by Filter: 1
         Buffers: shared hit=4888
 Settings: enable_seqscan = 'off', jit_inline_above_cost = '0', jit_optimize_above_cost = '0'
 Planning Time: 0.094 ms
 Execution Time: 4.471 ms
(11 rows)
@ctrlcctrlv
Copy link
Author

(Note: You should not use enable_seqscan = 'off' on a real production server. It is good to enable locally for testing so that subsequent SELECT's never rely on sequential scans of RAM, as in a real server RAM will change a lot faster.)

@ctrlcctrlv
Copy link
Author

This also significantly helps the (platform, board, thread) case:

Benchmark: 3.395ms to 0.231ms

dchan=# EXPLAIN (ANALYZE ON, VERBOSE ON, WAL OFF, BUFFERS OFF, SETTINGS ON, TIMING ON, COSTS ON, SUMMARY ON, FORMAT TEXT) SELECT id FROM posts_post_fk_view WHERE (platform_name = '8kun' AND board_name = 'qresearch' AND thread_id = '579816'::bigint) LIMIT 10000;
                                                                                        QUERY PLAN                                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..1222.62 rows=667 width=8) (actual time=0.017..0.204 rows=752 loops=1)
   Output: id
   ->  Index Scan using posts_post_fk_view_platform_board_thread_sort_idx on public.posts_post_fk_view  (cost=0.43..1222.62 rows=667 width=8) (actual time=0.016..0.169 rows=752 loops=1)
         Output: id
         Index Cond: ((posts_post_fk_view.platform_name = '8kun'::text) AND (posts_post_fk_view.board_name = 'qresearch'::text) AND (posts_post_fk_view.thread_id = '579816'::bigint))
 Settings: enable_seqscan = 'off', jit_inline_above_cost = '0', jit_optimize_above_cost = '0'
 Planning Time: 0.094 ms
 Execution Time: 0.231 ms
(8 rows)

dchan=# EXPLAIN (ANALYZE ON, VERBOSE ON, WAL OFF, BUFFERS OFF, SETTINGS ON, TIMING ON, COSTS ON, SUMMARY ON, FORMAT TEXT) SELECT * FROM posts_post pp JOIN posts_board bo ON bo.id = pp.board_id JOIN posts_platform pl ON pl.id = pp.platform_id WHERE pl.name = '8kun' AND bo.name = 'qresearch' AND thread_id='579816'::bigint;
                                                                                                                                                              QUERY PLAN                                                                                                                                                              
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=21.75..2747.56 rows=1 width=884) (actual time=0.229..3.254 rows=752 loops=1)
   Output: pp.id, pp.thread_id, pp.post_id, pp.author, pp.poster_hash, pp.subject, pp.body, pp."timestamp", pp.tripcode, pp.is_op, pp.links, pp.search_vector, pp.created_at, pp.last_modified, pp.board_id, pp.platform_id, pp.replies, pp.body_html, pp.source, bo.id, bo.name, bo.platform_id, bo.migrated_to_8kun, pl.id, pl.name
   Inner Unique: true
   ->  Nested Loop  (cost=21.60..2747.26 rows=1 width=729) (actual time=0.208..1.529 rows=752 loops=1)
         Output: pp.id, pp.thread_id, pp.post_id, pp.author, pp.poster_hash, pp.subject, pp.body, pp."timestamp", pp.tripcode, pp.is_op, pp.links, pp.search_vector, pp.created_at, pp.last_modified, pp.board_id, pp.platform_id, pp.replies, pp.body_html, pp.source, pl.id, pl.name
         Join Filter: (pp.platform_id = pl.id)
         ->  Index Scan using posts_platform_name_d8719698_like on public.posts_platform pl  (cost=0.15..8.17 rows=1 width=50) (actual time=0.060..0.062 rows=1 loops=1)
               Output: pl.id, pl.name
               Index Cond: ((pl.name)::text = '8kun'::text)
         ->  Bitmap Heap Scan on public.posts_post pp  (cost=21.45..2730.31 rows=703 width=679) (actual time=0.140..1.208 rows=752 loops=1)
               Output: pp.id, pp.thread_id, pp.post_id, pp.author, pp.poster_hash, pp.subject, pp.body, pp."timestamp", pp.tripcode, pp.is_op, pp.links, pp.search_vector, pp.created_at, pp.last_modified, pp.board_id, pp.platform_id, pp.replies, pp.body_html, pp.source
               Recheck Cond: (pp.thread_id = '579816'::bigint)
               Heap Blocks: exact=92
               ->  Bitmap Index Scan on posts_post_thread_hash_idx  (cost=0.00..21.27 rows=703 width=0) (actual time=0.098..0.098 rows=752 loops=1)
                     Index Cond: (pp.thread_id = '579816'::bigint)
   ->  Index Scan using posts_board_pkey on public.posts_board bo  (cost=0.15..0.24 rows=1 width=155) (actual time=0.001..0.001 rows=1 loops=752)
         Output: bo.id, bo.name, bo.platform_id, bo.migrated_to_8kun
         Index Cond: (bo.id = pp.board_id)
         Filter: ((bo.name)::text = 'qresearch'::text)
 Settings: enable_seqscan = 'off', jit_inline_above_cost = '0', jit_optimize_above_cost = '0'
 Planning Time: 0.676 ms
 Execution Time: 3.395 ms
(22 rows)

@KunstDerFuge
Copy link
Owner

This is all extremely promising! I'm excited to include this all in the codebase. I was initially hesitant about including raw SQL as it seems un Django-ish (Djangoistic?) but after looking through documentation it seems there is currently no way to create managed materialized views natively in Django.

Getting the views to interface with Django is an additional step, but it looks like a pretty straightforward one. A Django model needs to be created for the view with metadata specifying managed = False and given the table name for the (already-created) view. This can then be queried as usual by the Django ORM.

So it seems like a reasonable next step would be to create perhaps a new management command for the project which executes a raw SQL query dropping and then creating all of the views needed, then create corresponding Django models for each new view.

As for new indices, there is a native way to create them on models and it appears btree is the default. So going forward I'd like to go with this rule of thumb:

It's okay to use raw SQL for creating views and indices, but all actual querying of records should be done through the ORM with QuerySets where possible.

I'll do my best to implement the needed changes with those restrictions in mind!

@ctrlcctrlv
Copy link
Author

That sounds reasonable, although some of my other queries may be too insane to easily map to the ORM, esp #3's.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants