This repository has been archived by the owner on Jul 25, 2024. It is now read-only.

(very)WIP: Paginate comparison to improve performance #883

Draft: wants to merge 1 commit into master from fix-build-compare-with-1M-tests

Conversation

@chaws
Collaborator

chaws commented Oct 23, 2020

Fixes #846

This is an early WIP that attempts to improve load times for comparison operations.

Problem

Here's what currently happens (in very ugly pseudocode, summarized from the real code):

1. builds = [baseline_build, target_build]
2. testruns = TestRun.filter(builds)
3.
4. # take batches of 100 testruns
5. tests = []
6. for split_testruns in split_dict(testruns, 100):
7.    tests += Test.filter(split_testruns)
8.
9. results_table = apply_transitions(tests, [('pass', 'fail'), ('fail', 'pass')])
10. page = Paginate(results_table, page=request.GET.get('page'), per_page=50)

The problem starts on lines 5-7, where all tests are loaded into memory. A while ago I changed this code so we could fetch tests without a JOIN between Test and TestRun, by fetching tests in batches of 100 testruns. That solved an issue that was slowing the DB down.
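For context, the batching looks roughly like this in Django ORM terms (a minimal sketch; the chunks() helper, the variable names, and the module path are illustrative, not the real code):

from squad.core.models import Test, TestRun  # assumed module path

def chunks(items, size):
    # Yield successive slices of `items` with at most `size` elements each.
    for i in range(0, len(items), size):
        yield items[i:i + size]

testrun_ids = list(TestRun.objects
                   .filter(build__in=[baseline_build, target_build])
                   .values_list('id', flat=True))

tests = []
for batch in chunks(testrun_ids, 100):
    # One `IN (...)` query per batch avoids the Test/TestRun JOIN, but every
    # matching test still ends up in this Python list.
    tests += list(Test.objects.filter(test_run_id__in=batch))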

With android-lkft test runs containing millions of tests, this snippet no longer performs well. The query runs really slowly!

There's a second problem, on lines 9-10: a full comparison is computed across all tests of both builds, only to then display a teensy-weensy portion of it.
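That last Paginate step is ordinary in-memory pagination; in Django it looks roughly like this (a sketch, assuming results_table is the full list of comparison rows):

from django.core.paginator import Paginator

# results_table already holds every compared row at this point; the
# Paginator only slices the list, it doesn't reduce what was fetched.
paginator = Paginator(results_table, 50)
page = paginator.get_page(request.GET.get('page'))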

Idea

The ideal solution would be to get the list of tests to be displayed before running the full comparison. The problem is that there's no easy way to get that list without either loading all tests into memory (bad) or JOINing Test and Build ON TestRun (really bad).

The idea of this PR is a best effort towards that. We can get a smaller, per-page list of tests from the DB if we restrict filtering to one Suite and one (or a few) TestRuns.

Here's what I'm suggesting:

suite = Suite.filter(request.GET.get('suite_slug'))
page = int(request.GET.get('page', 1))
per_page = 50
offset = (page - 1) * per_page

builds = [baseline_build, target_build]
testruns = TestRun.filter(builds)
statuses = Status.filter(testruns, suite)

# Now keep only the testruns that contain suite_slug
testruns = [s.test_run for s in statuses]

# testruns should be 1 or a small number
tests = Test.filter(testruns, suite).order_by('metadata__name')[offset:offset + per_page]

page_results = apply_transitions(tests, [('pass', 'fail'), ('fail', 'pass')])

The biggest difference is that we'd be fetching already-paginated tests from the DB, so instead of loading millions of tests, only per_page=50 are loaded.

I've pulled the build with 1M+ tests and run a comparison against a build with far fewer tests, and it loaded almost instantaneously. I guess the DB is OK with applying OFFSET and LIMIT to a dataset with millions of tests; it just doesn't like doing joins on a table with half a billion records.

I still need to figure out what to do when the number of items in the page is less than per_page after apply_transitions. Maybe an extra trip to the DB can be made, I dunno yet.
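One possible shape for that extra trip (purely a sketch of the open question, not a settled design): keep fetching the next DB slice until the page fills up or the data runs out.

rows = []
offset = (page - 1) * per_page
while len(rows) < per_page:
    batch = list(Test.objects
                 .filter(test_run__in=testruns, suite=suite)
                 .order_by('metadata__name')[offset:offset + per_page])
    if not batch:
        break  # no more tests in the DB for this suite
    rows += apply_transitions(batch, [('pass', 'fail'), ('fail', 'pass')])
    offset += per_page
rows = rows[:per_page]

The catch is that dropped rows shift every later page's offset, so some cursor or count bookkeeping would still be needed.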

Thoughts? @mwasilew @stevanradakovic

@chaws chaws marked this pull request as draft October 23, 2020 12:05
@chaws chaws force-pushed the fix-build-compare-with-1M-tests branch from f4b09d5 to 1550d64 on October 23, 2020 12:06
@mwasilew
Contributor

@chaws hmm, in the case of CTS there are more than 600k tests belonging to a single suite. Will this still work in that case?

@mwasilew
Contributor

Ah, I forgot to mention that the reason we had this issue is that we deliberately decided to worry about the performance of build comparison only when it became a problem. Now it has :)

@chaws
Collaborator Author

chaws commented Oct 23, 2020

@mwasilew

hmm, in the case of CTS there are more than 600k tests belonging to a single suite. Will this still work in that case?

Yes! I'm no DB expert, but by trial and error I found that the DB handles a million-ish dataset with ease. Below is the query that fetches all tests from build 45976, the one reported in the issue:

productionqareports=> EXPLAIN ANALYZE SELECT 
  "core_test"."id", 
  "core_test"."test_run_id", 
  "core_test"."suite_id", 
  "core_test"."metadata_id", 
  "core_test"."result", 
  "core_test"."has_known_issues", 
  "core_suite"."slug" AS "suite_slug"
FROM "core_test"
  INNER JOIN "core_suite" ON ("core_test"."suite_id" = "core_suite"."id")
  LEFT OUTER JOIN "core_suitemetadata" ON ("core_test"."metadata_id" = "core_suitemetadata"."id")
WHERE "core_test"."test_run_id" IN (SELECT id FROM core_testrun WHERE build_id = 45976)
ORDER BY "core_suitemetadata"."name" ASC;

                                                                                    QUERY PLAN                                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1725652.96..1725792.26 rows=55723 width=121) (actual time=72396.237..80160.395 rows=1379759 loops=1)
   Sort Key: core_suitemetadata.name
   Sort Method: external merge  Disk: 231952kB
   ->  Hash Join  (cost=1318318.22..1717641.32 rows=55723 width=121) (actual time=18788.063..23098.515 rows=1379759 loops=1)
         Hash Cond: (core_test.suite_id = core_suite.id)
         ->  Merge Right Join  (cost=1315403.10..1711894.00 rows=55723 width=73) (actual time=18757.300..22761.288 rows=1379759 loops=1)
               Merge Cond: (core_suitemetadata.id = core_test.metadata_id)
               ->  Index Scan using core_suitemetadata_pkey on core_suitemetadata  (cost=0.43..375748.75 rows=8719637 width=59) (actual time=0.034..2857.290 rows=7996300 loops=1)
               ->  Sort  (cost=1315367.83..1315507.14 rows=55723 width=18) (actual time=18736.320..18929.985 rows=1379759 loops=1)
                     Sort Key: core_test.metadata_id
                     Sort Method: external sort  Disk: 43168kB
                     ->  Nested Loop  (cost=1.00..1310975.19 rows=55723 width=18) (actual time=3.088..17502.126 rows=1379759 loops=1)
                           ->  Index Scan using core_testrun_9b69b72a on core_testrun  (cost=0.43..423.18 rows=143 width=4) (actual time=0.026..0.185 rows=22 loops=1)
                                 Index Cond: (build_id = 45976)
                           ->  Index Scan using core_test_ba18909e on core_test  (cost=0.57..8583.61 rows=58109 width=18) (actual time=1.281..782.847 rows=62716 loops=22)
                                 Index Cond: (test_run_id = core_testrun.id)
         ->  Hash  (cost=1433.39..1433.39 rows=66539 width=52) (actual time=30.498..30.498 rows=66895 loops=1)
               Buckets: 65536  Batches: 2  Memory Usage: 3306kB
               ->  Seq Scan on core_suite  (cost=0.00..1433.39 rows=66539 width=52) (actual time=0.010..10.417 rows=66895 loops=1)
 Planning time: 12.659 ms
 Execution time: 80336.812 ms

The query alone takes 80 seconds to run, and that's not counting the time Django takes to turn the data into objects!

Now, if we restrict the same query to tests from the cts suite (id 65615, 600k+ tests) and paginate it, we get:

productionqareports=> EXPLAIN ANALYZE SELECT 
  "core_test"."id",
  "core_test"."test_run_id",
  "core_test"."suite_id",
  "core_test"."metadata_id",
  "core_test"."result",
  "core_test"."has_known_issues"
FROM "core_test"
  LEFT OUTER JOIN "core_suitemetadata" ON ("core_test"."metadata_id" = "core_suitemetadata"."id")
WHERE ("core_test"."suite_id" = 65615 AND "core_test"."test_run_id" IN (SELECT id FROM core_testrun WHERE build_id = 45976))
ORDER BY "core_suitemetadata"."name" ASC
LIMIT 50
OFFSET 10;
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1250239.00..1250239.12 rows=50 width=73) (actual time=3646.390..3646.403 rows=50 loops=1)
   ->  Sort  (cost=1250238.97..1250239.39 rows=165 width=73) (actual time=3646.387..3646.399 rows=60 loops=1)
         Sort Key: core_suitemetadata.name
         Sort Method: top-N heapsort  Memory: 34kB
         ->  Nested Loop Left Join  (cost=1.44..1250233.27 rows=165 width=73) (actual time=583.304..3033.245 rows=664842 loops=1)
               ->  Nested Loop  (cost=1.00..1248899.18 rows=165 width=18) (actual time=583.256..1548.414 rows=664842 loops=1)
                     ->  Index Scan using core_testrun_9b69b72a on core_testrun  (cost=0.43..423.18 rows=143 width=4) (actual time=0.053..1.267 rows=22 loops=1)
                           Index Cond: (build_id = 45976)
                     ->  Index Scan using core_test_ba18909e on core_test  (cost=0.57..8728.88 rows=172 width=18) (actual time=35.829..65.465 rows=30220 loops=22)
                           Index Cond: (test_run_id = core_testrun.id)
                           Filter: (suite_id = 65615)
                           Rows Removed by Filter: 32496
               ->  Index Scan using core_suitemetadata_pkey on core_suitemetadata  (cost=0.43..8.08 rows=1 width=59) (actual time=0.002..0.002 rows=1 loops=664842)
                     Index Cond: (core_test.metadata_id = id)
 Planning time: 21.687 ms
 Execution time: 3646.578 ms

This one takes almost 4 seconds, more than 20x faster!
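For the record, that LIMIT 50 OFFSET 10 is what Django queryset slicing emits; the ORM expression would look roughly like this (a sketch matching the tables in the plan, not code from this PR):

tests = (Test.objects
         .filter(suite_id=65615,
                 test_run__in=TestRun.objects.filter(build_id=45976))
         .order_by('metadata__name')  # LEFT JOINs core_suitemetadata
         [10:60])                     # OFFSET 10, LIMIT 50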

@mwasilew
Contributor

@chaws impressive!

@chaws
Collaborator Author

chaws commented Oct 26, 2020

I've done some more experimenting: fetching 1000 tests takes the same amount of time as fetching 50, while fetching over 10k tests makes the query dozens of times slower, because the DB has to sort the rows on disk.
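That matches the plans above: the paginated query sorted with an in-memory top-N heapsort, while the full query fell back to an on-disk external merge. If bigger pages were ever needed, one knob to experiment with is PostgreSQL's work_mem (a session-level sketch; the value is illustrative, not a recommendation):

-- Give this session's sorts more memory so they can stay in RAM.
SET work_mem = '256MB';

-- Re-run the paginated query with a larger LIMIT and check whether the
-- plan still says "top-N heapsort" instead of "external merge".
EXPLAIN ANALYZE
SELECT "core_test"."id", "core_test"."result"
FROM "core_test"
LEFT OUTER JOIN "core_suitemetadata"
  ON ("core_test"."metadata_id" = "core_suitemetadata"."id")
WHERE "core_test"."suite_id" = 65615
  AND "core_test"."test_run_id" IN (SELECT id FROM core_testrun WHERE build_id = 45976)
ORDER BY "core_suitemetadata"."name" ASC
LIMIT 10000;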

@mwasilew
Contributor

I think pagination combined with search/filtering should do the trick. There is really no need to show all compared results at the same time.

@chaws
Collaborator Author

chaws commented Oct 29, 2020

I've been thinking the same thing. I'm also considering making comparison an API endpoint to be called from the comparison view; the view would then make sequential, smaller calls to the backend.
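A rough shape for that endpoint (hypothetical names throughout; it assumes SQUAD's models are importable from squad.core.models and that apply_transitions returns JSON-serializable rows):

from django.http import JsonResponse

from squad.core.models import Test, TestRun  # assumed module path

def compare_page(request, baseline_id, target_id):
    """Return one page of comparison results for a single suite."""
    suite_slug = request.GET.get('suite')
    page = int(request.GET.get('page', 1))
    per_page = 50
    offset = (page - 1) * per_page

    # Same flow as the pseudocode above: narrow down to the two builds'
    # testruns first, then let the DB paginate the tests.
    testruns = TestRun.objects.filter(build_id__in=[baseline_id, target_id])
    tests = (Test.objects
             .filter(test_run__in=testruns, suite__slug=suite_slug)
             .order_by('metadata__name')[offset:offset + per_page])

    # apply_transitions as in the pseudocode above.
    results = apply_transitions(tests, [('pass', 'fail'), ('fail', 'pass')])
    return JsonResponse({'page': page, 'results': results})

The comparison page would then request page=1, page=2, ... as the user scrolls, keeping each backend round trip small.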

@mwasilew I want to finish moving files to S3 (almost there) before I get back to this PR.

@mwasilew
Contributor

Sure, I think the current changes in progress are more important than the ones not yet started.

Successfully merging this pull request may close these issues: Build compare time out (#846).