[batch] reduce CPU use of delete_prev_..._resource_records #14366

danking · 2024-02-27T21:02:52Z

I assessed three query options, but keep in mind that the database is currently at nearly 100% CPU so these timings are high variance. I'm honestly pretty surprised that the first one isn't the fastest. It seems to me like the obvious SQL query and yet it does not perform as well as those using lateral joins.

Option 2 and Option 3 both use lateral joins. Option 2 uses a LEFT JOIN followed by a WHERE ... IS NOT NULL which is definitionally an INNER JOIN. For that reason, I also explored Option 3, directly using INNER JOIN, which appears to be very slightly faster and lower variance.

Option	Timing (mean + stddev, in seconds)
1	9.92 +- 2.02
2	5.35 +- 1.87
3	5.11 +- 1.15

Option 1

SELECT DISTINCT
  group_resources.batch_id,
  group_resources.update_id,
  group_resources.job_group_id
FROM job_group_inst_coll_cancellable_resources AS group_resources
INNER JOIN job_group_self_and_ancestors AS descendant
   ON descendant.batch_id     = group_resources.batch_id
  AND descendant.job_group_id = group_resources.job_group_id
INNER JOIN job_groups_cancelled AS cancelled
   ON descendant.batch_id     = cancelled.id
  AND descendant.ancestor_id  = cancelled.job_group_id
ORDER BY group_resources.batch_id desc, group_resources.update_id desc, group_resources.job_group_id desc
LIMIT 1000;

1000 rows in set (11.81 sec)
1000 rows in set (8.91 sec)
1000 rows in set (11.86 sec)
1000 rows in set (7.08 sec)

Option 2

SELECT DISTINCT
  group_resources.batch_id,
  group_resources.update_id,
  group_resources.job_group_id
FROM job_group_inst_coll_cancellable_resources AS group_resources
LEFT JOIN LATERAL (
  SELECT
    1 AS cancelled
  FROM job_group_self_and_ancestors AS descendant
  INNER JOIN job_groups_cancelled AS cancelled
     ON descendant.batch_id     = cancelled.id
    AND descendant.ancestor_id  = cancelled.job_group_id
  WHERE descendant.batch_id     = group_resources.batch_id
    AND descendant.job_group_id = group_resources.job_group_id
) AS t ON TRUE
WHERE t.cancelled IS NOT NULL
ORDER BY group_resources.batch_id desc, group_resources.update_id desc, group_resources.job_group_id desc
LIMIT 1000;

1000 rows in set (3.68 sec)
1000 rows in set (5.24 sec)
1000 rows in set (8.43 sec)
1000 rows in set (4.03 sec)

Option 3

SELECT DISTINCT
  group_resources.batch_id,
  group_resources.update_id,
  group_resources.job_group_id
FROM job_group_inst_coll_cancellable_resources AS group_resources
INNER JOIN LATERAL (
  SELECT
    1
  FROM job_group_self_and_ancestors AS descendant
  INNER JOIN job_groups_cancelled AS cancelled
     ON descendant.batch_id = cancelled.id
    AND descendant.ancestor_id = cancelled.job_group_id
  WHERE descendant.batch_id = group_resources.batch_id
    AND descendant.job_group_id = group_resources.job_group_id
) AS t ON TRUE
ORDER BY group_resources.batch_id desc, group_resources.update_id desc, group_resources.job_group_id desc
LIMIT 1000;

1000 rows in set (6.74 sec)
1000 rows in set (5.63 sec)
1000 rows in set (3.86 sec)
1000 rows in set (4.20 sec)

I assessed three query options, but keep in mind that the database is currently at nearly 100% CPU so these timings are high variance. I'm honestly pretty surprised that the first one isn't the fastest. It seems to me like the obvious SQL query and yet it does not perform as well as those using lateral joins. Option 2 and Option 3 both use lateral joins. Option 2 uses a LEFT JOIN followed by a `WHERE ... IS NOT NULL` which is definitionally an `INNER JOIN`. For that reason, I also explored Option 3, directly using `INNER JOIN`, which appears to be very slightly faster and lower variance. | Option | Timing (mean + stddev, in seconds) | |--------|------------------------------------| | 1 | 9.92 +- 2.02 | | 2 | 5.35 +- 1.87 | | 3 | 5.11 +- 1.15 | ## Option 1 ``` SELECT DISTINCT group_resources.batch_id, group_resources.update_id, group_resources.job_group_id FROM job_group_inst_coll_cancellable_resources AS group_resources INNER JOIN job_group_self_and_ancestors AS descendant ON descendant.batch_id = group_resources.batch_id AND descendant.job_group_id = group_resources.job_group_id INNER JOIN job_groups_cancelled AS cancelled ON descendant.batch_id = cancelled.id AND descendant.ancestor_id = cancelled.job_group_id ORDER BY group_resources.batch_id desc, group_resources.update_id desc, group_resources.job_group_id desc LIMIT 1000; ``` ``` 1000 rows in set (11.81 sec) 1000 rows in set (8.91 sec) 1000 rows in set (11.86 sec) 1000 rows in set (7.08 sec) ``` ## Option 2 ``` SELECT DISTINCT group_resources.batch_id, group_resources.update_id, group_resources.job_group_id FROM job_group_inst_coll_cancellable_resources AS group_resources LEFT JOIN LATERAL ( SELECT 1 AS cancelled FROM job_group_self_and_ancestors AS descendant INNER JOIN job_groups_cancelled AS cancelled ON descendant.batch_id = cancelled.id AND descendant.ancestor_id = cancelled.job_group_id WHERE descendant.batch_id = group_resources.batch_id AND descendant.job_group_id = group_resources.job_group_id ) AS t ON TRUE WHERE t.cancelled IS NOT NULL ORDER BY group_resources.batch_id desc, group_resources.update_id desc, group_resources.job_group_id desc LIMIT 1000; ``` ``` 1000 rows in set (3.68 sec) 1000 rows in set (5.24 sec) 1000 rows in set (8.43 sec) 1000 rows in set (4.03 sec) ``` ## Option 3 ``` SELECT DISTINCT group_resources.batch_id, group_resources.update_id, group_resources.job_group_id FROM job_group_inst_coll_cancellable_resources AS group_resources INNER JOIN LATERAL ( SELECT 1 FROM job_group_self_and_ancestors AS descendant INNER JOIN job_groups_cancelled AS cancelled ON descendant.batch_id = cancelled.id AND descendant.ancestor_id = cancelled.job_group_id WHERE descendant.batch_id = group_resources.batch_id AND descendant.job_group_id = group_resources.job_group_id ) AS t ON TRUE ORDER BY group_resources.batch_id desc, group_resources.update_id desc, group_resources.job_group_id desc LIMIT 1000; ``` ``` 1000 rows in set (6.74 sec) 1000 rows in set (5.63 sec) 1000 rows in set (3.86 sec) 1000 rows in set (4.20 sec) ```

daniel-goldstein

Interesting. I also find the first query far easier to read than the lateral joins, but this seems the right change to make for now.

ehigham · 2024-07-25T19:35:27Z

Following up on this. I can't measure a difference between 1 and 3 and both execution plans are identical. In prod, both queries return empty.

danking · 2024-07-25T20:33:49Z

If the query plans are identical I’d expect identical performance. Maybe the DB version was incremented and the optimizer is better? These queries are only interesting when the system is under enough load to have “cancellable resources” i.e. running jobs.

daniel-goldstein · 2024-07-25T21:06:34Z

You would think, except for #14623... ideally you would be correct

ehigham · 2024-07-26T18:01:01Z

If the query plans are identical I’d expect identical performance. Maybe the DB version was incremented and the optimizer is better?

Thanks for your reply! Maybe/I hope so? IMO the EXPLAIN output can be a little hard to decipher and I suspect doesn't tell the full story. For academic purposes, I was also interested in seeing what difference GROUP BY has vs DISTINCT but 0 records makes this challening!

ehigham · 2024-07-26T18:21:02Z

We're now using MySQL 8.0.28-google, fwiw. Can you remember what you tested on? That might account for the differences you observed for sure.

danking · 2024-07-26T19:27:53Z

I do not recall but I think it was after the upgrade to major version 8

danking assigned daniel-goldstein Feb 27, 2024

danking force-pushed the fix-slow-groups-query branch from 32aa19b to 27d1fad Compare February 27, 2024 21:09

daniel-goldstein approved these changes Feb 27, 2024

View reviewed changes

daniel-goldstein added the prio:high label Feb 27, 2024

danking merged commit a325690 into hail-is:main Feb 27, 2024
3 checks passed

ehigham mentioned this pull request Jul 25, 2024

[batch] make batches query go brrrrrrr #14629

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[batch] reduce CPU use of delete_prev_..._resource_records #14366

[batch] reduce CPU use of delete_prev_..._resource_records #14366

danking commented Feb 27, 2024 •

edited

Loading

daniel-goldstein left a comment

ehigham commented Jul 25, 2024 •

edited

Loading

danking commented Jul 25, 2024

daniel-goldstein commented Jul 25, 2024

ehigham commented Jul 26, 2024 •

edited

Loading

ehigham commented Jul 26, 2024

danking commented Jul 26, 2024

[batch] reduce CPU use of delete_prev_..._resource_records #14366

[batch] reduce CPU use of delete_prev_..._resource_records #14366

Conversation

danking commented Feb 27, 2024 • edited Loading

Option 1

Option 2

Option 3

daniel-goldstein left a comment

Choose a reason for hiding this comment

ehigham commented Jul 25, 2024 • edited Loading

danking commented Jul 25, 2024

daniel-goldstein commented Jul 25, 2024

ehigham commented Jul 26, 2024 • edited Loading

ehigham commented Jul 26, 2024

danking commented Jul 26, 2024

danking commented Feb 27, 2024 •

edited

Loading

ehigham commented Jul 25, 2024 •

edited

Loading

ehigham commented Jul 26, 2024 •

edited

Loading