We are looking to improve the performance of gh-ost so that we can safely operate on larger databases. Our use case is a bit different from GitHub's in that we do not use read-replicas. Some of the DBs also have very cyclic usage (i.e. only busy 9-5 M-F), and may have windows of free capacity.
I have a few ideas I wanted to run by you since I’m sure some have come up prior:
Dynamic chunk sizes

gh-ost can observe the execution time of each chunk it processes and dynamically increase the chunk size while that time stays below a threshold. In our environment we typically run larger batch sizes (because we have a lot of replica-lag tolerance), but our DB instance sizes vary, so being able to auto-tune this would be a win for us.
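A minimal sketch of what such a feedback loop could look like, in Go (the constant names and the grow/shrink factors here are assumptions for illustration, not anything gh-ost currently implements):

```go
// Sketch of a chunk-size feedback loop. minChunkSize, maxChunkSize and
// targetChunkTime are illustrative assumptions.
package main

import (
	"fmt"
	"time"
)

const (
	minChunkSize    = 1000
	maxChunkSize    = 100000
	targetChunkTime = 500 * time.Millisecond
)

// nextChunkSize grows the chunk size while chunks complete comfortably under
// the target time, and shrinks it quickly when they run over.
func nextChunkSize(current int64, lastChunkDuration time.Duration) int64 {
	switch {
	case lastChunkDuration < targetChunkTime/2:
		current = current * 3 / 2 // plenty of headroom: grow by 50%
	case lastChunkDuration > targetChunkTime:
		current = current / 2 // over budget: back off aggressively
	}
	if current < minChunkSize {
		current = minChunkSize
	}
	if current > maxChunkSize {
		current = maxChunkSize
	}
	return current
}

func main() {
	size := int64(minChunkSize)
	// Simulated chunk timings; in practice these come from timing each copy query.
	for _, d := range []time.Duration{100 * time.Millisecond, 200 * time.Millisecond, 800 * time.Millisecond} {
		size = nextChunkSize(size, d)
		fmt.Println("next chunk size:", size)
	}
}
```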
Parallel binlog apply

Parallel replication apply is much better in MySQL 8.0. Combined with the fact that we don't use read-replicas, we can probably push more changes through the binlog than gh-ost currently does. We think we can tolerate a few minutes of replica lag; our limit is that Aurora restricts the relay log to ~1000M, and if we exceed that we reduce our DR capabilities. (Note: there's an earlier issue on this. It lacks the 8.0 parallel context, and the slowdown @shlomi-noach hit when he said it is slower is possibly this one. In any case, I've verified that I can bulk-parallel insert with improved performance.)
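To illustrate the kind of parallel apply I have in mind, here is a minimal Go sketch (an assumption about one possible approach, not gh-ost's current apply path): events are sharded by primary key so that changes to the same row are always applied in order by the same worker, while different rows proceed concurrently.

```go
// Sketch of parallel binlog apply. applyEvent is a hypothetical stand-in for
// executing the row DML against the ghost table.
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

type binlogEvent struct {
	primaryKey string
	statement  string
}

func applyEvent(ev binlogEvent) {
	// Placeholder: execute the row DML here.
	fmt.Println("applying:", ev.statement)
}

// parallelApply shards events by primary key across a fixed worker pool so
// per-row ordering is preserved while unrelated rows apply concurrently.
func parallelApply(events <-chan binlogEvent, workers int) {
	queues := make([]chan binlogEvent, workers)
	var wg sync.WaitGroup
	for i := range queues {
		queues[i] = make(chan binlogEvent, 256)
		wg.Add(1)
		go func(q <-chan binlogEvent) {
			defer wg.Done()
			for ev := range q {
				applyEvent(ev)
			}
		}(queues[i])
	}
	for ev := range events {
		h := fnv.New32a()
		h.Write([]byte(ev.primaryKey))
		queues[h.Sum32()%uint32(workers)] <- ev
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}

func main() {
	events := make(chan binlogEvent, 4)
	go func() {
		events <- binlogEvent{"1", "REPLACE INTO _t_gho ... /* pk=1 */"}
		events <- binlogEvent{"2", "DELETE FROM _t_gho WHERE id=2"}
		close(events)
	}()
	parallelApply(events, 4)
}
```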
Defer Binary Log Apply
Currently gh-ost prioritizes applying the binary log ahead of copying rows. I think it's possible to track in memory only the primary keys that were discovered in the binary log, plus whether the last modification was a delete (a bool). If this is kept in a map, it can be applied after the copy is done. The benefit of this change is most evident in workloads that tend to update the same rows repeatedly. Edit: this optimization requires mem-comparable primary keys, so it won't work on varchar primary keys with collations.
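A minimal sketch of the data structure, in Go (an assumption about how this could be structured, not gh-ost code; the table names and SQL are illustrative): during the copy we only remember which keys were touched and whether the last event was a delete, then after the copy we delete those rows from the ghost table or re-copy them from the source.

```go
// Sketch of deferred binlog apply. Keys must be mem-comparable
// (e.g. integers, binary strings), as noted above.
package main

import "fmt"

// touched maps a mem-comparable primary key to "last modification was a delete".
type deferredApplier struct {
	touched map[string]bool
}

func newDeferredApplier() *deferredApplier {
	return &deferredApplier{touched: map[string]bool{}}
}

// recordEvent is called for each binlog row event while the row copy runs.
// Repeated changes to the same row collapse into a single map entry.
func (d *deferredApplier) recordEvent(pk string, isDelete bool) {
	d.touched[pk] = isDelete
}

// flush runs once the row copy is complete.
func (d *deferredApplier) flush() {
	for pk, deleted := range d.touched {
		if deleted {
			fmt.Printf("DELETE FROM _t_gho WHERE pk = %q\n", pk)
		} else {
			// Re-copy the current row from the source, overwriting any stale copy.
			fmt.Printf("REPLACE INTO _t_gho SELECT * FROM t WHERE pk = %q\n", pk)
		}
	}
}

func main() {
	d := newDeferredApplier()
	d.recordEvent("42", false) // insert
	d.recordEvent("42", false) // later update to the same row: still one entry
	d.recordEvent("7", true)   // delete
	d.flush()
}
```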
Resume from failure
I know there is a stale PR for this. It doesn't improve performance, but it's semi-related since some of our long-running DDLs fail. We also like to use daily pod-cycling on our k8s clusters, so having single processes that run for two weeks complicates our infra.
Improved ETA estimates

The current ETA estimator is based on `estimatedTime - elapsedTime` from the start of the copy. This skews poorly for larger tables, which become slower to insert into. As dynamic chunk sizes/throttling are introduced, it also doesn't respond well to changes. Ideally the estimate would evaluate how many rows are left to copy and compare that to how many rows were copied in the last few minutes.
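A minimal sketch of a rate-based ETA, in Go (the names and the window length are assumptions): keep recent progress samples, compute the copy rate over the last few minutes only, and project the remaining rows at that rate.

```go
// Sketch of an ETA based on the recent copy rate rather than total elapsed time.
package main

import (
	"fmt"
	"time"
)

type progressSample struct {
	at         time.Time
	rowsCopied int64
}

// etaFromRecentRate discards samples older than `window` and projects the
// remaining rows at the rate observed over that window.
func etaFromRecentRate(samples []progressSample, totalRows int64, window time.Duration) time.Duration {
	if len(samples) < 2 {
		return 0 // not enough data yet
	}
	newest := samples[len(samples)-1]
	oldest := samples[0]
	for _, s := range samples {
		if newest.at.Sub(s.at) <= window {
			oldest = s
			break
		}
	}
	elapsed := newest.at.Sub(oldest.at)
	copied := newest.rowsCopied - oldest.rowsCopied
	if elapsed <= 0 || copied <= 0 {
		return 0
	}
	rate := float64(copied) / elapsed.Seconds() // rows/sec over the recent window only
	remaining := totalRows - newest.rowsCopied
	return time.Duration(float64(remaining)/rate) * time.Second
}

func main() {
	now := time.Now()
	samples := []progressSample{
		{now.Add(-10 * time.Minute), 1_000_000},
		{now.Add(-2 * time.Minute), 1_800_000},
		{now, 2_000_000},
	}
	// 200k rows copied in the last 2 minutes, 3M rows remaining => ~30m ETA.
	fmt.Println("ETA:", etaFromRecentRate(samples, 5_000_000, 5*time.Minute))
}
```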
That's the raw idea list - there is a good chance we will be able to provide patches for some of these too, but I wanted to check in first so we can discuss. Maybe you have a few of your own ideas too? :-)