Horizontal scaling and high availability #362

chris13524 · 2024-02-13T16:58:10Z

Right now only 1 replica runs at a time. A second can run temporarily during deployment, but we should be running a minimum of 2 at all times in case one of them crashes. Furthermore, we should be able to scale horizontally with demand.

The problem with simply enabling multiple replicas is that the background jobs will then also run on every replica, which may or may not be OK depending on the job:

Watcher expiration job - Runs hourly and deletes any expired watchers from the database. Does not require HA, and it's not a big deal if several of them run at once, as the job only runs 1 query.
Relay renewal job - Runs daily and renews all topics. Does not require HA. This is resource intensive and only 1 should run at a time. Also, with "Renew topic subscriptions only when they need to be" (#325) the architecture will change a bit, and we may want a lock or some other mechanism to avoid multiple revisions renewing the same topic twice unnecessarily.
Publisher service - Runs continuously and publishes any messages that need to be published. Ideally has HA in case of a crash, but it is not critical if a notification is delayed a few minutes in rare circumstances, as notifications are often delayed anyway with large queue sizes. This service may be horizontally scaled with the number of replicas, but ideally it can be scaled independently in order to better control relay load and queue processing time.
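
To illustrate why the watcher expiration job tolerates running on several replicas, here is a minimal sketch of an idempotent delete. It uses an in-memory SQLite database purely for demonstration. The `watcher` table, its `expiry` column and the `delete_expired_watchers` helper are hypothetical names, not taken from the actual schema.

```python
import sqlite3


def delete_expired_watchers(conn: sqlite3.Connection, now: int) -> int:
    """Delete every watcher whose expiry is at or before `now`; return rows deleted."""
    cur = conn.execute("DELETE FROM watcher WHERE expiry <= ?", (now,))
    conn.commit()
    return cur.rowcount


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watcher (id INTEGER PRIMARY KEY, expiry INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO watcher (id, expiry) VALUES (?, ?)",
    [(1, 100), (2, 200), (3, 300)],
)

first = delete_expired_watchers(conn, now=200)   # one replica runs the job
second = delete_expired_watchers(conn, now=200)  # another replica runs the same job
remaining = [row[0] for row in conn.execute("SELECT id FROM watcher ORDER BY id")]
```

The second run finds nothing left to delete, so a duplicate run costs one extra query and does not change the result.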

Conclusion: we can enable horizontal scaling following the change to avoid multiple relay renewal jobs running at once.

This will be non-trivial and will require some type of lock. We may be able to implement a lock with Redis, but that is throw-away work once we do #325, so it may be preferable to do #325 now instead.
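
As a rough sketch of what such a lock could look like, the following models a lease-style lock with a time-to-live. `InMemoryLockStore` is a hypothetical, single-process stand-in for a shared store. With Redis, `set_if_absent` would correspond to `SET key value NX PX <ttl>`, and the owner-checked release would need to be done atomically (for example via a script). The key name and `run_renewal_once` are assumptions made for illustration, not existing code.

```python
import time
import uuid
from typing import Callable, Dict, List, Tuple


class InMemoryLockStore:
    """Hypothetical stand-in for a shared store such as Redis (single process only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._locks: Dict[str, Tuple[str, float]] = {}  # key -> (owner token, expiry time)

    def set_if_absent(self, key: str, token: str, ttl_seconds: float) -> bool:
        """Take the lock unless an unexpired holder exists."""
        now = self._clock()
        current = self._locks.get(key)
        if current is not None and current[1] > now:
            return False
        self._locks[key] = (token, now + ttl_seconds)
        return True

    def delete_if_owner(self, key: str, token: str) -> bool:
        """Release the lock only if `token` still owns it."""
        current = self._locks.get(key)
        if current is not None and current[0] == token:
            del self._locks[key]
            return True
        return False


def run_renewal_once(store: InMemoryLockStore, renew: Callable[[], None],
                     ttl_seconds: float = 3600.0) -> bool:
    """Run `renew` only if this replica obtains the lock; return whether it ran."""
    key = "relay_renewal_job"  # hypothetical lock name
    token = uuid.uuid4().hex   # unique per attempt, so only the holder can release
    if not store.set_if_absent(key, token, ttl_seconds):
        return False
    try:
        renew()
        return True
    finally:
        store.delete_if_owner(key, token)


store = InMemoryLockStore()
runs: List[str] = []
b_results: List[bool] = []


def renew_a() -> None:
    runs.append("a")
    # While replica A holds the lock, replica B's attempt is skipped.
    b_results.append(run_renewal_once(store, lambda: runs.append("b")))


a_ran = run_renewal_once(store, renew_a)
later_ran = run_renewal_once(store, lambda: runs.append("c"))  # lock was released
```

The TTL means a replica that crashes mid-renewal does not hold the lock forever, at the cost of a possible overlapping run if a renewal takes longer than the TTL.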

chris13524 · 2024-02-26T22:26:28Z

Conclusion: we can enable horizontal scaling following the change to avoid multiple relay renewal jobs running at once.

This assumption has changed because batch_subscribe is now cheap and publishes are not required. This means renew operations can potentially run in parallel, and it's not a big deal if they do.

chris13524 self-assigned this Feb 13, 2024

arein added the accepted label Feb 13, 2024

chris13524 mentioned this issue Feb 13, 2024

Renew topic subscriptions only when they need to be #325

Open

chris13524 assigned chris13524 and unassigned chris13524 Feb 19, 2024

chris13524 mentioned this issue Feb 26, 2024

fix: enable autoscaling #389

Merged

3 tasks

chris13524 closed this as completed in #389 Mar 6, 2024
