Horizontal scaling and high availability #362

Closed
chris13524 opened this issue Feb 13, 2024 · 1 comment · Fixed by #389

@chris13524
Member

chris13524 commented Feb 13, 2024

Right now there is 1 replica running at a time. A second can run temporarily during deployment, but we should be running a minimum of 2 at all times in case one of them crashes. Furthermore, we should be able to scale horizontally with demand.

The problem with simply enabling multiple replicas is that each background job would also run once per replica, which is fine for some jobs but not for others:

  • Watcher expiration job - Runs hourly and deletes any expired watchers from the database. Does not require HA, and it's not a big deal if several instances run at once, since each just runs 1 query.
  • Relay renewal job - Runs daily and renews all topics. Does not require HA. This is resource intensive and only 1 should run at a time. Also, with #325 ("Renew topic subscriptions only when they need to be") the architecture will change a bit, and we may want some locking ability or other mechanism to avoid multiple revisions unnecessarily renewing the same topic.
  • Publisher service - Runs continuously and publishes any messages that need to be published. Ideally has HA in case of a crash, but it is not critical if a notification is delayed a few minutes in rare circumstances, as notifications are often delayed anyway with large queue sizes. This service may be horizontally scaled with the number of replicas, but ideally it can be scaled independently to better control relay load and queue processing time.

Conclusion: we can enable horizontal scaling following the change to avoid multiple relay renewal jobs running at once.

This will be non-trivial and will require some type of lock. We may be able to implement a lock with Redis, but that would be throw-away work once we do #325, so it may be preferable to do #325 now instead.
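
As a rough illustration of the kind of lock mentioned above, here is a minimal sketch of a lease-style lock that lets only one replica run the relay renewal job at a time. It is not the project's implementation. `InMemoryStore`, `run_exclusively`, and the `relay-renewal` key are hypothetical names, and the store is an in-process stand-in that only models the semantics of Redis `SET key value NX EX ttl` plus a compare-and-delete on release (which in real Redis is typically done with a small Lua script).

```python
import time
import uuid


class InMemoryStore:
    """Hypothetical stand-in for Redis; models only set-if-absent with a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set_if_absent(self, key, value, ttl_seconds):
        # Mirrors `SET key value NX EX ttl`: succeeds only if no live key exists.
        now = self._clock()
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False  # another holder still has an unexpired lease
        self._data[key] = (value, now + ttl_seconds)
        return True

    def delete_if_equal(self, key, value):
        # Release only if we still own the lease, so we never delete a lock
        # that expired and was re-acquired by a different replica.
        current = self._data.get(key)
        if current is not None and current[0] == value:
            del self._data[key]
            return True
        return False


def run_exclusively(store, key, ttl_seconds, job):
    """Run `job` only if the lock is free; return whether it ran."""
    token = uuid.uuid4().hex  # unique per attempt, identifies this holder
    if not store.set_if_absent(key, token, ttl_seconds):
        return False  # some other replica is already running the job
    try:
        job()
    finally:
        store.delete_if_equal(key, token)
    return True
```

The TTL matters: if the replica holding the lock crashes mid-job, the lease expires on its own and a later run can proceed. It should be chosen longer than the job is expected to take, otherwise a second replica could start while the first is still working.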

@chris13524
Member Author

> Conclusion: we can enable horizontal scaling following the change to avoid multiple relay renewal jobs running at once.

This assumption has changed: batch_subscribe is now cheap and publishes are not required. Renew operations can therefore run in parallel across replicas, and it's not a big deal if they do.
