Skip to content

Conversation

@sbernauer
Copy link
Member

@sbernauer sbernauer commented Jan 15, 2025

Description

As decided in https://github.com/stackabletech/decisions/issues/38

After the implementation of stackabletech/issues#586, we now can decide on default lifetimes per role.
As for the implementation we sticked with the existent default of 24 hours, but now is the time where we can decide on the actual default lifetimes.
Changing them is trivial, the hard part is agreeing on the defaults ;)

The default lifetimes are a clear tradeoff.

Arguments favoring short lifetimes are:

  1. Security: Short-lived TLS certs reduce the impact of leaked private keys.
  2. Up-to-date config: Frequent Pod restarts guarantee that configs are up-to-date (and not e.g. a change in a ConfigMap is applied 14 days later after a Pod restart). However, conceptually restart-controller should(tm) prevent this anyway.

Arguments favoring longer lifetimes are:

  1. Data loss: Druid in particular tends to loose data on reboot of (at least the middle manager). I'm a Druid Noob (as we probably all are), but I don't see how I can tell a middle manager to flush it's data to deep storage. I actually already lost (test) data because of the schedule restarts we are talking about here.
  2. Some tools don't offer HA, e.g. the Trino coordinator. Restarting them causes all running queries to fail. New queries can not be submitted until the coordinator is healthy again.

In an ideal world all our apps would be cloud-native (whatever that means).
As an alternative, we would raise PRs at the tools to fix the data loss/availability problems, but given the authors/maintainers have not achieved that goal yet, that sounds very advanced.

Proposal

For most of the tools leave we can leave the default at 24h.

However, we need to make some exception for the ones that can not cope very well with restarts:

Product Role Default lifetime
Trino coordinator 15d
Druid middle manager 15d

The lifetime of 15 days was chosen, such that it is longer than the usual weekly (e.g. IONOS) or bi-weekly cluster (e.g. on-prem clusters) re-deployments.

Definition of Done Checklist

  • Not all of these items are applicable to all PRs, the author should update this template to only leave the boxes in that are relevant
  • Please make sure all these things are done and tick the boxes
# Author
- [ ] Changes are OpenShift compatible
- [ ] CRD changes approved
- [ ] CRD documentation for all fields, following the [style guide](https://docs.stackable.tech/home/nightly/contributor/docs/style-guide).
- [ ] Helm chart can be installed and deployed operator works
- [ ] Integration tests passed (for non trivial changes)
- [ ] Changes need to be "offline" compatible
# Reviewer
- [ ] Code contains useful comments
- [ ] Code contains useful logging statements
- [ ] (Integration-)Test cases added
- [ ] Documentation added or updated. Follows the [style guide](https://docs.stackable.tech/home/nightly/contributor/docs/style-guide).
- [ ] Changelog updated
- [ ] Cargo.toml only contains references to git tags (not specific commits or branches)
# Acceptance
- [ ] Feature Tracker has been updated
- [ ] Proper release label has been added
- [ ] [Roadmap](https://github.com/orgs/stackabletech/projects/25/views/1) has been updated

@sbernauer sbernauer self-assigned this Jan 15, 2025
@Maleware Maleware self-requested a review January 20, 2025 14:06
Copy link
Member

@Maleware Maleware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM up to the optional nit comment :) Thanks!

@sbernauer sbernauer requested a review from Maleware January 22, 2025 08:31
Copy link
Member

@Maleware Maleware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for your efforts :)

@sbernauer
Copy link
Member Author

Thanks for the review!

@sbernauer sbernauer added this pull request to the merge queue Jan 22, 2025
Merged via the queue into main with commit 92feb27 Jan 22, 2025
17 checks passed
@sbernauer sbernauer deleted the chore/coordinator-restarts branch January 22, 2025 09:54
@lfrancke
Copy link
Member

The decisions are not public. Can you please add a summary here?

@sbernauer
Copy link
Member Author

@lfrancke copied the decision into the description

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants