[WIP] reqs: project requirements #95

Draft
wants to merge 8 commits into main from silopolis/issue10
Conversation

silopolis (Contributor) commented Dec 28, 2023

Fixes #10

@silopolis silopolis self-assigned this Dec 28, 2023
* Application router
* High-availability by fault tolerance
* Load-balancing by requests distribution
ALB?
Contributor:

EKS uses ELB.

Contributor Author:

Yes, but which type? Application or Network LB?

gitkvark (Contributor), Dec 29, 2023:

I had to look into this. We're using an NLB, and we're actually setting it in our config for traefik:
service.beta.kubernetes.io/aws-load-balancer-type: nlb
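For reference, that annotation typically lands in the traefik Helm chart's service settings. A values sketch (assuming the upstream traefik chart layout):

```yaml
# values.yaml for the traefik Helm chart (sketch)
service:
  annotations:
    # Ask AWS for a Network Load Balancer instead of the classic ELB
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
```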

Contributor Author:

OK, thanks.
So we have ELB's NLB doing L3/L4 load balancing in front of our Traefik doing L7 route balancing, right?

Contributor:

Correct.


### Databases: PostgreSQL

* Helm chart
Contributor:

We're using Zalando's postgres-operator. The operator installs a CRD and watches the k8s cluster for manifests to initiate Postgres clusters.
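For illustration, a minimal cluster manifest of the kind the operator watches for might look like this (names and sizes illustrative, based on Zalando's examples):

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  teamId: "acid"
  numberOfInstances: 2
  volume:
    size: 1Gi
  postgresql:
    version: "15"
```

Applying such a manifest makes the operator create the StatefulSet, services and secrets for the cluster.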

Contributor Author:

OK... And the operator is installed using Helm, right?

Contributor:

Yes.

(Resolved review threads on docs/project/requirements/architecture.md)

#### NAT Gateways (NGW)

* Per AZ egress
Contributor:

In our setup there's only 1 NGW for the whole cluster, to save a bit on costs. (This is configurable.)

Contributor Author:

Looks like a SPOF to me, unless there's a failover mechanism to pop a NGW in another AZ in case of problems... Is there such a mechanism in place?

Contributor:

The documentation says:

> If single_nat_gateway = true, then all private subnets will route their Internet traffic through this single NAT gateway. The NAT gateway will be placed in the first public subnet in your public_subnets block.

This doesn't sound like automatic failover to me... It makes sense to move to one_nat_gateway_per_az = true.
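Assuming the VPC comes from the terraform-aws-modules/vpc module (which those variable names suggest), the change would be a sketch like:

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  # ... existing VPC settings ...

  enable_nat_gateway     = true
  single_nat_gateway     = false
  one_nat_gateway_per_az = true # one NGW per AZ, no cross-AZ SPOF
}
```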

Contributor:

This is done, code committed.


## Observability

### Log management (ELK/EFK)
Contributor:

We don't use ELK right now.

silopolis (Contributor Author), Dec 29, 2023:

Do we have some sort of log aggregation and management? Loki? Graylog? Bare rsyslog aggregation to start with?

I don't think this is a box that can be left unchecked.

JoAngel8 (Contributor), Dec 29, 2023:

  • Prometheus pulls the metrics and Grafana uses its database to display the data in a dashboard.
  • In the development and preprod environments, K9s is used to read the logs.
  • We also use the dashboard to follow our budget, which is limited to 80 USD for the whole project.

Contributor:

This is missing right now. I'll look into it.

Contributor Author:

Loki looks like a nice & cool NKOTB, but the fact that it doesn't index log contents makes it one of a kind I'd need to get to know...
I tend to favor EFK over ELK because Fluentd and Fluent Bit look lighter and leaner than that fat Java stash.
Finally, Graylog has always looked like a nice integrated, batteries-included solution.



### Metrics (Prometheus/Grafana)
Contributor:

We use metrics from the following sources:

  • Kubernetes (comes preinstalled with the kube-prometheus-stack Helm package)
  • the postgres exporter, which runs as a sidecar along our Postgres clusters
  • our FastAPI app, via the Instrumentator

Contributor Author:

Does kube-prometheus-stack collect metrics about nodes?

What would be the metrics monitoring black holes at this point?

  • EC2/nodes
  • ELB/Traefik?
  • EBS?

Contributor:

Metrics about the nodes are collected from inside the cluster. Additional monitoring from outside the cluster could be useful (e.g. EBS utilisation, k8s cluster health, etc.).
We have very little customisation on top of the default kube-prometheus-stack installation.

Contributor Author:

You mean that the kube monitoring stack doesn't watch k8s cluster health?!

I believe a solid list of metrics (families or targets) that are or should be implemented is in order, to show that we know where we have to keep our eyes.



### Event and alerting
Contributor:

Alerting is missing. We would need to add at least something.

Contributor Author:

What would we add if we had time? What would be the plan?
I believe it's more important to have an unimplemented plan than no plan at all ;)

Contributor:

  • Database master unplanned failover
  • CPU utilisation too high (per node and per critical pod)
  • Memory utilisation too high (per node and per critical pod)
  • Disk utilisation too high (EBS)
  • AZ lost
  • Database backup errors
  • Autoscaling limit hit (both EKS and pod autoscaling)
  • Too many pod restarts
  • Pods not getting scheduled due to resource issues
  • Application-specific issues (too many HTTP errors, etc.)

Just to name a few items.
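As a sketch, the CPU item could be expressed as a PrometheusRule that kube-prometheus-stack picks up (threshold and label selector illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-cpu-alerts
  labels:
    release: kube-prometheus-stack # must match the chart's ruleSelector
spec:
  groups:
    - name: node.cpu
      rules:
        - alert: NodeCPUHigh
          # Average non-idle CPU per node over 5 minutes
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node CPU above 90% for 10 minutes"
```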

silopolis (Contributor Author), Dec 29, 2023:

A few indeed ;)

Actually, I was wondering which alerting channel(s) and solution(s) we would implement.
Snail mail/e-mail/XMPP (remember?)/Slack/cloud signals/AWS SNS...?

Any nice contender spotted? I used to like Sensu, but I still have to update my tech watch on that point...


### TODO Recap HA features

### Backup
Contributor:

Our image is immutable.
Our database is backed up automatically, using the built-in feature of the postgres-operator.
There's a basebackup created every noon UTC, and the WAL is sent all through the day. The backup is stored in an S3 bucket in the US.

silopolis (Contributor Author), Dec 29, 2023:

What about:

  • Git repository backup
  • Images and other build artifacts (Helm charts, etc.) backup

So our Recovery Point Objective (RPO) is 1 day, right?

What's the backup location, retention and rotation policy?
3/2/1?
Grandfather-father-son?

Contributor:

Git backup: we looked into it, but have nothing ready at this point.
Artifacts backup: no plan at the moment.
Our recovery point is 1 day by default, plus we have WAL transfers every 16 MB (I think that's the default). We're using default values; we can be more specific if needed.
The backup is stored in an S3 bucket in the US.
The backup rotation keeps the last 5 backups at the moment. We might want to fine-tune it.

Contributor Author:

S3 looks like a very reliable backend, especially with CRR, and economically sound with Glacier transitioning.

It allows for implementation of the 3/2/1 rule:

  • 3 copies: easy one
  • 2 medias: S3 + Glacier Instant Retrieval archive
  • 1 "offsite": Glacier Flexible/Deep Archive

As well as the good old GFS (grandfather-father-son):

  • Son: daily to S3, rotated weekly/monthly
  • Father: weekly to Glacier Instant Retrieval, rotated monthly/quarterly
  • Grandfather: monthly to Glacier Flexible Retrieval, rotated bi-yearly/yearly

The oldest member of each generation could be transitioned to Deep Archive for an extended period of time.
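The Glacier transitions sketched above map directly onto an S3 lifecycle configuration, e.g. in Terraform (bucket name and day counts illustrative):

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "backups" {
  bucket = "example-db-backups" # illustrative bucket name

  rule {
    id     = "gfs-rotation"
    status = "Enabled"

    transition {
      days          = 7
      storage_class = "GLACIER_IR" # "father" tier
    }

    transition {
      days          = 30
      storage_class = "DEEP_ARCHIVE" # "grandfather" tier
    }
  }
}
```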

So, what about:



### Disaster Recovery
Contributor:

The database loads the latest backup automatically when launched with an empty dataset (i.e. during the initialization phase).

silopolis (Contributor Author), Dec 29, 2023:

That's a really nice feat!

So what's our Recovery Time Objective (RTO)?
How long does the production env take to build from scratch?
What's the procedure? How is it tested?

gitkvark (Contributor), Dec 29, 2023:

The whole stack comes online in about 20 minutes (including infra and pods). As our DB is small, the restore is fast. The data transfer happens within AWS's own network, so the restore should be reasonably fast even with a larger dataset -- though beyond a GB or so I'd do further testing.
Yes, it's tested: we have repeatedly started and stopped the infra, and it comes back online automatically.

Contributor Author:

All nice and sound!

Further on testing, how would we implement automated testing of our backups?
Some kind of "restore" env alongside [pre]prod, in another region? Or should we simply live-test on the DR site?

This leads me to our DR strategy... Are we doing a cold or warm site?

@silopolis silopolis changed the title WIP: requirements [WIP] reqs: project requirements Dec 28, 2023
@silopolis silopolis force-pushed the silopolis/issue10 branch 6 times, most recently from 9d7eb2a to a60d4ae Compare December 29, 2023 20:21
Create design section
Move requirements from project to design section
Create sub-section for user stories to split them from requirements
Create architecture and specifications sub-sections
Successfully merging this pull request may close these issues.

reqs: project requirements
3 participants