[WIP] reqs: project requirements #95

Draft
wants to merge 8 commits into main from silopolis/issue10
Conversation

silopolis (Contributor) commented Dec 28, 2023

Fixes #10

@silopolis silopolis self-assigned this Dec 28, 2023
* Application router
* High-availability by fault tolerance
* Load-balancing by requests distribution
ALB?
Contributor:

EKS uses ELB.

Contributor Author:

Yes, but which type? Application or Network LB?

gitkvark (Contributor), Dec 29, 2023:

I had to look into this. We're using an NLB, and we're actually setting it in our config for traefik:
service.beta.kubernetes.io/aws-load-balancer-type: nlb
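For reference, that annotation typically lands in the traefik Helm chart's service settings. A values sketch (assuming the upstream traefik chart layout):

```yaml
# values.yaml for the traefik Helm chart (sketch)
service:
  annotations:
    # Ask AWS for a Network Load Balancer instead of the classic ELB
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
```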

Contributor Author:

OK, thanks.
So we have ELB's NLB doing L3/L4 load balancing in front of our Traefik doing L7 route balancing, right?

Contributor:

Correct.


### Databases: PostgreSQL

* Helm chart
Contributor:

We're using Zalando's postgres-operator. The operator installs a CRD and watches the k8s cluster for manifests to initiate Postgres clusters.
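For illustration, a minimal cluster manifest of the kind the operator watches for might look like this (names and sizes illustrative, based on Zalando's examples):

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  teamId: "acid"
  numberOfInstances: 2
  volume:
    size: 1Gi
  postgresql:
    version: "15"
```

Applying such a manifest makes the operator create the StatefulSet, services and secrets for the cluster.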

Contributor Author:

OK... And the operator is installed using Helm, right?

Contributor:

Yes.

(Resolved review threads on docs/project/requirements/architecture.md)

#### NAT Gateways (NGW)

* Per AZ egress
Contributor:

In our setup there's only 1 NGW for the whole cluster, to save a bit on costs. (This is configurable.)

Contributor Author:

Looks like a SPOF to me, unless there's a failover mechanism to pop a NGW in another AZ in case of problems... Is there such a mechanism in place?

Contributor:

The documentation says:

> If single_nat_gateway = true, then all private subnets will route their Internet traffic through this single NAT gateway. The NAT gateway will be placed in the first public subnet in your public_subnets block.

This doesn't sound like automatic failover to me... It makes sense to move to one_nat_gateway_per_az = true.
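Assuming the VPC comes from the terraform-aws-modules/vpc module (which those variable names suggest), the change would be a sketch like:

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  # ... existing VPC settings ...

  enable_nat_gateway     = true
  single_nat_gateway     = false
  one_nat_gateway_per_az = true # one NGW per AZ, no cross-AZ SPOF
}
```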

Contributor:

This is done, code committed.


## Observability

### Log management (ELK/EFK)
Contributor:

We don't use ELK right now.

silopolis (Contributor Author), Dec 29, 2023:

Do we have some sort of log aggregation and management? Loki? Graylog? Bare rsyslog aggregation to start with?

I don't think this is a box that can be left unchecked.

JoAngel8 (Contributor), Dec 29, 2023:

  • Prometheus pulls the metrics and Grafana uses its database to display the data in a dashboard.
  • In the development and preprod environments, K9s is used to read the logs.
  • We also use the dashboard to follow our budget, which is limited to 80 USD for the whole project.

Contributor:

This is missing right now. I'll look into it.

Contributor Author:

Loki looks like a nice & cool NKOTB, but the fact that it doesn't index log contents makes it one of a kind I'd need to get to know...
I tend to favor EFK over ELK because Fluentd and Fluent Bit look lighter and leaner than that fat Java stash.
Finally, Graylog has always looked like a nice integrated, batteries-included solution.



### Metrics (Prometheus/Grafana)
Contributor:

We use metrics from the following sources:

  • Kubernetes (comes preinstalled with the kube-prometheus-stack Helm package)
  • the postgres exporter, which runs as a sidecar along our Postgres clusters
  • our FastAPI app, via the Instrumentator

Contributor Author:

Does kube-prometheus-stack collect metrics about nodes?

What would be the metrics monitoring black holes at this point?

  • EC2/nodes
  • ELB/Traefik?
  • EBS?

Contributor:

Metrics about the nodes are collected from inside the cluster. Additional monitoring from outside the cluster could be useful (e.g. EBS utilisation, k8s cluster health, etc.).
We have very little customisation on top of the default kube-prometheus-stack installation.

Contributor Author:

You mean that the kube monitoring stack doesn't watch k8s cluster health?!

I believe a solid list of metrics (families or targets) that are or should be implemented is in order, to show that we know where we have to keep our eyes.



### Event and alerting
Contributor:

Alerting is missing. We would need to add at least something.

Contributor Author:

What would we add if we had time? What would be the plan?
I believe it's more important to have an unimplemented plan than no plan at all ;)

Contributor:

  • Database master unplanned failover
  • CPU utilisation too high (per node and per critical pod)
  • Memory utilisation too high (per node and per critical pod)
  • Disk utilisation too high (EBS)
  • AZ lost
  • Database backup errors
  • Autoscaling limit hit (both EKS and pod autoscaling)
  • Too many pod restarts
  • Pods not getting scheduled due to resource issues
  • Application-specific issues (too many HTTP errors, etc.)

Just to name a few items.
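As a sketch, the CPU item could be expressed as a PrometheusRule that kube-prometheus-stack picks up (threshold and label selector illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-cpu-alerts
  labels:
    release: kube-prometheus-stack # must match the chart's ruleSelector
spec:
  groups:
    - name: node.cpu
      rules:
        - alert: NodeCPUHigh
          # Average non-idle CPU per node over 5 minutes
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node CPU above 90% for 10 minutes"
```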

silopolis (Contributor Author), Dec 29, 2023:

A few indeed ;)

Actually, I was wondering which alerting channel(s) and solution(s) we would implement.
Snail mail/e-mail/XMPP (remember?)/Slack/cloud signals/AWS SNS...?

Any nice contender spotted? I used to like Sensu, but I still have to update my tech watch on that point...


### TODO Recap HA features

### Backup
Contributor:

Our image is immutable.
Our database is backed up automatically, using the built-in feature of the postgres-operator.
There's a basebackup created every noon UTC, and the WAL is sent all through the day. The backup is stored in an S3 bucket in the US.

silopolis (Contributor Author), Dec 29, 2023:

What about:

  • Git repository backup
  • Images and other build artifacts (Helm charts, etc.) backup

So our Recovery Point Objective (RPO) is 1 day, right?

What's the backup location, retention and rotation policy?
3/2/1?
Grandfather-father-son?

Contributor:

Git backup: we looked into it, but have nothing ready at this point.
Artifacts backup: no plan at the moment.
Our recovery point is 1 day by default, plus we have WAL transfers every 16 MB (I think that's the default). We're using default values; we can be more specific if needed.
The backup is stored in an S3 bucket in the US.
The backup rotation keeps the last 5 backups at the moment. We might want to fine-tune it.

Contributor Author:

S3 looks like a very reliable backend, especially with CRR, and economically sound with Glacier transitioning.

It allows for implementation of the 3/2/1 rule:

  • 3 copies: easy one
  • 2 medias: S3 + Glacier Instant Retrieval archive
  • 1 "offsite": Glacier Flexible/Deep Archive

As well as the good old GFS (grandfather-father-son):

  • Son: daily to S3, rotated weekly/monthly
  • Father: weekly to Glacier Instant Retrieval, rotated monthly/quarterly
  • Grandfather: monthly to Glacier Flexible Retrieval, rotated bi-yearly/yearly

The oldest member of each generation could be transitioned to Deep Archive for an extended period of time.
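The Glacier transitions sketched above map directly onto an S3 lifecycle configuration, e.g. in Terraform (bucket name and day counts illustrative):

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "backups" {
  bucket = "example-db-backups" # illustrative bucket name

  rule {
    id     = "gfs-rotation"
    status = "Enabled"

    transition {
      days          = 7
      storage_class = "GLACIER_IR" # "father" tier
    }

    transition {
      days          = 30
      storage_class = "DEEP_ARCHIVE" # "grandfather" tier
    }
  }
}
```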

So, what about:



### Disaster Recovery
Contributor:

The database loads the latest backup automatically when launched with an empty dataset (i.e. during the initialization phase).

silopolis (Contributor Author), Dec 29, 2023:

That's a really nice feat!

So what's our Recovery Time Objective (RTO)?
How long does the production env take to build from scratch?
What's the procedure? How is it tested?

gitkvark (Contributor), Dec 29, 2023:

The whole stack comes online in about 20 minutes (including infra and pods). As our DB is small, the restore is fast. The data transfer happens within AWS's own network, so the restore should be reasonably fast even with a larger dataset -- though beyond a GB or so I'd do further testing.
Yes, it's tested: we have repeatedly started and stopped the infra, and it comes back online automatically.

Contributor Author:

All nice and sound!

Further on testing, how would we implement automated testing of our backups?
Some kind of "restore" env alongside [pre]prod, in another region? Or should we simply live-test on the DR site?

This leads me to our DR strategy... Are we doing a cold or warm site?

@silopolis silopolis changed the title WIP: requirements [WIP] reqs: project requirements Dec 28, 2023
@silopolis silopolis force-pushed the silopolis/issue10 branch 6 times, most recently from 9d7eb2a to a60d4ae Compare December 29, 2023 20:21
Create design section
Move requirements from project to design section
Create sub-section for user stories to split them from requirements
Create architecture and specifications sub-sections
Successfully merging this pull request may close these issues.

reqs: project requirements
3 participants