Update disaster recovery guidance #198

Merged 1 commit on Jun 28, 2024

RMcVelia
Contributor

@RMcVelia RMcVelia requested a review from a team as a code owner June 26, 2024 09:35

github-actions bot commented Jun 26, 2024

Review app https://technical-guidance-198.test.teacherservices.cloud was deleted

@RMcVelia
Contributor Author

@@ -7,7 +7,7 @@ weight: 40

<%= partial('partials/page_toc') %>

-This document is intended to list technical risks to our digital services and the mitigations we have in place.
+This document is intended to list technical risks to our digital services and the mitigations we have in place.<br/><br/>For any issue affecting a service (or services) always [check if there are any dependant services](https://educationgovuk.sharepoint.com/sites/teacher-services-infrastructure/SitePages/Teacher-services-dependencies.aspx) that may also be affected.

Suggested change
This document is intended to list technical risks to our digital services and the mitigations we have in place.<br/><br/>For any issue affecting a service (or services) always [check if there are any dependant services](https://educationgovuk.sharepoint.com/sites/teacher-services-infrastructure/SitePages/Teacher-services-dependencies.aspx) that may also be affected.
This document is intended to list technical risks to our digital services and the mitigations we have in place.<br/><br/>For any issue affecting a service (or services) always [check if there are any dependent services](https://educationgovuk.sharepoint.com/sites/teacher-services-infrastructure/SitePages/Teacher-services-dependencies.aspx) that may also be affected.

Contributor Author

fixed

|Impact|Applications may be unavailable. Data may be lost.|
|Prevention|Approved PIM request required for production Azure access.<br/>Pull Requests require at least 1 approval.<br/>Soft delete and versioning enabled for key vaults and storage accounts.<br/>Azure resource locks placed on important resources [Azure locks](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/lock-resources?tabs=json). |
|Detection|Endpoint monitoring may point to a healthcheck page that is now failing. Or smoke tests running in production may detect it.|
|Remediation|Recovery dependant on the resource deleted, either restore correct version or redeploy and restore data from backup|

Suggested change
|Remediation|Recovery dependant on the resource deleted, either restore correct version or redeploy and restore data from backup|
|Remediation|Recovery dependent on the resource deleted, either restore correct version or redeploy and restore data from backup|

sorry! a dependant is a person who is dependent on someone

Contributor Author

thanks, and fixed

## Loss of Azure/AWS availability zone
We deploy to the UK South or West Europe regions which have 3 separate availability zones (AZ). It may happen that one of them is unavailable: either network, compute or storage services are affected.

|||
|-|-|
|Impact|Applications may be slow or unavailable|
-|Prevention|Applications should be built with failure in mind: deploy multiple application instances and deploy databases in cluster mode. Spread them across multiple AZs for high availability.<br/>Our AKS clusters are spread across 3 AZs. Scale applications to more than 1 replicas and enable zone redundancy.|
+|Prevention|Applications should be built with failure in mind: AKS clusters should be configured with nodes spread across multiple AZS. AKS Deployments should use a zone topology spread constraint. Scale applications to more than 1 replica to enable zone redundancy.<br/>Azure storage accounts should be ZRS or GZRS if zone redundancy is required (automatic and manual failover).<br/>Azure key vault uses replication within region and to a paired region (automatic failover).<br/>Azure Postgres can be configured zone redundant, with the active/standby instances in different zones.<br/>Azure Redis can be zone redundant only if using the Premium SKU.<br/>Postgres and Redis utilise automatic failover. Postgres can also be failed over manually.<br/>Cluster Public IP addresses should be configured with zone redundancy [Azure PIP redundancy](https://learn.microsoft.com/en-us/azure/virtual-network/ip-services/public-ip-addresses#availability-zone)|

Suggested change
|Prevention|Applications should be built with failure in mind: AKS clusters should be configured with nodes spread across multiple AZS. AKS Deployments should use a zone topology spread constraint. Scale applications to more than 1 replica to enable zone redundancy.<br/>Azure storage accounts should be ZRS or GZRS if zone redundancy is required (automatic and manual failover).<br/>Azure key vault uses replication within region and to a paired region (automatic failover).<br/>Azure Postgres can be configured zone redundant, with the active/standby instances in different zones.<br/>Azure Redis can be zone redundant only if using the Premium SKU.<br/>Postgres and Redis utilise automatic failover. Postgres can also be failed over manually.<br/>Cluster Public IP addresses should be configured with zone redundancy [Azure PIP redundancy](https://learn.microsoft.com/en-us/azure/virtual-network/ip-services/public-ip-addresses#availability-zone)|
|Prevention|Applications should be built with failure in mind: AKS clusters should be configured with nodes spread across multiple AZs. AKS Deployments should use a zone topology spread constraint. Scale applications to more than 1 replica to enable zone redundancy.<br/>Azure storage accounts should be ZRS or GZRS if zone redundancy is required (automatic and manual failover).<br/>Azure key vault uses replication within region and to a paired region (automatic failover).<br/>Azure Postgres can be configured zone redundant, with the active/standby instances in different zones.<br/>Azure Redis can be zone redundant only if using the Premium SKU.<br/>Postgres and Redis utilise automatic failover. Postgres can also be failed over manually.<br/>Cluster Public IP addresses should be configured with zone redundancy [Azure PIP redundancy](https://learn.microsoft.com/en-us/azure/virtual-network/ip-services/public-ip-addresses#availability-zone)|

Contributor Author

fixed
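
As an illustration of the "zone topology spread constraint" named in the Prevention row above, here is a minimal sketch, not the configuration this project uses. It builds a Kubernetes Deployment manifest as a Python dictionary, with more than 1 replica and a spread constraint keyed on the standard `topology.kubernetes.io/zone` node label. The function name, app name and image are hypothetical. The choice of `ScheduleAnyway` is an assumption: it treats the spread as a preference so that replacement pods can still be scheduled while one zone is down.

```python
import json


def zone_spread_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Return a Deployment manifest whose pods prefer an even spread across zones."""
    if replicas < 2:
        # A single replica can only ever live in one zone.
        raise ValueError("zone redundancy needs more than 1 replica")
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "topologySpreadConstraints": [
                        {
                            # Allow at most 1 pod of difference between zones.
                            "maxSkew": 1,
                            "topologyKey": "topology.kubernetes.io/zone",
                            # Assumption: soft constraint, see lead-in.
                            "whenUnsatisfiable": "ScheduleAnyway",
                            "labelSelector": {"matchLabels": labels},
                        }
                    ],
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical app name and image, for illustration only.
    manifest = zone_spread_deployment("example-app", "example/app:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML. Switching `whenUnsatisfiable` to `DoNotSchedule` would make the spread a hard requirement instead.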

@@ -65,7 +75,7 @@ In some rare cases, an entire region might become unavailable.
|||
|-|-|
|Impact|Applications may be unavailable|
-|Prevention|For critical applications, it is possible to deploy to 2 different regions, synchronise the data, configure a DNS based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up.|
+|Prevention|For critical applications, it is possible to deploy to 2 different regions, synchronise the data, configure a DNS based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up.<br/>Production Postgres backups are kept in a GRS Azure storage account, which maintains copies of data in a separate reqion. Any critcal application data kept in a storage account should be GRS/GZRS.<br/>Azure Key Vault maintains a copy of the contents in another region.<br/>For storage accounts and key vaults, failover is automatic and transparent. Storage accounts also support manual failover. |

Suggested change
|Prevention|For critical applications, it is possible to deploy to 2 different regions, synchronise the data, configure a DNS based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up.<br/>Production Postgres backups are kept in a GRS Azure storage account, which maintains copies of data in a separate reqion. Any critcal application data kept in a storage account should be GRS/GZRS.<br/>Azure Key Vault maintains a copy of the contents in another region.<br/>For storage accounts and key vaults, failover is automatic and transparent. Storage accounts also support manual failover. |
|Prevention|For critical applications, it is possible to deploy to 2 different regions, synchronise the data, configure a DNS based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up.<br/>Production Postgres backups are kept in a GRS Azure storage account, which maintains copies of data in a separate region. Any critical application data kept in a storage account should be GRS/GZRS.<br/>Azure Key Vault maintains a copy of the contents in another region.<br/>For storage accounts and key vaults, failover is automatic and transparent. Storage accounts also support manual failover. |

Contributor Author

fixed
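
The storage guidance in the two Prevention rows (ZRS or GZRS where zone redundancy is required, GRS or GZRS for critical data that must survive a region loss) can be expressed as a small lookup. This is a sketch under stated assumptions rather than an existing tool: the function name and the account names in the example inventory are hypothetical, and the table only covers the four replication options mentioned in this pull request plus LRS as the non-redundant baseline.

```python
# Failure scopes covered by each Azure storage replication option, following
# the guidance above: ZRS/GZRS cover a zone loss, GRS/GZRS a region loss.
REDUNDANCY = {
    "LRS": {"zone": False, "region": False},
    "ZRS": {"zone": True, "region": False},
    "GRS": {"zone": False, "region": True},
    "GZRS": {"zone": True, "region": True},
}


def redundancy_gaps(replication: str, needs_zone: bool, needs_region: bool) -> list[str]:
    """Return the failure scopes ('zone', 'region') the option does not cover."""
    offered = REDUNDANCY[replication.upper()]
    gaps = []
    if needs_zone and not offered["zone"]:
        gaps.append("zone")
    if needs_region and not offered["region"]:
        gaps.append("region")
    return gaps


if __name__ == "__main__":
    # Hypothetical inventory: (account name, replication, needs zone, needs region).
    accounts = [
        ("examplebackups", "GRS", False, True),
        ("exampleuploads", "LRS", True, True),
    ]
    for name, replication, needs_zone, needs_region in accounts:
        gaps = redundancy_gaps(replication, needs_zone, needs_region)
        status = "ok" if not gaps else "missing " + ", ".join(gaps)
        print(f"{name}: {replication} -> {status}")
```

A check like this could run against an exported list of storage accounts to flag those whose replication setting does not match the redundancy the service needs.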

@RMcVelia RMcVelia merged commit ee1c400 into master Jun 28, 2024
2 checks passed
@RMcVelia RMcVelia deleted the 1860-dr-update-tech-guidance branch June 28, 2024 09:50
4 participants