
Zero downtime deployment


Introduction

Background

When we talk about “full-stack development”, we’re talking about the complete solution stack, frontend and backend, where each of those stacks may comprise multiple technologies or pieces of software. In FECfile+'s case, for example, the frontend includes CloudFront, the proxy, and the web app, and the backend includes the APIs and the database.

When we think about performing a deploy, the simplest case is when only a single component in one stack is changed. For example:

  • We deploy an update to adjust some text in the web app, with no related change needed to the proxy or to anything in the backend.
  • We deploy an update to refactor API log message output that does not require or depend on changes to the database or to anything in the frontend.

Deploys are rarely so simple, though, particularly when they involve database migrations.

  • When we want to add a new feature, it will often involve database schema changes, maybe a new API, and corresponding frontend UI components.
  • When we change how a backend API works, we may have to adjust the frontend UI that utilizes that API.
  • An improvement to UI/UX may require a related adjustment to the APIs utilized.

Deploying changes to multiple components within a single stack has greater complexity than deploying a change to a single component, and performing a deploy that spans multiple components across the full stack is even more complex. With that complexity comes potential pitfalls and brittleness if not managed proactively.

Business Need

We have a goal of consistently achieving zero downtime deployments. These are deploys that do not cause or require any downtime and that are invisible to users. Apart from the obvious technical good, this is desirable for business reasons:

  • Enhanced UX: Users expect consistent and uninterrupted service; outages can cause frustration and loss of trust.
  • Operational Efficiency: If the deploy plan includes an outage, it will be scheduled for off-peak hours, like late at night. Not only does this mean coordinating adjusted developer hours, but late-night, out-of-band windows are also the worst times to perform important system changes from the standpoint of developer performance.
  • Agility: The ability to deploy without causing downtime means you are less constrained in when you can deploy and therefore you can deploy more often and respond faster to customer needs.
  • Compliance: There may be statutory or SLA requirements limiting permissible downtime.
  • Team Morale: Less risk of disruption means greater developer confidence and less timidity, which encourages attempting greater leaps.

ZDD in Abstract

What does Zero Downtime Deployment look like? To return to the simplest case, say that we’re deploying an update to one component, like some text in the web app or the formatting of a log message. During the deploy, one user sees the previous UI and, a moment later, once the deploy completes, the next user to load the page sees the new one; similarly, one API call produces a log message in the prior format and the next call, logged after the deploy completes, produces one in the new format. There is no interruption of service.

Once we have to coordinate deploying multiple components, we have to plan ahead to minimize disruption. For example, to deploy a completely new feature we might first deploy the backend necessary to support it and only then, once it has been deployed and tested, deploy the frontend update. Deploying something new to the backend, before it can be used, is invisible to the users; the reverse would obviously make no sense -- you wouldn’t present users with a new UI element before the backend change was in place to support it.

Database migrations: the complicating factor

Another factor to consider is how easy it is to roll back a deployment. If a deploy fails and causes disruption, or a deployed feature is later found to cause some issue, there needs to be a way to roll it back quickly without further disruption. Generally speaking, the complexity of deploying code-only changes to multiple components, or rolling back such a deployment, is negligible in comparison to that of deployments that include database migrations. It’s quick and easy to replace one set of files with another; it’s more laborious to alter a database schema, particularly when it already contains data, and even more so to roll back such a database change.

For this reason, prior work to outline impediments to zero downtime has focused on migrations: identifying potential issues arising from migrations or from a need to roll back a migration, and how best to avoid or mitigate those issues. This essential work has led to improvements in our automations and processes.

This document has a broader scope and approaches the issue from a different angle, seeking instead to codify current practices and unspoken developer wisdom by classifying every possible type of deployment and outlining the best practices necessary to avoid downtime for each, moving approximately from least to most complex. The hope is that this will serve both as a comprehensive “recipe book” for deployments and, by codifying the conventional wisdom, surface any unexamined differences of opinion on best practices for discussion and collective improvement.

Note that there will be references in the following section to tagging an object for deletion. We currently do this through the django-deprecate-fields package; further ideas about how to approach it are addressed in the “Ideas toward Further Improvements” section.
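For illustration, here is a minimal sketch of the current approach, assuming a hypothetical Contact model whose telephone column is being retired in favor of phone_number (the model and field names are placeholders):

```python
from django.db import models
from django_deprecate_fields import deprecate_field


class Contact(models.Model):
    phone_number = models.CharField(max_length=20)
    # Tagged for deletion: django-deprecate-fields makes the column nullable
    # and reads of this attribute return None, while a later migration (with
    # the linter told to ignore it) drops the column itself.
    telephone = deprecate_field(models.CharField(max_length=20))
```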

Deployment Recipes

Single-component Deploys

(unavoidably single-stack)

Non-migration deploys

e.g. new or updated APIs, app code, environment variables, etc., that neither depend on nor are depended upon by another component

Release: Deploy new version.

Rollback: Deploy previous version.

Migrations

Reversible database migrations

Release: Run migration by deploying the latest code.

Rollback: Deploy previous code, revert migration.
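A minimal sketch of such a reversible migration, assuming a hypothetical Contact model gaining a nullable notes field (app, model, and migration names are placeholders). Django can reverse AddField automatically, so rolling back is a matter of deploying the previous code and migrating backwards (e.g. `python manage.py migrate contacts 0041`):

```python
from django.db import migrations, models


class Migration(migrations.Migration):
    # Hypothetical dependency on the app's previous migration.
    dependencies = [("contacts", "0041_previous")]

    operations = [
        # AddField is automatically reversible: migrating backwards drops
        # the column again.
        migrations.AddField(
            model_name="contact",
            name="notes",
            field=models.TextField(blank=True, default=""),
        ),
    ]
```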

Non-reversible database migration: rename column or table

Release:

  • Add a new column/table to the database with the new name.
  • Copy data from old to new.
  • Tag the old one for deletion.

Rollback:

  • Add a new column/table to the database with the previous name.
  • Copy data from current to new.
  • Tag the current one for deletion.

Column renames are not reversible and so would trigger complaints from the django-migration-linter. By splitting the operation into multiple steps we can avoid that and keep the change reversible.
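A minimal sketch of the release steps, assuming a hypothetical Contact.telephone column being renamed to phone_number (all names are placeholders); the old field is then tagged for deletion on the model with django-deprecate-fields rather than dropped here:

```python
from django.db import migrations, models


def copy_forward(apps, schema_editor):
    # Step 2: copy data from the old column to the new one in bulk.
    Contact = apps.get_model("contacts", "Contact")
    Contact.objects.update(phone_number=models.F("telephone"))


class Migration(migrations.Migration):
    dependencies = [("contacts", "0041_previous")]  # hypothetical

    operations = [
        # Step 1: add a column under the new name (reversible on its own).
        migrations.AddField(
            model_name="contact",
            name="phone_number",
            field=models.CharField(max_length=20, null=True),
        ),
        # Step 2: copy the data; reversing is a no-op since the old column
        # still holds the original values.
        migrations.RunPython(copy_forward, migrations.RunPython.noop),
        # Step 3 (tagging the old column for deletion) happens on the model
        # via deprecate_field, not in this migration.
    ]
```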

Non-reversible database migration: drop column or table

A column or table in the database might be empty because it is not in use yet (e.g. the API that would utilize it is not yet deployed) or it could contain data because it is used internally within the database by triggers or other tables.

Empty

Given that the column or table is empty, the inability to reverse these operations is moot.

Release: Deploy code telling the django-migration-linter to ignore the drop and then perform it.

Rollback: Deploy previous code and re-run the previous migration that created the column/table.
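A minimal sketch of the release migration, assuming a hypothetical, already-empty Contact.telephone column that was previously tagged for deletion (names are placeholders):

```python
from django.db import migrations
from django_migration_linter import IgnoreMigration


class Migration(migrations.Migration):
    dependencies = [("contacts", "0042_copy_telephone_to_phone_number")]  # hypothetical

    operations = [
        # Tell the django-migration-linter not to flag this destructive drop.
        IgnoreMigration(),
        # Drop the unused column.
        migrations.RemoveField(model_name="contact", name="telephone"),
    ]
```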

Non-empty

Dropping data is easy, but what if it was a mistake and we want to restore it? If the data came from an external source then we can hopefully re-run the process that loaded it originally. Otherwise the only thing we can do is restore from backup.

Release: Tag it for deletion.

Rollback: Re-load or restore from backup.

Non-reversible database migration: add a NOT NULL constraint

  • Add a CHECK constraint to ensure values are NOT NULL but keep it as NOT VALID.
  • Validate the constraint, ensuring all records are compliant.
  • Alter the column to set it as NOT NULL, utilizing the existing CHECK constraint to bypass full-table validation.
  • Drop the CHECK constraint if desired.
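A minimal sketch of those four steps as a single migration, assuming a hypothetical reports_report.filed_at column and PostgreSQL 12 or later (which can use a validated CHECK constraint to skip the full-table scan when setting NOT NULL); table, column, and constraint names are placeholders:

```python
from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [("reports", "0051_previous")]  # hypothetical

    operations = [
        # 1. Add the constraint without scanning or rewriting existing rows.
        migrations.RunSQL(
            sql="ALTER TABLE reports_report ADD CONSTRAINT filed_at_not_null "
                "CHECK (filed_at IS NOT NULL) NOT VALID;",
            reverse_sql="ALTER TABLE reports_report "
                        "DROP CONSTRAINT IF EXISTS filed_at_not_null;",
        ),
        # 2. Validate existing rows without blocking reads and writes.
        migrations.RunSQL(
            sql="ALTER TABLE reports_report VALIDATE CONSTRAINT filed_at_not_null;",
            reverse_sql=migrations.RunSQL.noop,
        ),
        # 3. Set NOT NULL; the validated CHECK constraint lets PostgreSQL
        #    skip the full-table scan.
        migrations.RunSQL(
            sql="ALTER TABLE reports_report ALTER COLUMN filed_at SET NOT NULL;",
            reverse_sql="ALTER TABLE reports_report ALTER COLUMN filed_at DROP NOT NULL;",
        ),
        # 4. Drop the now-redundant CHECK constraint, if desired.
        migrations.RunSQL(
            sql="ALTER TABLE reports_report DROP CONSTRAINT filed_at_not_null;",
            reverse_sql=migrations.RunSQL.noop,
        ),
    ]
```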

Non-reversible database migration: other column changes

Includes:

  • altering a column
  • adding a unique constraint to a column

Quick and simple

Release: Deploy code telling the django-migration-linter to ignore the database operation and then perform it.

Rollback: Deploy code telling the django-migration-linter to ignore the reversing database operation and then perform it.

More complex but more reversible?

Release:

  • Rename the existing column (as described above)
  • Invoke the deletion of the original immediately (no waiting period)
  • Create a new column with the same name and the new definition
  • Copy data from the renamed column to the new column
  • Tag the renamed column for deletion

Rollback: Restore column from backup created when the deletion was invoked.

Multi-component, Single Stack Deploys

Change within frontend: web app and API proxy

Though we do not appear to be doing so now, we might one day wish to add some manner of rewrite/redirect in the API proxy that would take the request from the web app and alter or augment the API call made. Why, in this hypothetical case where we are also deploying a change to the web app, we would not simply have the web app make the desired complete API call in the first place is left as an exercise for the reader. For the sake of completeness we address it.

Release: Deploy the proxy change, then the app change.

Rollback: Deploy the previous version of the app, then the previous version of the proxy.

Change within backend: API and database

Often a change to the API will go hand-in-hand with a related/supporting database change. In the ideal scenario our migrations and deployments would be effectively instantaneous, reversible, and failure-proof so we would do the following.

Release: Migrate the database, then deploy the updated API code.

Rollback: Revert the database migration, then deploy the previous API code.

Reality is not so tidy.

  • Deployments are not instantaneous. There will always be a period of flux where old and new code or schemas exist simultaneously.
  • If you migrate the database such that the new schema is incompatible with the current API, then at best you have downtime for however long it takes to deploy the new API; at worst, should the deployment of either component fail, you have downtime for as long as it takes to either roll back the change or fix and redeploy it.
  • Database migrations may lock tables during migration. Sometimes migrations may run long. While the tables are locked there will be errors and/or unexpected behavior if the app attempts to write to those tables. A failed or aborted database migration could leave tables locked following the migration.

When dealing with database migrations, everything above in Single-component Deploys > Migrations applies but with the addition of an API deploy.

[Example] Non-reversible database change: rename column or table

Release:

  • Add a column/table with the new name.
  • Copy the existing values over from old to new.
  • Deploy API update to use the new column/table name.
  • Tag the old one for deletion.

Rollback:

  • Add a column/table with the old name.
  • Copy the existing values over from current to old.
  • Deploy API update to use the old name.
  • Tag the current one for deletion.

Other deployments will follow the same pattern, with the API deploy happening after the migration completes.

Full-Stack Deploys

(unavoidably multi-component)

New feature

If it’s a completely new feature then it should be simple, as the new code has no dependencies on any component deployed outside of this deploy.

Release: Deploy backend, then deploy frontend.

Rollback: Deploy previous version of frontend, deploy previous version of backend.

Feature change

More likely we’re updating a feature in a way that involves a change to both the web app and the API (which may also include a change to the database).

Change API called by web app

(e.g. new API, update existing web app component to call new API)

Release: Deploy the new API and then the updated web app.

Rollback: Deploy the previous version of the web app, then the previous version of web-api without the new API.

Change API call by web app

(e.g. change to existing API, update web app to call existing API differently)

When deploying changes to an API, potential disruption can be minimized by coding the API and any database migrations to support both the old and new calls, differentiating between them based on the request schema or, barring that, a version identifier or other flag in the request. This way the previous version of the API remains supported while the new version is deployed.
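A minimal sketch of such dual-schema support, assuming a hypothetical Django REST Framework view whose request payload is changing shape (the endpoint, field names, and payloads are all placeholders):

```python
from rest_framework.response import Response
from rest_framework.views import APIView


class MemoTextView(APIView):
    """Accepts both the old flat payload and the new nested payload."""

    def post(self, request):
        text = request.data.get("text")
        if isinstance(text, dict):
            # New schema sent by the updated web app: {"text": {"value": ...}}
            value = text.get("value")
        else:
            # Old schema, still supported while the frontend deploy propagates:
            # {"text": "..."}
            value = text
        # ... persist the memo text using `value` ...
        return Response({"text": value}, status=201)
```

Once the old web app is fully retired, the old-schema branch (and any supporting database structures) can be removed in a follow-up deploy, as described below.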

Release:

  • Deploy the new web-api code.
  • Deploy the new web-app.
  • After sufficient testing, deploy an updated web-api that removes support for the previous API version from the code and database.

Rollback:

  • If support for the previous version of the API has been removed then deploy the latest version of web-api with support for the previous and current API versions.
  • Deploy the previous version of web-app.
  • If it is determined that the new API version was a dead end that will not be re-attempted or returned to, deploy an updated web-api without support for the new API version.

Summary: migrations minefield

In summary, database migrations present challenges to zero downtime deployment due to potential issues including:

  • coordination between app and database versions
  • table locks
  • long-running migrations
  • ability to revert

Ideas toward Further Improvements

Improved database operations

Hot/warm database backups

We could dramatically decrease the near-term complexity of performing and reversing database migrations if we maintained two databases in parallel. In this scenario, during normal operation the proxy reads from the primary database while writing to both databases. The deployment process can then be:

  1. Call the current primary database "blue" and the secondary "green".
  2. Migrate the green database.
  3. Deploy the updated proxy.
  4. As soon as the deployment is complete, the green database becomes primary.
  5. Continue writing to both databases.

For the duration of the migration, the database being migrated is not being read by the proxy so there is no possible mismatch between app and database versions. This setup means that we can roll back changes by pushing the updated code and switching the primary back to (in the above example) the blue database.
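As one very rough sketch of the switching piece, assume hypothetical “blue” and “green” database aliases defined in DATABASES and a PRIMARY_DATABASE environment variable that is flipped at step 4; the dual writes themselves are assumed to happen at the proxy/replication layer, not in this router:

```python
import os


class BlueGreenRouter:
    """Send all ORM reads and writes to whichever alias is currently primary."""

    def _primary(self):
        # Flipped from "blue" to "green" (or back) as part of the deploy.
        return os.environ.get("PRIMARY_DATABASE", "blue")

    def db_for_read(self, model, **hints):
        return self._primary()

    def db_for_write(self, model, **hints):
        return self._primary()
```

This would be wired up via Django's DATABASE_ROUTERS setting; it only illustrates the primary switch, not the dual-write mechanism itself.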

Potential downsides:

  • Double the cost of the database layer
  • Might lose writes during long migrations?
  • Requires that updates to the proxy maintain backwards-compatibility by supporting the ability to write to both the old and new database schemas, resulting in some additional development overhead.

Alternatively, we could spin up the secondary database only during deployments and then tear it down after. This would be less costly but could significantly lengthen migration time by requiring that the database be duplicated each time, which may not scale well after the application goes live (and the amount of data increases dramatically), and it removes our ability to easily switch back to the prior hot backup.

Cold database backups

Or, rather than utilize two databases, it may be less costly to maintain differential backups.

Throughout this document we have employed a strategy of creating new columns or tables, copying data over, and then tagging the old one for deletion rather than changing existing columns or tables in place. This allows us to avoid triggering an error from the django-migration-linter.

When we speak of tagging something for deletion we mean creating a ticket to add a migration to a future release that tells the django-migration-linter to ignore the migration and then drop the database object. This is preferable to dropping the database object through some out of band process, such as manually deleting it or running a separate process that monitors for tagged objects and then deletes them, because it keeps all of the changes in the same place, within the django migrations and auditable within the repository.

This process doesn’t address the underlying issue that the django-migration-linter would otherwise complain about, though, that performing a drop is not reversible. What if we want to roll back such a change? To restore a dropped column or table containing data we must have a backup from which to restore. To facilitate this, perhaps we could add a step to our process where drops are preceded by backing up the table or column, along with some automation to keep only the previous x backups of a given table or column.
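A minimal sketch of what such a pre-drop backup might look like, assuming a hypothetical contacts_contact.telephone column and a placeholder backup-table naming convention:

```python
from django.db import migrations
from django_migration_linter import IgnoreMigration


class Migration(migrations.Migration):
    dependencies = [("contacts", "0043_previous")]  # hypothetical

    operations = [
        # Tell the django-migration-linter to ignore this destructive change.
        IgnoreMigration(),
        # Snapshot the column (with primary keys, so the data could later be
        # interpolated back in) before it is dropped.
        migrations.RunSQL(
            sql="CREATE TABLE contacts_contact_telephone_backup AS "
                "SELECT id, telephone FROM contacts_contact;",
            reverse_sql="DROP TABLE IF EXISTS contacts_contact_telephone_backup;",
        ),
        # Drop the column itself.
        migrations.RemoveField(model_name="contact", name="telephone"),
    ]
```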

In the case of a dropped table, the entire table could be restored from backup. But what if time has passed and the database has significantly diverged from the state when the backup was taken? If there are foreign key relationships between the restored table and others in the database then restoring the table could result in orphan rows that rely on records elsewhere in the database that no longer exist. Theoretically there may be an automated way to remove no-longer needed restored rows or add blank rows elsewhere to mock out required relationships.

In the case of a dropped column, theoretically the column could be re-added to the table and then an automated process could restore the backed up column data by interpolating it back into the table based on matching the primary key.

Improved rolling deployments?

With Cloud Foundry, “new instances share the same route as the old instances during the deploy, so there is no special routing or limiting”. This appears to prevent doing something like a rolling deployment where old instances are removed as new instances are deployed and, during the transition, traffic from the old web app is routed to an instance of the old API and traffic from the new web app is routed to a new API instance. If there were a way to orchestrate our pipelines to accomplish this in spite of that lack of support, how could we handle database writes from old and new? If it could be done it would surely be too brittle to be worth it.

Django Migration Enhancements

The challenges to zero downtime deployments outlined above are not new. A number of projects have tried to implement ways of mitigating these issues, including the Django project itself.

SeparateDatabaseAndState

https://docs.djangoproject.com/en/5.2/ref/migration-operations/#separatedatabaseandstate
“A highly specialized operation that lets you mix and match the database (schema-changing) and state (autodetector-powering) aspects of operations.”
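A minimal sketch, with hypothetical app and field names: Django's in-memory model state is updated to reflect a rename while no SQL is emitted, leaving the actual database change to be handled separately (for example by the add/copy/tag-for-deletion recipe above):

```python
from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [("contacts", "0044_previous")]  # hypothetical

    operations = [
        migrations.SeparateDatabaseAndState(
            # Nothing is run against the database.
            database_operations=[],
            # Only Django's model state (what the autodetector sees) changes.
            state_operations=[
                migrations.RenameField(
                    model_name="contact",
                    old_name="telephone",
                    new_name="phone_number",
                ),
            ],
        ),
    ]
```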

django-pg-zero-downtime-migrations package

This package modifies the Django PostgreSQL backend to apply migrations with a focus on minimizing locks.
https://github.com/tbicr/django-pg-zero-downtime-migrations

Another take on the same concept:
https://github.com/yandex/zero-downtime-migrations

Django syzygy package

https://github.com/charettes/django-syzygy
The syzygy Django package overrides the default makemigrations and migrate management commands and allows the developer to separate migrations into prerequisite and postponed migrations to control which run before the code is deployed and which run after all instances are done deploying, along with automatically postponing migrations that contain destructive operations.

Other:
https://github.com/ryanhiebert/django-safemigrate
