RFC: Better migration architecture #366

merlinz01 · 2024-11-21T20:13:29Z

From my experience using aerich in a production system, the current migration approach leaves much to be desired.

Here are my complaints with the current architecture:

Using raw SQL is not database-independent, so you are tied to using the same database type for testing and production.
Sometimes the SQL commands are out of order.
You can't do custom migration logic. All you can do is tweak a generated SQL script.

This is what I've come up with as a better approach:

I think a migration file should consist of the following items:

The current and previous migration ID
The old schema as a dict literal
The new schema as a dict literal
A list of migration operations such as field renames, table creation, etc. with database-specific information only where absolutely unavoidable.
A function which can be used to do manual data migrations in Python or SQL, and its complement for downgrades

Benefits of this approach:

Database-agnostic
A clear way of describing what is going to happen to the database without having to mess with SQL.
Custom upgrade and downgrade logic
The aerich tool can lint the operations so that the schema stays consistent even if a user edits the migration file.
You could ask aerich to generate an null migration file and then implement some data-only (non-schema) updates in the custom migration logic.

Conceptual example:

# This is a migration generated by the aerich tool.
# You may edit this file, but be warned that your changes will be overwritten
# if you run aerich migrate again without upgrading the database to this migration.

from tortoise import BaseDBAsyncClient

# This is the ID of the migration previous to this one, or None if this is the first.
PREVIOUS_MIGRATION_ID = '20241121101234'

# This is the ID of the migration, used to determine if the migration has been run.
MIGRATION_ID = '20241121142100'

# This is the schema of the database as of the last migration (maybe could be imported directly from previous migration?)
OLD_SCHEMA = {
    'MyModel': {
        'table_name': 'my_models',
        'fields': [
            'id': {
                'type': 'IntField',
                'pk': True,
                'null': False,
            },
            'name': {
                'type': 'CharField',
                'pk': False,
                'null': False,
                'max_len': 255,
            },
            ...
        ],
    },
    ....
}

# This is the schema of the database after this migration has been run.
NEW_SCHEMA = {
    'MyModel': {
        'table_name': 'my_models',
        'fields': [
            'id': {
                'type': 'IntField',
                'pk': True,
                'null': False,
            },
            'display_name': {
                'type': 'CharField',
                'pk': False,
                'null': False,
                'max_len': 255,
            },
            ...
        ],
    },
    ....
}

# These are directives to the migration executor about what the migration consists of.
MIGRATION_OPERATIONS = {
    {'type': 'rename_field', 'table': 'my_models', 'old_name': 'name', 'new_name': 'display_name'},
}

# This is where you can implement fancy migration logic if the simple cases don't meet your needs.
async def CUSTOM_MIGRATION_SCRIPT_UPGRADE(dbconn: BaseDBAsyncClient):
    # Example (pseudo-code):
    names = [(id, f"Hi, I'm {name}") for id, name in dbconn.execute('SELECT id, name FROM my_models').all()]

    # Wait for the migrations to be executed
    yield

    dbconn.executemany('UPDATE my_models SET display_name=? WHERE id=?', names)

# This is where you can implement the reverse logic in the same manner if you need to downgrade.
async def CUSTOM_MIGRATION_SCRIPT_DOWNGRADE(dbconn: BaseDBAsyncClient):
    yield

Migration pseudo-code:

# Evaluate the migration file
import migration_file

# Connect to the database
dbconn = get_db_conn()

# Check the previous migration ID against the database
if migration_file.PREVIOUS_MIGRATION_ID is not None and get_last_migration(dbconn) != migration_file.PREVIOUS_MIGRATION_ID:
    raise ValueError('You are running migrations out of order!')

# Make sure the operations update the schema correctly in case the user edited them
schema = migration_file.OLD_SCHEMA
for operation in migration_file.MIGRATION_OPERATIONS:
    modify_schema(schema, operation)
if schema != migration_file.NEW_SCHEMA:
    raise ValueError('The operations do not modify the schema correctly!')

# Run the first part of the user's upgrade logic
asyncgen = await migration_file.CUSTOM_MIGRATION_SCRIPT_UPGRADE(dbconn)
await anext(asyncgen)

# Run the migration operations
for operation in migration_file.MIGRATION_OPERATIONS:
    execute_migration_operation(operation, dbconn, downgrade=False)

# Run the second part of the user's upgrade logic
try:
    await anext(asyncgen)
except StopIteration:
    pass
else:
    raise RuntimeError('Only one yield statement is allowed!')

# Update the migrations
add_migration_to_migrations_table(dbconn, migration_file.MIGRATION_ID)

# Done

merlinz01 · 2024-11-21T20:16:25Z

Linking #76 #148 #243

henadzit · 2024-11-25T20:20:30Z

@merlinz01 thank you for sharing your ideas!

A few comments/questions:

Having both OLD_SCHEMA and NEW_SCHEMA would make the migration files quite large
How would you populate NEW_SCHEMA in the case where you are starting from an empty (null) migration?
Have you looked into other migrations tools? For instance, Django's migrations or alembic? What do you think about their approach to solving migrations?

merlinz01 · 2024-11-25T21:18:15Z

In answer to your questions.

Yes, having both old and new schemas would make the migration files significantly larger. The idea here is to reduce ambiguity for the migration tool, to eliminate the need for a database connection when generating a migration file, and to be able to check the operations against the new schema in case the user modifies one or the other. However, it should be workable to "import" the schema from the previous migration, because there is going to be trouble anyway if that file doesn't exist. So instead of OLD_SCHEMA = { ... } we should have something like OLD_SCHEMA = import_schema_from_migration_file(PREVIOUS_MIGRATION_ID).
If a user is not making schema changes, aerich could just generate NEW_SCHEMA = None or MIGRATION_OPERATIONS = None and use that as a signal to skip the schema migration stuff.
Django seems to employ a similar concept as far as defining operations instead of generating SQL. Sample Django migration files:

# Generated by Django 5.1.3 on 2024-11-25 20:41

from django.db import migrations, models


class Migration(migrations.Migration):

    initial = True

    dependencies = []

    operations = [
        migrations.CreateModel(
            name="TestModel",
            fields=[
                (
                    "id",
                    models.BigAutoField(
                        auto_created=True,
                        primary_key=True,
                        serialize=False,
                        verbose_name="ID",
                    ),
                ),
                ("name", models.CharField(max_length=100)),
                ("age", models.IntegerField()),
            ],
        ),
    ]

# Generated by Django 5.1.3 on 2024-11-25 20:42

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ("testmigrations", "0001_initial"),
    ]

    operations = [
        migrations.RenameField(
            model_name="testmodel",
            old_name="age",
            new_name="age2",
        ),
    ]

Alembic is fairly similar to what I came up with. I like the fact that it enables type-checking on the operations.

"""test

Revision ID: f0f068244d58
Revises: e8fa71e392c3
Create Date: 2024-11-25 16:08:05.689241

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision: str = 'f0f068244d58'
down_revision: Union[str, None] = 'e8fa71e392c3'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.add_column('test_models', sa.Column('test', sa.String(length=20), nullable=False))
    # ### end Alembic commands ###


def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_column('test_models', 'test')
    # ### end Alembic commands ###

merlinz01 · 2024-12-16T19:33:51Z

Dropping this link here mostly for interest's sake: https://github.com/gobuffalo/fizz/blob/main/README.md

waketzheng added this to the v0.9 milestone Dec 12, 2024

waketzheng mentioned this issue Dec 23, 2024

Aerich puts all migrations in one folder, I expect per-app folders structure #171

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Better migration architecture #366

RFC: Better migration architecture #366

merlinz01 commented Nov 21, 2024

merlinz01 commented Nov 21, 2024

henadzit commented Nov 25, 2024

merlinz01 commented Nov 25, 2024

merlinz01 commented Dec 16, 2024

RFC: Better migration architecture #366

RFC: Better migration architecture #366

Comments

merlinz01 commented Nov 21, 2024

merlinz01 commented Nov 21, 2024

henadzit commented Nov 25, 2024

merlinz01 commented Nov 25, 2024

merlinz01 commented Dec 16, 2024