
[Feature Request/Discussion] Splitting tenant schemas into pieces to run in multiple processes. #550

Open
kingbuzzman opened this issue Jun 20, 2018 · 11 comments



kingbuzzman commented Jun 20, 2018

The need has arisen to distribute our schema migrations. We have 150+ schemas, and while the parallel executor is a great add-on that greatly speeds up our process, I'd like to propose a way to break migrations down further.

What I'm proposing is something like this:

./manage.py migrate_schemas --shared  # do public schema alone..
./manage.py migrate_schemas --tenant --part 1 --of 3 &  # break tenant schema into 3 pieces and do the 1st piece
./manage.py migrate_schemas --tenant --part 2 --of 3 &
./manage.py migrate_schemas --tenant --part 3 --of 3 &

All the schemas would be retrieved, sorted by their pk, and split into parts; the "part" you ask for is the one that runs, using a good ol' Python slice.
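
A minimal sketch of what that slicing could look like (function and argument names are illustrative, not the actual management command internals):

def schemas_for_part(schemas, part, of):
    """Return the slice of tenant schemas this process should migrate."""
    ordered = sorted(schemas, key=lambda s: s.pk)    # stable order across processes
    chunk = -(-len(ordered) // of)                   # ceiling division
    return ordered[(part - 1) * chunk:part * chunk]  # good ol' Python slice

For example, --part 2 --of 3 over 10 schemas would migrate ordered[4:8].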

Naturally these wouldn't all run on the same machine; my automated deployment would figure out how many servers I have, deploy the code, and divide the work once the public schema has been migrated.

What do you guys think?


g-as commented Jun 21, 2018

Sounds interesting.

The main issue that comes to mind is how to handle the split if other schemas are added while you're migrating all the parts, or after migrating one part but not the others.


kingbuzzman commented Jun 22, 2018

@AGASS007 LOL, don't do that while the tenant migrations are running.

Your two options for avoiding that are:

  1. Create a management command that creates new tenant members (our choice):
./manage.py migrate_schemas --shared
./manage.py create_member --name foo --schema-name bar
./manage.py migrate_schemas --tenant --part 1 --of 3 &
....
  2. Create your new members inside a public app migration. (I'm not going to lie, this is a very weird concept to me; a rough sketch follows below.)
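
For what option 2 could look like, here's a rough, hypothetical sketch of a data migration in the public app (the customers.Client names follow the example used later in this thread; adjust to your project):

from django.db import migrations

def create_member(apps, schema_editor):
    # Import the concrete model so TenantMixin.save() still creates the schema;
    # the historical model from apps.get_model() would skip that logic.
    from customers.models import Client
    Client.objects.create(schema_name='bar', domain_url='bar.example.com')

class Migration(migrations.Migration):
    dependencies = [('customers', '0001_initial')]
    operations = [migrations.RunPython(create_member, migrations.RunPython.noop)]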

But my point is: don't create or delete schemas while you're running migrations in parts. Do you have a particular use case for this? Right now, creating a new tenant is a pretty big deal for us, and another team handles it; the dev team is responsible for the integrity of the system as a whole, not for individual schemas.

Edit: This is what I have working so far: https://github.com/bernardopires/django-tenant-schemas/compare/master...kingbuzzman:migration-in-parts?expand=1 - still need tests and to clean up the text a bit.


g-as commented Jun 22, 2018

In our setup, new tenants are created through an API, which can be triggered anytime.

I'm fine with not addressing a potential change in the number of tenants while migrating (i.e. no built-in redundancy/extra margin in the chunking mechanism), but I think it should be made explicit.


kingbuzzman commented Jun 22, 2018

@AGASS007 Thinking about your problem a little more, wouldn't this be solved by using transactions?

If you wrap your create-new-tenant function in a transaction, no one will see the new tenant until it gets committed. That means you'd have to be REALLY unlucky to hit this issue while the parts are running and you create a new tenant. Ideally, these migrate commands would be started within milliseconds of each other.

My proof of concept (run from inside the test app):

from customers.models import Client
from django.db.transaction import atomic

# The new tenant stays invisible to migrate_schemas processes running in other
# sessions until this transaction commits.
with atomic():
    Client.objects.create(schema_name='e', domain_url='e')

In the meantime I'm running migrate_schemas in a different session, and it looks like it works. No false positives; if I remove the atomic, I can potentially get issues, depending on how many schemas you have and how many parts you're splitting into.

@kingbuzzman

@AGASS007 I've created the PR; any criticism would be appreciated.

#552


xgilest commented Jul 20, 2018

Hi @kingbuzzman. At Txerpa we've taken a different approach to the problem: @marija-milicevic has developed a new executor based on celery, a truly distributed schema migration, much better than our previous parallel executor. We have more than 2k schemas, and big migrations are a nightmare for us; that's why we've put so much effort into this problem.
I'm not sure whether your solution is compatible with our approach, but I believe so.
Please feel free to try whether it works for you. We've made pull request #553.


kingbuzzman commented Jul 20, 2018

@xgilest I don't know how I feel about delegating this task to celery. On the one hand, sure, it's a job queue... let it do it in the background... But this doesn't fit our release process -- the company I work for is very much: turn the app servers off after 9pm, throw up a pretty 503, do the migrations as fast as possible, and when that's done, turn the app servers back on. If we were a bit more in the mindset of zero downtime, your solution would probably work for us.

But even so, I'd still have some reservations:

  1. Looks like if I give you 2k tenants, you schedule 2k migration jobs. If you have 2k workers (let's say), that means you're going to pound the DB pretty hard, no? How would you limit this to, say, run 5 at a time? IO is taxing.

  2. What does this do to our normal celery tasks? Everything needs to be forwards compatible. (Currently we wait for all the tasks to finish after we throw up the 503, and then the migrations start.)

  3. The discipline for this is pretty high. I'd love to embrace this set-it-and-forget-it approach, but my bosses are very cautious souls who wouldn't leave anything to chance. So much so that before any deployment, we grab a prod DB, get a MASSIVE AWS spot instance, load it all in there, and run migrations to ensure it will all work before the real deal.

I'm not saying any of these are deal breakers. It's just a different paradigm, and one that scares a lot of large companies like mine.

How are you dealing with all this right now?

ps. your tests are failing.


xgilest commented Jul 24, 2018

@kingbuzzman I do understand your reservations; here's how we're dealing with them:

  1. We don't have 2k workers, and if we did, we'd set up a separate queue with only a limited number of workers consuming these tasks (see the sketch after this list).

  2. If you have a separate queue and still wait to start until there are no tasks running, it should not be a problem, unless I'm missing something.

  3. Sure, we definitely do the same: run our migrations before the real deal and check that everything will run fine; once the checks are done, we want it to happen as fast as possible ;)
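
A rough sketch of the dedicated-queue idea from point 1 (the task name, queue name and concurrency are illustrative, not part of #553):

# Celery configuration: route the per-tenant migration task to its own queue.
task_routes = {
    'myapp.tasks.migrate_tenant': {'queue': 'migrations'},
}

# Then drain that queue with a small, dedicated worker pool, e.g. 5 at a time:
#   celery -A myproject worker -Q migrations --concurrency=5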

ps. Tests are failing in dts; our changes don't break anything that wasn't already broken, but we've been unable to fix it.
The reported error is:

3.5 is not installed; attempting download
Downloading archive: https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64/python-3.5.tar.bz2

$ curl -sSf -o python-3.5.tar.bz2 ${archive_url}
curl: (56) SSL read: error:00000000:lib(0):func(0):reason(0), errno 104
Unable to download 3.5 archive. The archive may not exist. Please consider a different version.

Which is very weird because it's not failing in the other Python 3.5 builds. Any idea on how to fix it would be appreciated.


andreburto commented Jul 24, 2018

@xgilest My one issue with the change by @marija-milicevic is the requirement of celery. Our project at work uses a different library for async tasks, which means with this change we'd be juggling two.

@positiveintent

@andreburto - @xgilest actually has a very elegant solution to this. The number of workers can be tuned PER machine type, and they'll just pull the next tenant as soon as they're ready. This means the solution works best if your tenants are not of equal size (some take longer to migrate than others) or the worker machines are not identical (in a queue setup it's easier to tune the workers to a machine and just let them go).

As far as WHICH queue to choose - I bet it would be fairly simple to apply a strategy pattern here and write providers for any queue type: RabbitMQ, RQ, SQS, etc.
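
Something like this, for instance (hypothetical names, not an existing django-tenant-schemas API): the migrate command only talks to a tiny interface, and each queue type gets its own provider.

from abc import ABC, abstractmethod

class MigrationQueue(ABC):
    @abstractmethod
    def enqueue(self, schema_name: str) -> None:
        """Schedule one tenant schema for migration."""

class CeleryQueue(MigrationQueue):
    def __init__(self, task):
        self.task = task              # a Celery task that migrates one schema

    def enqueue(self, schema_name):
        self.task.delay(schema_name)

class SQSQueue(MigrationQueue):
    def __init__(self, client, queue_url):
        self.client = client          # e.g. boto3.client('sqs')
        self.queue_url = queue_url

    def enqueue(self, schema_name):
        self.client.send_message(QueueUrl=self.queue_url, MessageBody=schema_name)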

@positiveintent

Bonus points for sorting the tenants by size (descending) so that the big tenants get picked up first. This is crude, but it should reduce the overall time the migrations take.
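
One way to get that ordering, assuming PostgreSQL and a size estimate built from pg_tables (the helper name is made up):

from django.db import connection

def largest_first(schema_names):
    """Order tenant schemas by on-disk size, biggest first."""
    sql = """
        SELECT schemaname,
               SUM(pg_total_relation_size(
                   (quote_ident(schemaname) || '.' || quote_ident(tablename))::regclass))
        FROM pg_tables
        WHERE schemaname = ANY(%s)
        GROUP BY schemaname
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, [list(schema_names)])
        sizes = dict(cursor.fetchall())
    return sorted(schema_names, key=lambda s: sizes.get(s, 0), reverse=True)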
