
[Feature Request/Discussion] Splitting tenant schemas into pieces to run in multiple processes. #550

Open
kingbuzzman opened this issue Jun 20, 2018 · 11 comments



kingbuzzman commented Jun 20, 2018

The need has arisen to distribute our schema migrations. We have 150+ schemas, and while the parallel executor is a great add-on that greatly speeds up our process, I'd like to propose a way to break migrations down further.

What I'm proposing is something like this:

./manage.py migrate_schemas --shared  # do public schema alone..
./manage.py migrate_schemas --tenant --part 1 --of 3 &  # break tenant schema into 3 pieces and do the 1st piece
./manage.py migrate_schemas --tenant --part 2 --of 3 &
./manage.py migrate_schemas --tenant --part 3 --of 3 &

All the schemas would be retrieved, sorted by their pk, and split into parts; the "part" you ask for is the one that runs, using a good ol' Python slice.
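
A minimal sketch of what that slicing could look like (function and argument names are illustrative, not the actual management command internals):

def schemas_for_part(schemas, part, of):
    """Return the slice of tenant schemas this process should migrate."""
    ordered = sorted(schemas, key=lambda s: s.pk)    # stable order across processes
    chunk = -(-len(ordered) // of)                   # ceiling division
    return ordered[(part - 1) * chunk:part * chunk]  # good ol' Python slice

For example, --part 2 --of 3 over 10 schemas would migrate ordered[4:8].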

Naturally these wouldn't all run on the same machine; my automated deployment would figure out how many servers I have, deploy the code, and divide the work once the public schema has been migrated.

What do you guys think?


g-as commented Jun 21, 2018

Sounds interesting.

The main issue that comes to mind is how to handle the split if other schemas are added while you're migrating all the parts, or after migrating one part but not the others.


kingbuzzman commented Jun 22, 2018

@AGASS007 LOL, don't do that while the tenant migrations are running.

Your two options for avoiding that are:

  1. Create a management command that creates new tenant members (our choice):
./manage.py migrate_schemas --shared
./manage.py create_member --name foo --schema-name bar
./manage.py migrate_schemas --tenant --part 1 --of 3 &
....
  2. Create your new members inside a public app migration. (I'm not going to lie, this is a very weird concept to me; a rough sketch follows below.)
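
For what option 2 could look like, here's a rough, hypothetical sketch of a data migration in the public app (the customers.Client names follow the example used later in this thread; adjust to your project):

from django.db import migrations

def create_member(apps, schema_editor):
    # Import the concrete model so TenantMixin.save() still creates the schema;
    # the historical model from apps.get_model() would skip that logic.
    from customers.models import Client
    Client.objects.create(schema_name='bar', domain_url='bar.example.com')

class Migration(migrations.Migration):
    dependencies = [('customers', '0001_initial')]
    operations = [migrations.RunPython(create_member, migrations.RunPython.noop)]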

But my point is: don't create or delete schemas while you're running migrations in parts. Do you have a particular use case for this? Right now, creating a new tenant is a pretty big deal for us, and another team handles it; the dev team is responsible for the integrity of the system as a whole, not for individual schemas.

Edit: This is what I have working so far: https://github.com/bernardopires/django-tenant-schemas/compare/master...kingbuzzman:migration-in-parts?expand=1 - still need tests and to clean up the text a bit.


g-as commented Jun 22, 2018

In our setup, new tenants are created through an API, which can be triggered anytime.

I'm fine with not addressing a potential change in the number of tenants while migrating (i.e. no built-in redundancy/extra margin in the chunking mechanism), but I think it should be made explicit.


kingbuzzman commented Jun 22, 2018

@AGASS007 Thinking about your problem a little more, wouldn't this be solved by using transactions?

If you wrap your create-new-tenant function in a transaction, no one will see the new tenant until it gets committed. That means you'd have to be REALLY unlucky to hit this issue while the parts are running and you create a new tenant. Ideally, these migrate commands would be started within milliseconds of each other.

My proof of concept (run from inside the test app):

from customers.models import Client
from django.db.transaction import atomic

# The new tenant stays invisible to migrate_schemas processes running in other
# sessions until this transaction commits.
with atomic():
    Client.objects.create(schema_name='e', domain_url='e')

In the meantime I'm running migrate_schemas in a different session, and it looks like it works. No false positives; if I remove the atomic, I can potentially get issues, depending on how many schemas you have and how many parts you're splitting into.

@kingbuzzman

@AGASS007 I've created the PR; any criticism would be appreciated.

#552


xgilest commented Jul 20, 2018

Hi @kingbuzzman. At Txerpa we've taken a different approach to the problem: @marija-milicevic has developed a new executor based on celery, a truly distributed schema migration, much better than our previous parallel executor. We have more than 2k schemas, and big migrations are a nightmare for us; that's why we've put so much effort into this problem.
I'm not sure whether your solution is compatible with our approach, but I believe so.
Please feel free to try whether it works for you. We've made pull request #553.


kingbuzzman commented Jul 20, 2018

@xgilest I don't know how I feel about delegating this task to celery. On the one hand, sure, it's a job queue... let it do it in the background... But this doesn't fit our release process -- the company I work for is very much: turn the app servers off after 9pm, throw up a pretty 503, do the migrations as fast as possible, and when that's done, turn the app servers back on. If we were a bit more in the mindset of zero downtime, your solution would probably work for us.

But even so, I'd still have some reservations:

  1. Looks like if I give you 2k tenants, you schedule 2k migration jobs. If you have 2k workers (let's say), that means you're going to pound the DB pretty hard, no? How would you limit this to, say, run 5 at a time? IO is taxing.

  2. What does this do to our normal celery tasks? Everything needs to be forwards compatible. (Currently we wait for all the tasks to finish after we throw up the 503, and then the migrations start.)

  3. The discipline for this is pretty high. I'd love to embrace this set-it-and-forget-it approach, but my bosses are very cautious souls who wouldn't leave anything to chance. So much so that before any deployment, we grab a prod DB, get a MASSIVE AWS spot instance, load it all in there, and run migrations to ensure it will all work before the real deal.

I'm not saying any of these are deal breakers. It's just a different paradigm, and one that scares a lot of large companies like mine.

How are you dealing with all this right now?

ps. your tests are failing.


xgilest commented Jul 24, 2018

@kingbuzzman I do understand your reservations; here's how we're dealing with them:

  1. We don't have 2k workers, and if we did, we'd set up a separate queue with only a limited number of workers consuming these tasks (see the sketch after this list).

  2. If you have a separate queue and still wait to start until there are no tasks running, it should not be a problem, unless I'm missing something.

  3. Sure, we definitely do the same: run our migrations before the real deal and check that everything will run fine; once the checks are done, we want it to happen as fast as possible ;)
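
A rough sketch of the dedicated-queue idea from point 1 (the task name, queue name and concurrency are illustrative, not part of #553):

# Celery configuration: route the per-tenant migration task to its own queue.
task_routes = {
    'myapp.tasks.migrate_tenant': {'queue': 'migrations'},
}

# Then drain that queue with a small, dedicated worker pool, e.g. 5 at a time:
#   celery -A myproject worker -Q migrations --concurrency=5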

ps. Tests are failing in dts; our changes don't break anything that wasn't already broken, but we've been unable to fix it.
The reported error is:

3.5 is not installed; attempting download
Downloading archive: https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64/python-3.5.tar.bz2

$ curl -sSf -o python-3.5.tar.bz2 ${archive_url}
curl: (56) SSL read: error:00000000:lib(0):func(0):reason(0), errno 104
Unable to download 3.5 archive. The archive may not exist. Please consider a different version.

Which is very weird because it's not failing in the other Python 3.5 builds. Any idea on how to fix it would be appreciated.


andreburto commented Jul 24, 2018

@xgilest My one issue with the change by @marija-milicevic is the requirement of celery. Our project at work uses a different library for async tasks, which means with this change we'd be juggling two.

@positiveintent

@andreburto - @xgilest actually has a very elegant solution to this. The number of workers can be tuned PER machine type, and they'll just pull the next tenant as soon as they're ready. This means the solution works best if your tenants are not of equal size (some take longer to migrate than others) or the worker machines are not identical (in a queue setup it's easier to tune the workers to a machine and just let them go).

As far as WHICH queue to choose - I bet it would be fairly simple to apply a strategy pattern here and write providers for any queue type: RabbitMQ, RQ, SQS, etc.
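
Something like this, for instance (hypothetical names, not an existing django-tenant-schemas API): the migrate command only talks to a tiny interface, and each queue type gets its own provider.

from abc import ABC, abstractmethod

class MigrationQueue(ABC):
    @abstractmethod
    def enqueue(self, schema_name: str) -> None:
        """Schedule one tenant schema for migration."""

class CeleryQueue(MigrationQueue):
    def __init__(self, task):
        self.task = task              # a Celery task that migrates one schema

    def enqueue(self, schema_name):
        self.task.delay(schema_name)

class SQSQueue(MigrationQueue):
    def __init__(self, client, queue_url):
        self.client = client          # e.g. boto3.client('sqs')
        self.queue_url = queue_url

    def enqueue(self, schema_name):
        self.client.send_message(QueueUrl=self.queue_url, MessageBody=schema_name)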

@positiveintent

Bonus points for sorting the tenants by size (descending) so that the big tenants get picked up first. This is crude, but it should reduce the overall time the migrations take.
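
One way to get that ordering, assuming PostgreSQL and a size estimate built from pg_tables (the helper name is made up):

from django.db import connection

def largest_first(schema_names):
    """Order tenant schemas by on-disk size, biggest first."""
    sql = """
        SELECT schemaname,
               SUM(pg_total_relation_size(
                   (quote_ident(schemaname) || '.' || quote_ident(tablename))::regclass))
        FROM pg_tables
        WHERE schemaname = ANY(%s)
        GROUP BY schemaname
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, [list(schema_names)])
        sizes = dict(cursor.fetchall())
    return sorted(schema_names, key=lambda s: sizes.get(s, 0), reverse=True)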
