Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk create is slow #1799

Open
Alercard opened this issue Dec 3, 2024 · 12 comments
Open

Bulk create is slow #1799

Alercard opened this issue Dec 3, 2024 · 12 comments

Comments

@Alercard
Copy link

Alercard commented Dec 3, 2024

Describe the bug
I check when my application tries to insert 720 registers Tortoise takes around 7minutes. I tried to send the conecction and batch_size paramters but this didn't make a difference.

To Reproduce
I coded the next function:

instance_array = []
for row in data:
      instance = self.model_core(**row)
      instance_array.append(instance)
result = await self.model_core.bulk_create(instance_array) #, batch_size=500, using_db=conn) # I tried to use batch_size and conn from transaction but it isn't work

model_core is a Models from Tortoise.Models.Model and its code is:

from tortoise import fields
from tortoise.models import Model

class CoreMidPolitics(Model):
    class Meta:
        table = "CORE_MID_POLITICAS"
        
    id = fields.BigIntField(pk=True)
    tipo_politica = fields.CharField(max_length=25)
    orden = fields.CharField(max_length=30)
    formapago_id = fields.CharField(max_length=8)
    solicitud_id = fields.BigIntField()
    solicitud_detalle_id = fields.BigIntField()
    vigente_desde = fields.DateField()
    vigente_hasta = fields.DateField()
    fecha_aprobacion = fields.DatetimeField()
    ciclo = fields.CharField(max_length=25, null=True)
    valor = fields.DecimalField(max_digits=18, decimal_places=4)
    valor_unidad = fields.CharField(max_length=20)
    jgp_peso = fields.IntField(null=True)
    empresa_id = fields.CharField(max_length=50, null=True)
    unidadnegocio_id = fields.CharField(max_length=50, null=True)
    linea_id = fields.BigIntField(null=True)
    grupo_id = fields.BigIntField(null=True)
    subgrupo_id = fields.BigIntField(null=True)
    capacidad_id = fields.BigIntField(null=True)
    marca_id = fields.CharField(max_length=50, null=True)
    ciclovida_id = fields.CharField(max_length=50, null=True)
    producto_id = fields.CharField(max_length=50, null=True)
    seccion_valor = fields.CharField(max_length=100, null=True)
    seccion_unidad = fields.CharField(max_length=100, null=True)

Expected behavior
I expect the bulk_insert process will be faster as the benchamark saids. I tried to insert 720 registers by bulk_create method but It takes around 7minutes.

Additional context
I use SQL Server

@Alercard
Copy link
Author

Alercard commented Dec 3, 2024

Checking the destiny table, I saw registers are inserted 1 or 2 registers per second. That's too slow! This table only have 1 PK and 2 nonclustered indexes but I disabled them and I got the same result.

@Abdeldjalil-H
Copy link
Contributor

Abdeldjalil-H commented Dec 4, 2024

Hi @Alercard.
Have you tried an insert without an ORM? What database are you using?

Also is your code like this?

instance_array = []
for row in data:
    instance = self.model_core(**row)
    instance_array.append(instance)
    result = await self.model_core.bulk_create(instance_array, batch_size=500, using_db=conn)

or this?

instance_array = []
for row in data:
    instance = self.model_core(**row)
    instance_array.append(instance)
result = await self.model_core.bulk_create(instance_array, batch_size=500, using_db=conn)

@henadzit
Copy link
Contributor

henadzit commented Dec 4, 2024

@Alercard A few things that would would help you to get help faster from the community:

  • Do not delete the issue template but actually fill the sections of the template.
  • Provide the reproducible code example including the information on which database you use and how your model looks.

@Alercard
Copy link
Author

Alercard commented Dec 4, 2024

I updated the issue with the template!

@Alercard
Copy link
Author

Alercard commented Dec 4, 2024

Hi @Alercard. Have you tried an insert without an ORM? What database are you using?

Also is your code like this?

instance_array = []
for row in data:
    instance = self.model_core(**row)
    instance_array.append(instance)
    result = await self.model_core.bulk_create(instance_array, batch_size=500, using_db=conn)

or this?

instance_array = []
for row in data:
    instance = self.model_core(**row)
    instance_array.append(instance)
result = await self.model_core.bulk_create(instance_array, batch_size=500, using_db=conn)

The second one

@henadzit
Copy link
Contributor

henadzit commented Dec 4, 2024

@Alercard please let us know which database and driver you use, it's quite important to understand the issue.

I prepared a fully reproducible example with your model and SQLite. It took 0.03 seconds to insert 720 records:

import random
import time
from datetime import datetime, timedelta

from tortoise import Tortoise, fields, run_async
from tortoise.models import Model


class CoreMidPolitics(Model):
    class Meta:
        table = "CORE_MID_POLITICAS"

    id = fields.BigIntField(pk=True)
    tipo_politica = fields.CharField(max_length=25)
    orden = fields.CharField(max_length=30)
    formapago_id = fields.CharField(max_length=8)
    solicitud_id = fields.BigIntField()
    solicitud_detalle_id = fields.BigIntField()
    vigente_desde = fields.DateField()
    vigente_hasta = fields.DateField()
    fecha_aprobacion = fields.DatetimeField()
    ciclo = fields.CharField(max_length=25, null=True)
    valor = fields.DecimalField(max_digits=18, decimal_places=4)
    valor_unidad = fields.CharField(max_length=20)
    jgp_peso = fields.IntField(null=True)
    empresa_id = fields.CharField(max_length=50, null=True)
    unidadnegocio_id = fields.CharField(max_length=50, null=True)
    linea_id = fields.BigIntField(null=True)
    grupo_id = fields.BigIntField(null=True)
    subgrupo_id = fields.BigIntField(null=True)
    capacidad_id = fields.BigIntField(null=True)
    marca_id = fields.CharField(max_length=50, null=True)
    ciclovida_id = fields.CharField(max_length=50, null=True)
    producto_id = fields.CharField(max_length=50, null=True)
    seccion_valor = fields.CharField(max_length=100, null=True)
    seccion_unidad = fields.CharField(max_length=100, null=True)


async def run():
    await Tortoise.init(db_url="sqlite://:memory:", modules={"models": ["__main__"]})
    await Tortoise.generate_schemas()

    records = []
    for _ in range(720):
        records.append(
            CoreMidPolitics(
                tipo_politica=random.choice(["PolicyA", "PolicyB", "PolicyC"]),
                orden=f"ORD-{random.randint(1000, 9999)}",
                formapago_id=f"FP{random.randint(1, 50):02}",
                solicitud_id=random.randint(1, 10000),
                solicitud_detalle_id=random.randint(1, 10000),
                vigente_desde=datetime.now().date(),
                vigente_hasta=(datetime.now() + timedelta(days=random.randint(30, 365))).date(),
                fecha_aprobacion=datetime.now(),
                ciclo=random.choice([None, "Monthly", "Quarterly", "Annual"]),
                valor=round(random.uniform(10.0, 1000.0), 4),
                valor_unidad=random.choice(["UnitA", "UnitB", "UnitC"]),
                jgp_peso=random.choice([None, random.randint(1, 10)]),
                empresa_id=random.choice([None, f"Emp{random.randint(1, 50)}"]),
                unidadnegocio_id=random.choice([None, f"UN-{random.randint(1, 50)}"]),
                linea_id=random.choice([None, random.randint(1, 1000)]),
                grupo_id=random.choice([None, random.randint(1, 1000)]),
                subgrupo_id=random.choice([None, random.randint(1, 1000)]),
                capacidad_id=random.choice([None, random.randint(1, 1000)]),
                marca_id=random.choice([None, f"Brand{random.randint(1, 50)}"]),
                ciclovida_id=random.choice([None, f"LifeCycle{random.randint(1, 20)}"]),
                producto_id=random.choice([None, f"Prod{random.randint(1, 100)}"]),
                seccion_valor=random.choice([None, "Section1", "Section2", "Section3"]),
                seccion_unidad=random.choice([None, "SectionUnit1", "SectionUnit2"]),
            )
        )

    start = time.time()
    await CoreMidPolitics.bulk_create(records, batch_size=500)
    print(f"insert took {time.time() - start:5f}")

    print(await CoreMidPolitics.all().count())


if __name__ == "__main__":
    run_async(run())

Is it possible that the issue is related to slow database, bad connectivity between the app and the database?

@Alercard
Copy link
Author

Alercard commented Dec 5, 2024

@Alercard please let us know which database and driver you use, it's quite important to understand the issue.

I prepared a fully reproducible example with your model and SQLite. It took 0.03 seconds to insert 720 records:

import random
import time
from datetime import datetime, timedelta

from tortoise import Tortoise, fields, run_async
from tortoise.models import Model


class CoreMidPolitics(Model):
    class Meta:
        table = "CORE_MID_POLITICAS"

    id = fields.BigIntField(pk=True)
    tipo_politica = fields.CharField(max_length=25)
    orden = fields.CharField(max_length=30)
    formapago_id = fields.CharField(max_length=8)
    solicitud_id = fields.BigIntField()
    solicitud_detalle_id = fields.BigIntField()
    vigente_desde = fields.DateField()
    vigente_hasta = fields.DateField()
    fecha_aprobacion = fields.DatetimeField()
    ciclo = fields.CharField(max_length=25, null=True)
    valor = fields.DecimalField(max_digits=18, decimal_places=4)
    valor_unidad = fields.CharField(max_length=20)
    jgp_peso = fields.IntField(null=True)
    empresa_id = fields.CharField(max_length=50, null=True)
    unidadnegocio_id = fields.CharField(max_length=50, null=True)
    linea_id = fields.BigIntField(null=True)
    grupo_id = fields.BigIntField(null=True)
    subgrupo_id = fields.BigIntField(null=True)
    capacidad_id = fields.BigIntField(null=True)
    marca_id = fields.CharField(max_length=50, null=True)
    ciclovida_id = fields.CharField(max_length=50, null=True)
    producto_id = fields.CharField(max_length=50, null=True)
    seccion_valor = fields.CharField(max_length=100, null=True)
    seccion_unidad = fields.CharField(max_length=100, null=True)


async def run():
    await Tortoise.init(db_url="sqlite://:memory:", modules={"models": ["__main__"]})
    await Tortoise.generate_schemas()

    records = []
    for _ in range(720):
        records.append(
            CoreMidPolitics(
                tipo_politica=random.choice(["PolicyA", "PolicyB", "PolicyC"]),
                orden=f"ORD-{random.randint(1000, 9999)}",
                formapago_id=f"FP{random.randint(1, 50):02}",
                solicitud_id=random.randint(1, 10000),
                solicitud_detalle_id=random.randint(1, 10000),
                vigente_desde=datetime.now().date(),
                vigente_hasta=(datetime.now() + timedelta(days=random.randint(30, 365))).date(),
                fecha_aprobacion=datetime.now(),
                ciclo=random.choice([None, "Monthly", "Quarterly", "Annual"]),
                valor=round(random.uniform(10.0, 1000.0), 4),
                valor_unidad=random.choice(["UnitA", "UnitB", "UnitC"]),
                jgp_peso=random.choice([None, random.randint(1, 10)]),
                empresa_id=random.choice([None, f"Emp{random.randint(1, 50)}"]),
                unidadnegocio_id=random.choice([None, f"UN-{random.randint(1, 50)}"]),
                linea_id=random.choice([None, random.randint(1, 1000)]),
                grupo_id=random.choice([None, random.randint(1, 1000)]),
                subgrupo_id=random.choice([None, random.randint(1, 1000)]),
                capacidad_id=random.choice([None, random.randint(1, 1000)]),
                marca_id=random.choice([None, f"Brand{random.randint(1, 50)}"]),
                ciclovida_id=random.choice([None, f"LifeCycle{random.randint(1, 20)}"]),
                producto_id=random.choice([None, f"Prod{random.randint(1, 100)}"]),
                seccion_valor=random.choice([None, "Section1", "Section2", "Section3"]),
                seccion_unidad=random.choice([None, "SectionUnit1", "SectionUnit2"]),
            )
        )

    start = time.time()
    await CoreMidPolitics.bulk_create(records, batch_size=500)
    print(f"insert took {time.time() - start:5f}")

    print(await CoreMidPolitics.all().count())


if __name__ == "__main__":
    run_async(run())

Is it possible that the issue is related to slow database, bad connectivity between the app and the database?

I am working with SQL SERVER database and this is my connection string:
DATABASE_URL="mssql://usuer:password@host/database?driver=ODBC+Driver+17+for+SQL+Server&minsize=30&maxsize=80&pool_recycle=200"

I have a good connection because my queries are fast.

@henadzit
Copy link
Contributor

henadzit commented Dec 8, 2024

@Alercard I ran the test with a docker container running MS SQL. The example above took 5 seconds to insert 720 records which is quite slow but it it's ~150 records per second, not 2 records per second as in your case. I used the profiler and it showed that most of the time it was waiting for MS SQL to process requests, not executing Tortoise code.

Which version of Tortoise do you use?

Can you verify that the code is slow with MS SQL running in a container? You can start the container with

docker run -e ACCEPT_EULA=Y -e SA_PASSWORD=Abcd12345678 -p 1433:1433 --name mssql [mcr.microsoft.com/mssql/server:2022-CU15-ubuntu-22.04](http://mcr.microsoft.com/mssql/server:2022-CU15-ubuntu-22.04)

@Alercard
Copy link
Author

Alercard commented Dec 9, 2024

Thanks @henadzit for your help! The project uses tortoise-orm==0.20.0.
The fast api aplication and MS SQL Server are running in diferent servers, and one of the client's requirements was "don't use docker" so I don't prefer to use docker.
I traced the code and I reach the last command send it to the driver It have the insert sentence and parameters for the 720 registers.
By the way, both servers use Windows Server

@Alercard
Copy link
Author

Alercard commented Dec 9, 2024

Today, I updated to tortoise-orm 0.22.2 but I am getting the same.

@Alercard
Copy link
Author

Well, I have decided to change this proccess to handle by an store procedure because I don't have time to trying to solve this issue. But anyway I am interested to know if there is something you could make for this issue.

@henadzit
Copy link
Contributor

Well, I have decided to change this proccess to handle by an store procedure because I don't have time to trying to solve this issue. But anyway I am interested to know if there is something you could make for this issue.

It's really hard to tell what's happening on your side. You are actually in the best position to figure it out and give back to the open-source project that you are using. You can use cProfile and post results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants