[Documentation] Explain performance improvements #670

Open
jan-janssen opened this issue Jun 9, 2025 · 1 comment
jan-janssen commented Jun 9, 2025

Generate data:

import numpy as np
import pandas as pd

N = 1_000_000
data = pd.DataFrame({
    "c": np.random.choice(["a", "b", "c"], size=N),
    "x": np.random.uniform(size=N),
    "y": np.random.normal(size=N)
})

data.to_csv("blob.csv")  # File is about 45 Mb

Slow execution (the CSV is read and the full DataFrame is transferred to a worker for each of the 100 tasks): 24.1 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import pandas as pd
from executorlib import SingleNodeExecutor

def get_sum(i, df):
    return i, df["x"].sum(), df["y"].sum()

with SingleNodeExecutor(max_workers=10) as exe:
    # the CSV is read 100 times in the main process and the DataFrame is serialized for every task
    future_lst = [exe.submit(get_sum, df=pd.read_csv("blob.csv"), i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]
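
Most of the 24 s is spent outside the actual summation: pd.read_csv runs 100 times in the main process and the full DataFrame has to be serialized and shipped to a worker for every submit call. A rough way to estimate the per-task payload is to time the pickling of the DataFrame directly (a standalone sketch using the standard-library pickle module; executorlib may serialize its arguments differently, so the numbers are only an approximation):

import pickle
import time

import pandas as pd

df = pd.read_csv("blob.csv")

start = time.perf_counter()
payload = pickle.dumps(df)  # roughly the data that crosses the process boundary per task
duration = time.perf_counter() - start

print(f"payload: {len(payload) / 1e6:.1f} MB, serialized in {duration:.3f} s")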

Reduce the startup time of the processes (block_allocation=True reuses a fixed pool of worker processes instead of starting a new process per task): 19.5 s ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import pandas as pd
from executorlib import SingleNodeExecutor

def get_sum(i, df):
    return i, df["x"].sum(), df["y"].sum()

# block_allocation=True keeps the worker processes alive and reuses them for subsequent tasks
with SingleNodeExecutor(max_workers=10, block_allocation=True) as exe:
    future_lst = [exe.submit(get_sum, df=pd.read_csv("blob.csv"), i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]

Load the data only once for each process (init_function runs once per worker and its returned dictionary provides the df argument to every submitted function): 946 ms ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import pandas as pd
from executorlib import SingleNodeExecutor

def get_sum(i, df):
    return i, df["x"].sum(), df["y"].sum()

def init_funct():
    # runs once per worker; the returned dict supplies the df argument for every submitted task
    return {"df": pd.read_csv("blob.csv")}

with SingleNodeExecutor(max_workers=10, block_allocation=True, init_function=init_funct) as exe:
    future_lst = [exe.submit(get_sum, i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]
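
With the init_function in place, each submit call only transfers the integer i and receives a small tuple back; the roughly 45 MB DataFrame is read once per worker and stays inside the ten worker processes, which is why the runtime drops by more than an order of magnitude.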

jan-janssen (Member, Author) commented:

In addition, include an option to monitor the data transfer to fine-tune the performance (#671).
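
Until such an option exists in executorlib itself, a small wrapper around submit can approximate it by pickling the arguments before they are handed to the executor (submit_with_transfer_log is a hypothetical helper written for illustration, not part of the executorlib API, and the real internal serialization may differ):

import pickle
import time

def submit_with_transfer_log(exe, fn, *args, **kwargs):
    # hypothetical helper: estimate how much data a single submit call ships to a worker
    start = time.perf_counter()
    payload_size = len(pickle.dumps((args, kwargs)))
    duration = time.perf_counter() - start
    print(f"{fn.__name__}: ~{payload_size / 1e6:.2f} MB serialized in {duration:.3f} s")
    return exe.submit(fn, *args, **kwargs)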
