[Documentation] Explain performance improvements #670

Open
jan-janssen opened this issue Jun 9, 2025 · 1 comment
jan-janssen commented Jun 9, 2025

Generate data:

import numpy as np
import pandas as pd

N = 1_000_000
data = pd.DataFrame({
    "c": np.random.choice(["a", "b", "c"], size=N),
    "x": np.random.uniform(size=N),
    "y": np.random.normal(size=N)
})

data.to_csv("blob.csv")  # File is about 45 Mb

Slow execution (the CSV is read and the full DataFrame is transferred to a worker for each of the 100 tasks): 24.1 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import pandas as pd
from executorlib import SingleNodeExecutor

def get_sum(i, df):
    return i, df["x"].sum(), df["y"].sum()

with SingleNodeExecutor(max_workers=10) as exe:
    # the CSV is read 100 times in the main process and the DataFrame is serialized for every task
    future_lst = [exe.submit(get_sum, df=pd.read_csv("blob.csv"), i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]
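
Most of the 24 s is spent outside the actual summation: pd.read_csv runs 100 times in the main process and the full DataFrame has to be serialized and shipped to a worker for every submit call. A rough way to estimate the per-task payload is to time the pickling of the DataFrame directly (a standalone sketch using the standard-library pickle module; executorlib may serialize its arguments differently, so the numbers are only an approximation):

import pickle
import time

import pandas as pd

df = pd.read_csv("blob.csv")

start = time.perf_counter()
payload = pickle.dumps(df)  # roughly the data that crosses the process boundary per task
duration = time.perf_counter() - start

print(f"payload: {len(payload) / 1e6:.1f} MB, serialized in {duration:.3f} s")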

Reduce the startup time of the processes (block_allocation=True reuses a fixed pool of worker processes instead of starting a new process per task): 19.5 s ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import pandas as pd
from executorlib import SingleNodeExecutor

def get_sum(i, df):
    return i, df["x"].sum(), df["y"].sum()

# block_allocation=True keeps the worker processes alive and reuses them for subsequent tasks
with SingleNodeExecutor(max_workers=10, block_allocation=True) as exe:
    future_lst = [exe.submit(get_sum, df=pd.read_csv("blob.csv"), i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]

Load the data only once for each process (init_function runs once per worker and its returned dictionary provides the df argument to every submitted function): 946 ms ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import pandas as pd
from executorlib import SingleNodeExecutor

def get_sum(i, df):
    return i, df["x"].sum(), df["y"].sum()

def init_funct():
    # runs once per worker; the returned dict supplies the df argument for every submitted task
    return {"df": pd.read_csv("blob.csv")}

with SingleNodeExecutor(max_workers=10, block_allocation=True, init_function=init_funct) as exe:
    future_lst = [exe.submit(get_sum, i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]
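
With the init_function in place, each submit call only transfers the integer i and receives a small tuple back; the roughly 45 MB DataFrame is read once per worker and stays inside the ten worker processes, which is why the runtime drops by more than an order of magnitude.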

jan-janssen (Member, Author) commented:

In addition, include an option to monitor the data transfer to fine-tune the performance (#671).
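
Until such an option exists in executorlib itself, a small wrapper around submit can approximate it by pickling the arguments before they are handed to the executor (submit_with_transfer_log is a hypothetical helper written for illustration, not part of the executorlib API, and the real internal serialization may differ):

import pickle
import time

def submit_with_transfer_log(exe, fn, *args, **kwargs):
    # hypothetical helper: estimate how much data a single submit call ships to a worker
    start = time.perf_counter()
    payload_size = len(pickle.dumps((args, kwargs)))
    duration = time.perf_counter() - start
    print(f"{fn.__name__}: ~{payload_size / 1e6:.2f} MB serialized in {duration:.3f} s")
    return exe.submit(fn, *args, **kwargs)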
