Optimize colorize using matmul and inplace operations #1437
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main    #1437   +/- ##
=====================================
  Coverage   88.33%   88.34%
=====================================
  Files          96       96
  Lines       18901    18908     +7
=====================================
+ Hits       16696    16704     +8
+ Misses      2205     2204     -1
```
```diff
- color_data -= baseline
+ np.subtract(color_data, baseline, out=color_data, where=color_mask)
```
Wasn't this already in place? Is it to avoid calculation along the color_mask axis?
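A minimal sketch of the difference the diff introduces (the array names mirror the diff but the values here are made up): `out=` alone makes the subtraction in-place, while `where=` additionally skips masked-out elements entirely.

```python
import numpy as np

color_data = np.array([[1.0, 2.0], [3.0, 4.0]])
baseline = 1.0
color_mask = np.array([[True, False], [True, True]])

# `out=color_data` writes the result back in place (no temporary array);
# `where=color_mask` leaves elements with a False mask untouched,
# unlike `color_data -= baseline`, which updates every element.
np.subtract(color_data, baseline, out=color_data, where=color_mask)
# color_data is now [[0., 2.], [2., 3.]] -- the masked-out 2.0 is unchanged
```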
```python
cd2 = color_data.reshape(-1, C)
rgb_sum = (cd2 @ RGB).reshape(H, W, 3)  # weighted sums for r, g, b
```
Have you tried to play around with Einstein notation or np.tensordot for this? AI generated benchmark:
```python
import numpy as np
import timeit

def benchmark(H, W, C):
    # Random test data
    color_data = np.random.rand(H, W, C)
    RGB = np.random.rand(C, 3)

    # Method 1: reshape + matmul
    def method_matmul():
        cd2 = color_data.reshape(-1, C)
        return (cd2 @ RGB).reshape(H, W, 3)

    # Method 2: einsum
    def method_einsum():
        return np.einsum('hwc,cj->hwj', color_data, RGB)

    # Method 3: einsum with optimize=True
    def method_einsum_opt():
        return np.einsum('hwc,cj->hwj', color_data, RGB, optimize=True)

    # Method 4: tensordot
    def method_tensordot():
        return np.tensordot(color_data, RGB, axes=([2], [0]))  # shape: (H, W, 3)

    # Verify correctness
    out1 = method_matmul()
    out2 = method_einsum()
    out3 = method_einsum_opt()
    out4 = method_tensordot()
    assert np.allclose(out1, out2)
    assert np.allclose(out1, out3)
    assert np.allclose(out1, out4)

    # Benchmark
    time_matmul = timeit.timeit(method_matmul, number=10)
    time_einsum = timeit.timeit(method_einsum, number=10)
    time_einsum_opt = timeit.timeit(method_einsum_opt, number=10)
    time_tensordot = timeit.timeit(method_tensordot, number=10)
    print(f"H={H}, W={W}, C={C}")
    print(f"reshape+matmul:         {time_matmul:.4f} s")
    print(f"einsum:                 {time_einsum:.4f} s")
    print(f"einsum (optimize=True): {time_einsum_opt:.4f} s")
    print(f"tensordot:              {time_tensordot:.4f} s")
    print("-" * 50)

# Test different shapes
benchmark(256, 256, 64)
benchmark(512, 512, 64)
benchmark(256, 256, 256)
benchmark(128, 128, 1024)
```

H=256, W=256, C=64
reshape+matmul: 0.0284 s
einsum: 0.1792 s
einsum (optimize=True): 0.0188 s
tensordot: 0.0280 s
--------------------------------------------------
H=512, W=512, C=64
reshape+matmul: 0.1330 s
einsum: 0.5891 s
einsum (optimize=True): 0.0341 s
tensordot: 0.1112 s
--------------------------------------------------
H=256, W=256, C=256
reshape+matmul: 0.1042 s
einsum: 0.6018 s
einsum (optimize=True): 0.0419 s
tensordot: 0.1038 s
--------------------------------------------------
H=128, W=128, C=1024
reshape+matmul: 0.0713 s
einsum: 0.6043 s
einsum (optimize=True): 0.0497 s
tensordot: 0.0712 s
--------------------------------------------------
Thanks. Not 100% sure what to take from that, though I will note that in most cases C << 100.
That's why I said it was AI-generated. I assume you have a more fully fledged way to actually measure your performance; this was more to show that there was a tiny bit of performance that could be gained here.
```python
# Replace NaNs with 0s for dot/matmul in one pass (in-place)
# If you don't want to mutate color_data contents further, copy first.
np.nan_to_num(color_data, copy=False)  # NaN -> 0
```
Can't we use an existing mask for this? I would assume nan_to_num does this calculation behind the scenes anyway.
Also note that this will convert inf values to large finite floats.
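A small illustration of the caveat raised here (the array values are made up): `np.nan_to_num` clamps infinities to the dtype's finite extremes, whereas a mask-based fill touches only the NaNs.

```python
import numpy as np

a = np.array([np.nan, np.inf, -np.inf, 2.0])

# nan_to_num in place: NaN -> 0, but +/-inf become huge finite floats
b = a.copy()
np.nan_to_num(b, copy=False)
# b[1] is now np.finfo(b.dtype).max, b[2] its negative counterpart

# Mask-based alternative: only NaNs are replaced, infinities survive
c = a.copy()
c[np.isnan(c)] = 0.0
```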
CodSpeed Instrumentation Performance Report

Merging #1437 will degrade performances by 17.88%.

Benchmarks breakdown
Current status: current = 6b0982b, before = fb2c16e, main = 184ef3c

Benchmark code
```python
import time
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

N = int(10e6)
C = 20

class Profile:
    def __init__(self, output_file):
        import cProfile
        self.profiler = cProfile.Profile()
        self.output_file = output_file

    def __enter__(self):
        self.profiler.enable()
        return self.profiler

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.profiler.disable()
        self.profiler.dump_stats(self.output_file)

class LineProfileContext:
    def __init__(self, output_file):
        from line_profiler import LineProfiler
        self.profiler = LineProfiler()
        self.output_file = output_file
        self.functions_to_profile = []

    def add_function(self, func):
        """Add a function to be profiled line-by-line"""
        self.profiler.add_function(func)
        self.functions_to_profile.append(func)
        return func

    def __enter__(self):
        self.profiler.enable_by_count()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.profiler.disable_by_count()
        self.profiler.dump_stats(self.output_file)
        self.profiler.print_stats()

def gen_data(N=N, C=C):
    np.random.seed(1)
    xy = np.random.randn(int(N), 2)
    c = np.random.choice([chr(65 + i) for i in range(C)], size=N)
    df = pd.DataFrame(xy, columns=["x", "y"])
    df["c"] = pd.Series(c).astype("category")
    return df

def profile(df, size=1000):
    W = H = size
    cvs = ds.Canvas(plot_width=W, plot_height=H)
    agg = cvs.points(df, x="x", y="y", agg=ds.count_cat("c"))
    tf.shade(agg)  # warm up
    pre = time.monotonic()
    tf.shade(agg)  # timed call
    # with LineProfileContext("line_profile.lprof") as line_profiler:
    #     line_profiler.add_function(tf._colorize)
    #     tf.shade(agg)
    # with Profile(output_file="optional.perf"):
    #     ds.transfer_functions.shade(agg)
    return time.monotonic() - pre

# Warmup
df = gen_data(C=20)
profile(df, size=5000)

results = []
for c in (1, 5, 10, 20):
    df = gen_data(C=c)
    for s in range(1000, 6000, 1000):
        timing = profile(df, size=s)
        results.append((c, s, timing))
        print(f"{c=}, {s=}, {timing=}")
```

Plotting
```python
current = [  # 6b0982b
    dict(c=1, s=1000, timing=0.07571580299918423),
    dict(c=1, s=2000, timing=0.295644159999938),
    dict(c=1, s=3000, timing=0.6464670440000191),
    dict(c=1, s=4000, timing=1.1230143669999961),
    dict(c=1, s=5000, timing=1.7188509200004773),
    dict(c=5, s=1000, timing=0.11476161499922455),
    dict(c=5, s=2000, timing=0.44561883799906354),
    dict(c=5, s=3000, timing=0.9790756620004686),
    dict(c=5, s=4000, timing=1.7118233849996614),
    dict(c=5, s=5000, timing=2.6856131889999233),
    dict(c=10, s=1000, timing=0.14612587800002075),
    dict(c=10, s=2000, timing=0.5675542549997772),
    dict(c=10, s=3000, timing=1.2379318599996623),
    dict(c=10, s=4000, timing=2.2251677369986282),
    dict(c=10, s=5000, timing=3.397321854999973),
    dict(c=20, s=1000, timing=0.1993868179997662),
    dict(c=20, s=2000, timing=0.8214870430001611),
    dict(c=20, s=3000, timing=1.7614306820014463),
    dict(c=20, s=4000, timing=3.0943053329992836),
    dict(c=20, s=5000, timing=4.7508491489988955),
]
before = [  # fb2c16e
    dict(c=1, s=1000, timing=0.07645769699956873),
    dict(c=1, s=2000, timing=0.3170905290007795),
    dict(c=1, s=3000, timing=0.7142776969994884),
    dict(c=1, s=4000, timing=1.2551025209995714),
    dict(c=1, s=5000, timing=1.9599227520011482),
    dict(c=5, s=1000, timing=0.16230177999932494),
    dict(c=5, s=2000, timing=0.5520949959991412),
    dict(c=5, s=3000, timing=1.2177506650004943),
    dict(c=5, s=4000, timing=2.171157504999428),
    dict(c=5, s=5000, timing=3.3560801679996075),
    dict(c=10, s=1000, timing=0.2009295749994635),
    dict(c=10, s=2000, timing=0.7160231019988714),
    dict(c=10, s=3000, timing=1.6094946339999296),
    dict(c=10, s=4000, timing=2.7828460880009516),
    dict(c=10, s=5000, timing=4.274540911001168),
    dict(c=20, s=1000, timing=0.2542700350004452),
    dict(c=20, s=2000, timing=0.9284682460001932),
    dict(c=20, s=3000, timing=2.0608999519990903),
    dict(c=20, s=4000, timing=3.6744658019997587),
    dict(c=20, s=5000, timing=5.747611536000477),
]
main = [  # 184ef3c
    dict(c=1, s=1000, timing=0.0718935530003364),
    dict(c=1, s=2000, timing=0.31208833799973945),
    dict(c=1, s=3000, timing=0.7055044320004527),
    dict(c=1, s=4000, timing=1.2214937410008133),
    dict(c=1, s=5000, timing=1.899291293000715),
    dict(c=5, s=1000, timing=0.1668667740013916),
    dict(c=5, s=2000, timing=0.6655240790005337),
    dict(c=5, s=3000, timing=1.5014597809986299),
    dict(c=5, s=4000, timing=2.5980365989998973),
    dict(c=5, s=5000, timing=4.086923677999948),
    dict(c=10, s=1000, timing=0.2160664650000399),
    dict(c=10, s=2000, timing=0.8515692499986471),
    dict(c=10, s=3000, timing=1.879354708999017),
    dict(c=10, s=4000, timing=3.3537094929997693),
    dict(c=10, s=5000, timing=5.179373872999349),
    dict(c=20, s=1000, timing=0.3006982959996094),
    dict(c=20, s=2000, timing=1.1935125410000182),
    dict(c=20, s=3000, timing=2.72798394000165),
    dict(c=20, s=4000, timing=4.852396742000565),
    dict(c=20, s=5000, timing=7.331704080999771),
]

import hvplot.pandas
import pandas as pd

fn = (
    lambda x: pd.DataFrame(eval(x))
    .hvplot.bar(x="s", y="timing", by="c", title=x)
    .opts(show_grid=True)
)
(fn("current") + fn("before") + fn("main")).cols(1)
```


As the title says, this attempts to optimize the colorize part of the shade operation by avoiding temporary copies and performing a single matmul operation rather than multiple dot operations. In my testing this is about a 10% speedup. My guess is that this could result in even better performance for systems with MKL support.
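The core of the change can be sketched as follows. This is a simplified illustration of the technique the PR describes (the names `color_data`, `RGB`, and the per-channel dot loop are taken from the diff snippets above; shapes and values are made up), not the actual datashader implementation:

```python
import numpy as np

H, W, C = 4, 5, 3
rng = np.random.default_rng(0)
color_data = rng.random((H, W, C))   # per-pixel counts for C categories
RGB = rng.random((C, 3))             # per-category r, g, b weights

# Before: one dot product per output channel (three passes over the data)
r = color_data.dot(RGB[:, 0])
g = color_data.dot(RGB[:, 1])
b = color_data.dot(RGB[:, 2])

# After: a single matmul over the flattened pixels, then reshape back.
# BLAS handles the (H*W, C) @ (C, 3) product in one optimized call.
rgb_sum = (color_data.reshape(-1, C) @ RGB).reshape(H, W, 3)

assert np.allclose(rgb_sum, np.stack([r, g, b], axis=-1))
```

Collapsing the three dots into one matmul avoids repeated traversals of `color_data` and lets an MKL-backed BLAS parallelize the whole product, which is consistent with the speedup reported above.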