[Operator] Add cov op #276

RubiaCx · 2024-11-06T02:04:23Z

PR Category

Operator

Type of Change

New Feature

Description

Add cov op

Issue

#256

Performance

Tested on NV-A100-80G

tongxin · 2024-11-06T05:13:09Z

What's the performance metric?

RubiaCx · 2024-11-06T09:00:11Z

What's the performance metric?

I've added a benchmark to test the accuracy and performance of the covariance operation compared to Torch’s implementation:

import torch
import triton
import triton.testing

# Assumption: `cov` is the Triton operator added by this PR (src/flag_gems/ops/cov.py).
from flag_gems.ops.cov import cov


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['N'],  # Argument varied along the x-axis.
        x_vals=[2**i for i in range(4, 16)],
        x_log=True,
        line_arg='provider',
        line_vals=['torch', 'triton'],
        line_names=['Torch', 'Triton'],
        styles=[('blue', '-'), ('green', '-')],
        ylabel='GB/s',
        plot_name='covariance-benchmark',   # Name for the plot.
        args={'M': 1024, 'correction': 1},  # Default values for parameters other than N.
    )
)
def benchmark_cov(M, N, correction, provider):
    X = torch.randn(M, N, device='cuda', dtype=torch.float32)
    fweights = torch.randint(1, 5, (N,), dtype=torch.int32, device=X.device)
    aweights = torch.rand(N, device='cuda') + 0.1  # Avoid zeros in weights

    quantiles = [0.5, 0.2, 0.8]  # Median, 20th and 80th percentile timings.

    if provider == 'torch':
        ms, min_ms, max_ms = triton.testing.do_bench(
            lambda: torch.cov(X, correction=correction, fweights=fweights, aweights=aweights),
            quantiles=quantiles
        )
        result = torch.cov(X, correction=correction, fweights=fweights, aweights=aweights).cpu()
    elif provider == 'triton':
        ms, min_ms, max_ms = triton.testing.do_bench(
            lambda: cov(X, correction=correction, fweights=fweights, aweights=aweights),
            quantiles=quantiles
        )
        result = cov(X, correction=correction, fweights=fweights, aweights=aweights).cpu()

    # Effective bandwidth, modeled here as three passes over X.
    gbps = lambda ms: 3 * X.numel() * X.element_size() * 1e-9 / (ms * 1e-3)

    # Accuracy check: compare the Triton result against Torch's.
    if provider == 'triton':
        torch_result = torch.cov(X, correction=correction, fweights=fweights, aweights=aweights).cpu()
        precision_diff = torch.max(torch.abs(torch_result - result))
        print(f'The maximum difference between Torch and Triton is {precision_diff.item()}')
    else:
        precision_diff = torch.tensor(0.0)  # No difference for Torch itself

    # A larger time means lower bandwidth, so max_ms gives the lower bound.
    return gbps(ms), gbps(max_ms), gbps(min_ms)


benchmark_cov.run(print_data=True, show_plots=True, save_path=".")

I’ve attached the results as a picture for reference. Let me know if additional details are needed!
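
For readers who want to see what the `correction`, `fweights` and `aweights` arguments in the benchmark mean, here is a minimal pure-Python sketch of a weighted covariance. It follows the weighting convention documented for `numpy.cov`, which as far as I understand `torch.cov` mirrors. The function name `weighted_cov` is hypothetical, and this is a reference for small inputs only, not the kernel in this PR.

```python
def weighted_cov(X, correction=1, fweights=None, aweights=None):
    """Covariance between the rows of X (rows = variables, columns = observations)."""
    n = len(X[0])
    f = fweights if fweights is not None else [1] * n
    a = aweights if aweights is not None else [1.0] * n
    # Combined per-observation weight.
    w = [fi * ai for fi, ai in zip(f, a)]
    w_sum = sum(w)
    # Weighted mean of each variable.
    means = [sum(wi * xi for wi, xi in zip(w, row)) / w_sum for row in X]
    # Normalization factor; aweights change how the correction is applied.
    if aweights is None:
        norm = w_sum - correction
    else:
        norm = w_sum - correction * sum(wi * ai for wi, ai in zip(w, a)) / w_sum
    centered = [[xi - m for xi in row] for row, m in zip(X, means)]
    return [
        [sum(wi * ri * ci for wi, ri, ci in zip(w, r, c)) / norm for c in centered]
        for r in centered
    ]


print(weighted_cov([[1, 2, 3], [2, 4, 6]]))
```

A frequency weight of 2 on an observation behaves as if that column appeared twice, which is a convenient sanity check for any implementation.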

tongxin · 2024-11-10T14:05:41Z

src/flag_gems/ops/cov.py

+def cov(X, correction=1, fweights=None, aweights=None):
+    logging.debug("GEMS ")
+
+    if not X.is_cuda:


Gems sits behind the PyTorch dispatcher, so it doesn't need to handle non-CUDA inputs.

@tongxin Thanks for the feedback! I’ve updated the code to remove it in the latest commit.

tongxin · 2024-11-10T14:37:27Z

src/flag_gems/ops/cov.py

+    mean = torch.zeros(M, device=X.device, dtype=X.dtype)
+    cov_matrix = torch.zeros((M, M), device=X.device, dtype=X.dtype)
+
+    BLOCK_SIZE = min(256, N)


This won't work if N < 256 and N is not a power of 2.
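
The constraint behind this comment is that `tl.arange` requires a power-of-two extent, so a `BLOCK_SIZE` taken directly from an arbitrary `N` will not compile. A minimal sketch of one way to pick a valid block size, assuming the cap of 256 from the diff. The helper names are hypothetical; Triton also ships `triton.next_power_of_2` for the rounding step.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def pick_block_size(n: int, cap: int = 256) -> int:
    # Round N up to a power of two, then clamp to the cap (itself a power of two).
    return min(cap, next_power_of_2(n))


for n in (1, 7, 100, 256, 1000):
    print(n, pick_block_size(n))
```

When the block size exceeds `N`, the lanes past the end of the row still have to be masked in the kernel's loads and stores.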

tongxin · 2024-11-10T14:42:39Z

src/flag_gems/ops/cov.py

+    BLOCK_SIZE = min(256, N)
+    num_blocks = (N + BLOCK_SIZE - 1) // BLOCK_SIZE
+
+    grid = lambda meta: (M, num_blocks)


grid.y is limited to 65535 on CUDA, so the launch will fail with a kernel parameter error if the row size exceeds 256 * 65535, roughly 16.7 million elements. This probably needs a gsl-style (grid-stride loop) kernel or split kernels.
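
To make the grid-stride idea concrete, here is a small host-side simulation in plain Python, not a Triton kernel. It assumes a fixed cap on the number of programs (`max_programs`, a hypothetical parameter) and lets each program walk the blocks `pid, pid + num_programs, ...` until the row is covered, so the launch grid no longer grows with N.

```python
def grid_stride_blocks(pid: int, num_programs: int, num_blocks: int) -> list:
    """Block indices handled by program `pid` in a grid-stride loop."""
    return list(range(pid, num_blocks, num_programs))


def simulate(n: int, block_size: int, max_programs: int) -> list:
    num_blocks = (n + block_size - 1) // block_size
    # Launch at most `max_programs` programs, however large N is.
    num_programs = min(num_blocks, max_programs)
    visits = [0] * n
    for pid in range(num_programs):
        for block in grid_stride_blocks(pid, num_programs, num_blocks):
            start = block * block_size
            # The min() plays the role of the tail mask in a real kernel.
            for i in range(start, min(start + block_size, n)):
                visits[i] += 1
    return visits


visits = simulate(n=10_000, block_size=256, max_programs=8)
print(all(v == 1 for v in visits))
```

Every element is visited exactly once even though only 8 programs are launched, which is the property the real kernel would rely on.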

tongxin · 2024-11-10T14:45:45Z

src/flag_gems/ops/cov.py

+    mean_kernel[grid](X, mean, M, N, weights, BLOCK_SIZE=BLOCK_SIZE)
+    mean = mean / total_weight
+
+    grid_cov = lambda meta: (M, M, num_blocks)


Now M is also subject to the 65535 limit on grid.y, which could be an issue.
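
A quick way to see which launches run into these limits is to check each grid tuple against the CUDA maximums (2^31 - 1 for grid.x, 65535 for grid.y and grid.z). This is a sketch only; the helper name `grid_violations` and the example sizes are made up for illustration.

```python
CUDA_MAX_GRID = (2**31 - 1, 65535, 65535)  # limits for grid.x, grid.y, grid.z


def grid_violations(grid):
    """Return (axis, size, limit) for every grid dimension over the CUDA limit."""
    axes = "xyz"
    return [
        (axes[i], size, CUDA_MAX_GRID[i])
        for i, size in enumerate(grid)
        if size > CUDA_MAX_GRID[i]
    ]


M, N, BLOCK_SIZE = 70_000, 1024, 256
num_blocks = (N + BLOCK_SIZE - 1) // BLOCK_SIZE

print(grid_violations((M, num_blocks)))     # grid used for mean_kernel
print(grid_violations((M, M, num_blocks)))  # grid used for the covariance kernel
```

With these sizes the 2-D grid is fine, but the 3-D grid places the second M on grid.y and exceeds the limit.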

tongxin · 2024-11-10T14:59:05Z

src/flag_gems/ops/cov.py

+    tl.atomic_add(cov_matrix + row * M + col, cov)
+
+def cov(X, correction=1, fweights=None, aweights=None):
+    logging.debug("GEMS ")


Oops, thanks for pointing that out.

tongxin

Hello @RubiaCx, are you planning further revisions?

RubiaCx · 2024-11-20T11:19:57Z

Hello @RubiaCx, are you planning further revisions?
@tongxin Yes, I am in the process of fixing bugs, but I've been quite busy recently.

Merge updates from upstream master into develop branch to keep it up-to-date.

RubiaCx · 2024-11-20T12:31:44Z

This commit fixes the failure of the COV calculation when N < 256 and N is not a power of 2. The code now ensures that BLOCK_SIZE is a power of 2, which resolves the problem.

However, I encountered an "illegal memory access was encountered" error when M exceeds MAX_GRID_NUM.

tongxin · 2024-11-25T07:33:29Z

This commit fixes the failure of the COV calculation when N < 256 and N is not a power of 2. The code now ensures that BLOCK_SIZE is a power of 2, which resolves the problem.

However, I encountered an "illegal memory access was encountered" error when M exceeds MAX_GRID_NUM.

You should probably try a gsl-style kernel, which both reduces the number of kernel calls and bounds the number of CTAs.

tongxin · 2024-12-13T01:08:40Z

Could you please resolve the conflicts so that we can merge this PR?

tongxin · 2024-12-13T01:19:23Z

src/flag_gems/ops/cov.py

+    for i in range(num_row_chunks):
+        row_offset = i * MAX_GRID_NUM
+        current_M = min(MAX_GRID_NUM, M - row_offset)
+        grid = (current_M,)
+        mean_kernel[grid](X, mean, M, N, weights, row_offset=row_offset, BLOCK_SIZE=BLOCK_SIZE)
+    mean = mean / sum_weights
+
+    for i in range(num_row_chunks):
+        row_offset = i * MAX_GRID_NUM
+        current_rows = min(MAX_GRID_NUM, M - row_offset)    
+        for j in range(num_row_chunks):
+            col_offset = j * MAX_GRID_NUM
+            current_cols = min(MAX_GRID_NUM, M - col_offset)
+            grid = (current_rows, current_cols)
+            covariance_kernel[grid](X, cov_matrix, mean, M, N, weights, row_offset=row_offset, col_offset=col_offset, BLOCK_SIZE=BLOCK_SIZE)


This is not the gsl-style kernel I mentioned previously. Multiple kernel invocations should be avoided as much as possible.

Oh, I found #91 and will refactor the cov function accordingly to adopt the GSL-style kernel as suggested. Thanks for pointing this out!
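
For a rough sense of the launch overhead under discussion, the chunked host loops in the diff issue one `mean_kernel` launch per row chunk and one `covariance_kernel` launch per (row chunk, column chunk) pair. A back-of-the-envelope sketch, assuming `num_row_chunks` is the ceiling of `M / MAX_GRID_NUM`. The value 65535 below is an assumption, since the PR's `MAX_GRID_NUM` is not shown in this thread.

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def chunked_launches(m: int, max_grid_num: int) -> int:
    """Kernel launches issued by the chunked host loops."""
    chunks = ceil_div(m, max_grid_num)
    mean_launches = chunks          # one mean_kernel launch per row chunk
    cov_launches = chunks * chunks  # one covariance_kernel launch per chunk pair
    return mean_launches + cov_launches


# max_grid_num=65_535 is an assumed value, not taken from the PR.
for m in (1_000, 65_535, 200_000, 1_000_000):
    print(m, chunked_launches(m, max_grid_num=65_535))
```

The count grows quadratically with the number of chunks, whereas a grid-stride kernel keeps it at one launch per kernel regardless of M.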

RubiaCx added 2 commits November 6, 2024 09:52

[Operator] Add cov op

be80e23

[Operator] Add cov op

dfc6237

tongxin reviewed Nov 10, 2024

tongxin self-assigned this Nov 10, 2024

[Operator] Add covariance (cov) op & Remove unused code

571308b

RubiaCx force-pushed the develop branch from 7c56d35 to 571308b on November 10, 2024 15:11

tongxin requested changes Nov 20, 2024

RubiaCx added 3 commits November 20, 2024 20:07

[Operator] Correct covariance (cov) op indexing issue

620d842

Merge remote-tracking branch 'upstream/master' into develop

5e581b6

Merge updates from upstream master into develop branch to keep it up-to-date.

[Operator] Fix covariance (cov) op shape issues and handle edge cases

8d36687

Tango2018cc mentioned this pull request Nov 26, 2024

Code Contribution: 【Lv2】【Operator Development】cov #256

Open

[Operator] Reduce MAX_GRID_NUM and implement sub-block handling for large M

69f1eda

tongxin requested changes Dec 13, 2024

RubiaCx added 2 commits December 13, 2024 10:48

Merge to keep branches in sync

831fec0

Merge master into COV op with major updates

cb44c17
