
Conversation

@qianwan commented Sep 22, 2025

PR Category

Type of Change

Description

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by unit tests.

Performance

Contributor

Summary of Changes

Hello @qianwan, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new silu_and_mul operator into the flag_gems library, providing a C++ wrapper for the operation. This operator leverages a custom Triton kernel for optimized performance on GPUs. The changes include the necessary C++ implementation, header declarations, build system updates, and a comprehensive suite of unit tests to validate its functionality across different data types and broadcasting scenarios. The Python interface has also been updated to seamlessly use this new C++ backend.
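For orientation, here is a minimal sketch of the operator's semantics in plain PyTorch, mirroring the fp32 reference visible in the PR's unit tests; it is illustrative only and is neither the Triton kernel nor the C++ wrapper added by this PR (the function name is made up):

  import torch

  def silu_and_mul_reference(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
      # Reference semantics: SiLU(x) * y, computed in float32 and cast back
      # to the promoted dtype of the inputs.
      result_dtype = torch.result_type(x, y)
      x_fp32 = x.to(torch.float32)
      y_fp32 = y.to(torch.float32)
      return (torch.nn.functional.silu(x_fp32) * y_fp32).to(result_dtype)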

Highlights

  • New Operator Implementation: Introduced a new silu_and_mul operator with a C++ wrapper for efficient execution.
  • Triton Kernel Integration: The silu_and_mul operator is backed by a new Triton kernel, silu_and_mul_kernel, for GPU acceleration.
  • Comprehensive Unit Tests: Added extensive unit tests for silu_and_mul covering various data types (Float16, BFloat16, Float32), the _out variant, and broadcasting capabilities, ensuring correctness and robustness.
  • Python Frontend Update: The Python frontend for gems_silu_and_mul in src/flag_gems/modules/activation.py has been updated to conditionally utilize the new C++ extension when available, improving performance.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request successfully adds a C++ wrapper for the silu_and_mul operator, including the Triton kernel, C++ implementation, tests, and Python bindings. The code is well-structured and the changes are logical. My review includes a couple of suggestions for improvement. One is an efficiency improvement in the C++ _out variant to avoid an unnecessary memory copy. The other is a small refactoring in the test code to improve clarity by using torch::silu directly. Overall, this is a solid contribution.

auto result_dtype = torch::result_type(x, y);
auto x_fp32 = x.to(torch::kFloat32);
auto y_fp32 = y.to(torch::kFloat32);
auto silu = torch::sigmoid(x_fp32) * x_fp32;
Contributor

Severity: medium

For clarity and maintainability, it's better to use torch::silu directly in your reference computation instead of manually implementing it with torch::sigmoid(x_fp32) * x_fp32. This makes the code more readable and directly expresses the intent.

Suggested change:
- auto silu = torch::sigmoid(x_fp32) * x_fp32;
+ auto silu = torch::silu(x_fp32);
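For reference, SiLU(x) is defined as x * sigmoid(x), so the two formulations are numerically equivalent; a quick illustrative check (not part of the PR):

  import torch

  # SiLU(x) == x * sigmoid(x); torch::silu / torch.nn.functional.silu simply
  # expresses the same computation more directly.
  x = torch.randn(8, dtype=torch.float32)
  assert torch.allclose(torch.nn.functional.silu(x), torch.sigmoid(x) * x)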

Collaborator

As Gemini suggests, use torch::silu in the reference implementation.

Comment on lines +72 to +81
at::Tensor out_contig;
if (out.is_contiguous()) {
  out_contig = out;
} else {
  out_contig = out.contiguous();
}
launch_silu_and_mul(x, y, out_contig);
if (!out.is_contiguous()) {
  out.copy_(out_contig);
}
Contributor

Severity: medium

The current implementation for handling non-contiguous output tensors in silu_and_mul_out involves a potentially expensive data copy. When out is not contiguous, out.contiguous() is called, which creates a new tensor and copies the contents of out. Since out is an output tensor, its initial contents are not needed, making this copy unnecessary.

You can improve efficiency by creating an empty tensor with the same properties as out instead. This avoids the unnecessary copy.

  auto out_contig = out.is_contiguous() ? out : at::empty_like(out);
  launch_silu_and_mul(x, y, out_contig);
  if (!out.is_contiguous()) {
    out.copy_(out_contig);
  }

Collaborator

The same idea as Gemini's suggestion.

if (out.is_contiguous()) {
  out_contig = out;
} else {
  out_contig = out.contiguous();
Collaborator

There is no need to convert the non-contiguous out into a contiguous out_contig via out.contiguous(); just creating a new contiguous tensor is fine, since it is only written to, not read from.

and hasattr(torch.ops, "flag_gems")
and hasattr(torch.ops.flag_gems, "silu_and_mul")
and hasattr(torch.ops.flag_gems, "silu_and_mul_out")
)
@iclementine (Collaborator) commented Sep 23, 2025

Skip this and just use the global flag. The conditions for using the C++ implementation are:

  • has_c_extension: the library is built with the C extension; and
  • use_c_extension: controlled by an environment variable.

A sketch of what this could look like follows below.
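A minimal sketch of that dispatch, assuming has_c_extension and use_c_extension are module-level flags exposed by flag_gems; the import path, the flag spelling, and the Python fallback here are assumptions drawn from this comment, not the repository's actual API:

  import torch

  import flag_gems  # assumed to expose has_c_extension / use_c_extension flags

  def _silu_and_mul_fallback(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
      # Illustrative pure-PyTorch fallback; in the library this role is
      # played by the Triton-backed Python path.
      return torch.nn.functional.silu(x) * y

  def gems_silu_and_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
      # Dispatch to the C++ wrapper only when the library was built with the
      # C extension and the environment-variable switch opts in; no per-op
      # hasattr checks are needed.
      if flag_gems.has_c_extension and flag_gems.use_c_extension:
          return torch.ops.flag_gems.silu_and_mul(x, y)
      return _silu_and_mul_fallback(x, y)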
