[Bug] no kernel image is available for execution on the device when using sage_hub attention backend on RTX 5090 (Blackwell, sm_120) #13043

@timeuser4

Description

Describe the bug

When enabling the experimental sage_hub attention backend on an RTX 5090 (Blackwell architecture, compute capability 12.0) with PyTorch 2.8 + CUDA 12.9, inference fails with a CUDA kernel compatibility error:

Error no kernel image is available for execution on the device at line 73 in file /src/csrc/ops.cu
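As a sanity check (not part of the original report), the device's compute capability and the architectures the local PyTorch build ships kernels for can be printed directly; on an RTX 5090 the capability reads (12, 0), i.e. sm_120:

import torch

# Name and compute capability of the first CUDA device; the RTX 5090 reports (12, 0).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
# Architectures the installed PyTorch wheel was compiled for.
print(torch.cuda.get_arch_list())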

Reproduction

import torch
import numpy as np
from PIL import Image  # needed for Image.fromarray below
from diffusers.pipelines.flux2.pipeline_flux2 import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained("models/FLUX.2-dev-bnb-4bit", text_encoder=None, torch_dtype=torch.bfloat16).to("cuda:0")
pipe.transformer.set_attention_backend("sage_hub")  # <- Bug here. When I comment out this line, it works fine.
# pipe.load_lora_weights("models/flux-dev-inpaint/pytorch_lora_weights.safetensors")

def create_random_pil(size=(512, 512)):
    arr = np.random.randint(0, 255, (size[1], size[0], 3), dtype=np.uint8)
    return Image.fromarray(arr)

coarse_pil = create_random_pil((512, 512))
garment_pil = create_random_pil((512, 512))
prompt_embeds = torch.randn(1, 256, 15360, dtype=torch.bfloat16, device="cuda:0")

images = pipe(
    image=[coarse_pil, garment_pil],
    prompt_embeds=prompt_embeds,
    height=512,
    width=512,
    guidance_scale=7.5,
    num_inference_steps=30,
    generator=torch.Generator("cpu").manual_seed(42),
)
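A possible interim guard, not from the original report and assuming the failure is specific to compute capability 12.0: only opt into sage_hub when the device capability is below that threshold, and otherwise keep the default attention implementation.

# Hedged workaround sketch: the (12, 0) cutoff is an assumption based on this report,
# not a documented limit of the sage_hub backend.
major, minor = torch.cuda.get_device_capability(0)
if (major, minor) < (12, 0):
    pipe.transformer.set_attention_backend("sage_hub")
# otherwise fall back to diffusers' default attention backend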

Logs

... (initialization output omitted)
0%|                                                                                                                                                                   | 0/30 [00:00<?, ?it/s]########tensor_layout NHD
before kernel call
after kernel call
Error no kernel image is available for execution on the device at line 73 in file /src/csrc/ops.cu

System Info

- 🤗 Diffusers version: 0.36.0.dev0
- Platform: Linux-5.15.0-83-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.12.12
- PyTorch version (GPU?): 2.8.0+cu129 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.36.0
- Transformers version: 4.57.3
- Accelerate version: 1.12.0
- PEFT version: 0.18.1
- Bitsandbytes version: 0.49.1
- Safetensors version: 0.7.0
- xFormers version: not installed
- Accelerator: NVIDIA GeForce RTX 5090, 32607 MiB
NVIDIA GeForce RTX 5090, 32607 MiB
NVIDIA GeForce RTX 5090, 32607 MiB
NVIDIA GeForce RTX 5090, 32607 MiB
NVIDIA GeForce RTX 5090, 32607 MiB
NVIDIA GeForce RTX 5090, 32607 MiB
NVIDIA GeForce RTX 5090, 32607 MiB
NVIDIA GeForce RTX 5090, 32607 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@yiyixuxu @DN6
