Blank output issue with CUDAExecutionProvider - Onnx Model Converted to fp16 #23797

abhishetty7191 opened this issue Feb 24, 2025 · 0 comments

abhishetty7191 commented Feb 24, 2025

Describe the issue

Issue description:
When a UNet segmentation model converted to ONNX with FP16 precision is run with the CUDAExecutionProvider, the output is blank. The issue occurs at inference time: the model should produce segmentation masks for the input images, but instead it returns empty (blank) results.

Expected Behavior:
The output with CUDAExecutionProvider should match the PyTorch output and the outputs of the other EPs (CPUExecutionProvider and TensorrtExecutionProvider).

Results:

[image: results]

To reproduce

Comparison of outputs from PyTorch and from ONNX Runtime with different Execution Providers (EPs): CPUExecutionProvider, CUDAExecutionProvider, and TensorrtExecutionProvider.

from segmentation_models_pytorch import Unet
import torch

torch.manual_seed(0)

import torchvision.transforms as transforms
import cv2
import numpy as np
import onnxruntime as ort
import matplotlib.pyplot as plt

print(f"Torch version: {torch.__version__}")
print(f"Onnxruntime version: {ort.__version__}")

Torch version: 2.2.2+cu121
Onnxruntime version: 1.20.1

Preprocessing Transformations for Input Images

preprocess = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

Load and preprocess the input image using OpenCV

Image Reference

image_path = "n01644373_tree_frog.JPEG"  # Replace with your image path
input_image = cv2.imread(image_path)
input_image = cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)
input_image_resized = cv2.resize(input_image, (512, 512))
plt.imshow(input_image)
plt.axis("off")
plt.show()
input_data = preprocess(input_image_resized)
input_data = input_data.unsqueeze(0)
input_data = input_data.numpy()
# Check if a GPU is available and use it if possible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

[image: input image]

Define the Unet model

def load_pytorch_model():
    model = Unet(
        encoder_name="timm-efficientnet-b2",
        in_channels=3,
        classes=5,
        encoder_weights="imagenet",
    )
    model.eval()  # Set the model to evaluation mode
    model.to(device)
    return model

PyTorch

Define PyTorch inference and post-processing

def pytorch_inference(model, input):

    input_batch = torch.Tensor(input).to(device)

    # Perform inference
    with torch.no_grad():
        output = model(input_batch)

    # Postprocess the output
    output = output.cpu().numpy()
    return output.argmax(axis=1)[0]

Final output - resize to original image size

def resize_output(input):
    return cv2.resize(input, (input_image.shape[1], input_image.shape[0]))

PyTorch output

pytorch_model = load_pytorch_model()
pytorch_output = pytorch_inference(pytorch_model, input_data)
plt.imshow(resize_output(pytorch_output.astype(np.uint8) * 50), cmap="gray")
plt.axis("off")

[image: PyTorch output mask]

ONNX Runtime EPs

Define model conversion to ONNX

The model is converted to FP16 while its input and output are kept as float32 (keep_io_types=True); a sanity check on the resulting IO types follows the conversion below.

def onnx_conversion(pytorch_model, output_path, enable_fp16=False):
    import onnx
    from onnxconverter_common import float16

    # FP32CastedModel class:
    class FP32CastedModel(torch.nn.Module):
        def __init__(self, model):
            super().__init__()
            self.model = model

        def __call__(self, input):
            with torch.no_grad():
                output = self.model(input)
            return output.to(torch.float32)

    random_tensor = torch.randn(
        size=(1, 3, 512, 512), requires_grad=True, dtype=torch.float32
    ).cuda()

    with torch.no_grad():
        torch.onnx.export(
            model=FP32CastedModel(pytorch_model),
            f=output_path,
            args=random_tensor,
            export_params=True,
            input_names=["input"],
            output_names=["output"],
            do_constant_folding=True,
            opset_version=17,
            dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
        )

    model = onnx.load(output_path)
    if enable_fp16:
        model = float16.convert_float_to_float16(model, keep_io_types=True)
        onnx.save(model, output_path)
    return model

Use the PyTorch model to generate the ONNX model

This step produces warnings such as:

UserWarning: the float32 number -3.94251777890986e-08 will be truncated to -1e-07 warnings.warn("the float32 number {} will be truncated to {}".format(neg_max, -min_positive_val))
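The warning comes from onnxconverter_common clamping float32 values whose magnitude falls outside what the converter considers safe for FP16 (by default, magnitudes below 1e-7 or above 1e4). Not part of the original repro, but as a hedged sketch of the knobs that float16.convert_float_to_float16 exposes for this (the file names and the op list below are examples only, not recommendations):

import onnx
from onnxconverter_common import float16

# Example only: adjust the clamping thresholds and/or keep selected op types
# in float32 during conversion. The paths are hypothetical.
fp32_model = onnx.load("unet_model_fp32.onnx")
fp16_model = float16.convert_float_to_float16(
    fp32_model,
    keep_io_types=True,
    min_positive_val=1e-7,   # magnitudes below this are clamped (source of the warning above)
    max_finite_val=1e4,      # magnitudes above this are clamped
    op_block_list=["Resize", "Softmax"],  # example op types left in float32
)
onnx.save(fp16_model, "unet_model_fp16_blocked.onnx")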

enable_fp16 = True
onnx_model_name = "unet_model_fp16.onnx" if enable_fp16 else "unet_model.onnx"
onnx_model = onnx_conversion(pytorch_model, onnx_model_name, enable_fp16)
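Not part of the original repro: a quick sanity check on the onnx_model object returned above, confirming that the graph inputs/outputs stayed float32 (because keep_io_types=True) while the weight initializers became float16.

from onnx import TensorProto

# Graph inputs/outputs should still report FLOAT (float32)
for io in list(onnx_model.graph.input) + list(onnx_model.graph.output):
    print(io.name, TensorProto.DataType.Name(io.type.tensor_type.elem_type))

# Most initializers (weights) should now be FLOAT16
dtypes = [init.data_type for init in onnx_model.graph.initializer]
print("float16 initializers:", sum(d == TensorProto.FLOAT16 for d in dtypes), "of", len(dtypes))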

Define loading the ONNX model separately for each EP

For CUDAExecutionProvider, the flag cudnn_conv_use_max_workspace is set to 1.

For TensorrtExecutionProvider, the flag trt_fp16_enable is set to True for the fp16 model.

def load_onnx_model_cpu(model_path):
    """onnx CPU inference"""
    providers = [
        (
            "CPUExecutionProvider",
            {},
        )
    ]
    sess_options = ort.SessionOptions()
    # sess_options.log_severity_level = 0  # Verbose
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

    return ort.InferenceSession(
        model_path, sess_options=sess_options, providers=providers
    )

def load_onnx_model_cuda(model_path):
    """onnx CUDA inference"""
    providers = [
        (
            "CUDAExecutionProvider",
            {
                "device_id": 0,
                "arena_extend_strategy": "kNextPowerOfTwo",
                "gpu_mem_limit": 10 * 1024 * 1024 * 1024,
                "cudnn_conv_algo_search": "HEURISTIC",
                "do_copy_in_default_stream": True,
                "cudnn_conv_use_max_workspace": 1,
            },
        )
    ]
    sess_options = ort.SessionOptions()
    # sess_options.log_severity_level = 0  # Verbose
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

    return ort.InferenceSession(
        model_path, sess_options=sess_options, providers=providers
    )

def load_onnx_model_trt(model_path, enable_fp16=False):
    """onnx TRT inference"""
    providers = [
        (
            "TensorrtExecutionProvider",
            {
                "device_id": 0,
                "trt_fp16_enable": enable_fp16,
                "trt_builder_optimization_level": 5,
                "trt_max_workspace_size": 10 * 1024 * 1024 * 1024,
                "trt_engine_cache_enable": True,
                "trt_timing_cache_enable": True,
                "trt_engine_cache_path": "ModelCache/",
                "trt_timing_cache_path": "ModelCache/",
            },
        )
    ]
    sess_options = ort.SessionOptions()
    # sess_options.log_severity_level = 0  # Verbose
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

    return ort.InferenceSession(
        model_path, sess_options=sess_options, providers=providers
    )

Load the model with each EP separately; a quick check of the registered providers follows the loading code below.

onnx_cpu = load_onnx_model_cpu(onnx_model_name)
onnx_cuda = load_onnx_model_cuda(onnx_model_name)
onnx_trt = load_onnx_model_trt(onnx_model_name, enable_fp16)
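Not in the original repro, but a quick check on the sessions just created that each one registered the intended EP instead of silently falling back to CPUExecutionProvider:

# Providers actually registered for each session (fallback providers appear at the end)
print("cpu :", onnx_cpu.get_providers())
print("cuda:", onnx_cuda.get_providers())
print("trt :", onnx_trt.get_providers())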

Run inference with each Execution Provider (EP) separately

def ort_inference(model, input):
    # Use the session passed in (the original used the global onnx_cuda session here)
    input_name = model.get_inputs()[0].name
    output = model.run(None, {input_name: input})
    return output[0].astype(np.float32).argmax(axis=1)[0]

# cpu inference
ort_output_cpu = ort_inference(onnx_cpu, input_data)
# cuda inference
ort_output_cuda = ort_inference(onnx_cuda, input_data)
# trt inference
ort_output_trt = ort_inference(onnx_trt, input_data)
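To quantify the mismatch rather than only comparing the argmax masks visually, a minimal check on the raw logits (reusing the sessions and input_data above; the input name "input" comes from the export call):

# Compare raw model outputs (before argmax) across EPs
raw_cpu = onnx_cpu.run(None, {"input": input_data})[0].astype(np.float32)
raw_cuda = onnx_cuda.run(None, {"input": input_data})[0].astype(np.float32)
print("max |cpu - cuda|:", np.max(np.abs(raw_cpu - raw_cuda)))
print("cuda output all zeros:", bool(not np.any(raw_cuda)))
print("cuda output contains NaN:", bool(np.isnan(raw_cuda).any()))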

Final Results - input image and predicted outputs

fig, axs = plt.subplots(2, 3)
axs[0][0].axis("off")
axs[0][1].axis("off")
axs[0][2].axis("off")
axs[1][0].axis("off")
axs[1][1].axis("off")
axs[1][2].axis("off")
axs[0][0].imshow(input_image)
axs[0][0].title.set_text("Input")
axs[0][1].imshow(resize_output(pytorch_output.astype(np.uint8) * 50), cmap="gray")
axs[0][1].title.set_text("Pytorch output")
axs[1][0].imshow(resize_output(ort_output_cpu.astype(np.uint8) * 50), cmap="gray")
axs[1][0].title.set_text("ort_cpu output")
axs[1][1].imshow(resize_output(ort_output_cuda.astype(np.uint8) * 50), cmap="gray")
axs[1][1].title.set_text("ort_cuda output")
axs[1][2].imshow(resize_output(ort_output_trt.astype(np.uint8) * 50), cmap="gray")
axs[1][2].title.set_text("ort_trt output")

[image: input image and predicted outputs for PyTorch, ORT CPU, ORT CUDA, and ORT TRT]

Urgency

High; using TensorrtExecutionProvider as a workaround for now.

Platform

Windows

OS Version

11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.20.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

12.6

cuDNN Library Version

9.5.1
