@alexheretic (Contributor) commented Dec 17, 2025

Remove cudnn.enabled = False for AMD cards so MIOpen is enabled again.

Set default env vars when not already specified (so users who care can easily override them):

  • MIOPEN_FIND_MODE=FAST avoids the initial slowdown, particularly for VAE. Letting MIOpen's full search run also seems to have little actual perf benefit (at least in my experience on RDNA3 for SDXL & Wan), so this seems a better default.
  • PYTORCH_MIOPEN_SUGGEST_NHWC=0 resolves the significant regression in ImageUpscaleWithModel perf with MIOpen enabled on ROCm 7 and later.

In particular this improves ImageUpscaleWithModel perf on ROCm 7.1: 7.9s -> 2.4s (using a simple single-image example workflow).
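A minimal sketch of how such defaults could be applied at startup (an illustration of the approach, not the PR's actual diff):

```python
import os

# Sketch only: apply defaults without clobbering anything the user has
# already exported, so they stay easy to override.
os.environ.setdefault("MIOPEN_FIND_MODE", "FAST")          # skip MIOpen's slow kernel search
os.environ.setdefault("PYTORCH_MIOPEN_SUGGEST_NHWC", "0")  # keep NCHW layout; avoids the upscale regression

# Note: these variables are only honored if set before torch initializes
# its ROCm/MIOpen backend, i.e. before `import torch` runs.
```

`os.environ.setdefault` is what makes the defaults non-invasive: a user-exported value always wins.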

Tested on my 7900 GRE (rdna3) on Linux with rocm 7.1 & 6.4.

Resolves #10447
Relates to #10302, #10448, pytorch/pytorch#170764, ROCm/TheRock#2485

Default MIOPEN_FIND_MODE=FAST
Default PYTORCH_MIOPEN_SUGGEST_NHWC=0
@alexheretic (Contributor, Author)

cc @comfyanonymous can you re-check if this works as well as disabling cudnn for your test scenarios? The additional PYTORCH_MIOPEN_SUGGEST_NHWC=0 switch resolves perf issues with rocm7.1 upscaling for me.

@lostdisc

FWIW, I tested these changes for side-effects on SDXL on my Ryzen AI APU. Results:

  • With cudnn enabled, MIOPEN_FIND_MODE=FAST does avoid the extreme VAE slowness on the first run at a given resolution. I had recently tried setting it in my conda environment but surprisingly got no effect, whereas this code is effective. The effect also seems per-session, since reverting the code makes the issue return.
  • However, VAE decode still costs me an extra ~1.5 minutes of high GPU usage every time (at 1280x1600), roughly doubling my gen time compared to having cudnn disabled. Avoiding this side effect out-of-the-box would be preferable. Adding code to selectively disable cudnn for VAE on AMD, like sfinktah's wrapper, may work if combined with MIOPEN_FIND_MODE=FAST; without the latter set, the wrapper still has some first-run slowness.
  • At the start of KSampler, I quickly get 15 lines of warnings like this:
    MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 14745600, provided ptr: 0000000000000000 size: 0
    Where the "workspace required" number varies. The run is able to keep going and finish, though. (Note that if cudnn is enabled and MIOPEN_FIND_MODE is not set to FAST, there is a first-run-for-this-resolution delay of a few minutes at the start of KSampler, but no warning messages.)
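The selective-disable idea mentioned above could be sketched as a small context manager (an illustration of the approach, not sfinktah's actual wrapper; `backend` stands in for `torch.backends.cudnn`):

```python
from contextlib import contextmanager

@contextmanager
def cudnn_disabled(backend):
    """Temporarily set backend.enabled = False for the enclosed region,
    restoring the previous value afterwards (even on error)."""
    previous = backend.enabled
    backend.enabled = False
    try:
        yield
    finally:
        backend.enabled = previous

# Hypothetical usage around VAE decode on AMD:
#   with cudnn_disabled(torch.backends.cudnn):
#       image = vae.decode(latent)
```

PyTorch also ships its own `torch.backends.cudnn.flags(enabled=False)` context manager, which could serve the same purpose.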

I look forward to the day MIOpen itself doesn't cripple VAE, haha.

@alexheretic (Contributor, Author)

I'm using tiled VAE in most places. But IIRC I tested with MIOpen on/off and tiling was still often beneficial either way. For me, disabling MIOpen has a significant negative effect on upscaling perf, but otherwise perf is fairly similar.

With the work upstream to improve MIOpen, I think it makes sense to have it enabled by default. We could tune the defaults further per arch, since my testing is on RDNA3. I could limit this PR to RDNA3 if that helps.

@lostdisc

Yeah, tiled VAE suffers much less from MIOpen somehow. I like getting away with untiled since it shaves ~10s off versus tiled when MIOpen is disabled, but I admit always tiling would have stability benefits in exchange for the tiny speed hit.

@alexheretic (Contributor, Author)

Upstream has disabled NHWC: pytorch/pytorch#170780

I'll test again later to see if this fixes the need to specify it here.

@lostdisc

ComfyUI 0.6 added an env var for enabling MIOpen for testing. And I see AMD is inviting logs to help fix the underlying issue, thank goodness.

@alexheretic closed this Jan 9, 2026
@alexheretic (Contributor, Author) commented Jan 9, 2026

Upstream pytorch has addressed the NHWC issue (pytorch/pytorch#170764), so I guess we can close this pending a future pytorch release. I still think it would be nice for the default settings to provide a better experience out of the box, so it would be good to merge some of these AMD improvements once in a while.

I also still think MIOPEN_FIND_MODE=FAST is the correct default based on my experience with miopen.



Successfully merging this pull request may close these issues.

Disabling cudnn regresses ImageUpscaleWithModel performance on ROCM 6.4
