Find Mode TrustVerify #3604

cderb · 2025-03-11T21:32:38Z

Adds a new find mode, MIOPEN_FIND_MODE = 6 (TrustVerify).
This mode extends DynamicHybrid.

Running with TrustVerify first attempts to load tuning results from system resources.
If no solution is returned, tuning is triggered.
If a solution is returned, the user find db is checked for that solution.
If the solution is from the user find db, it is used.
If the solution is from the system (db or model), it is evaluated and the measured time is compared to the time reported by the solution.
If the ratio of evaluated to reported time is below the tolerance threshold, the system result is added to the user db and the solution is returned.
If the ratio exceeds the tolerance, tuning is triggered.
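The decision flow above can be modeled in a few lines of C++. This is a minimal sketch, not MIOpen code: `Solution`, `Source`, `Action`, `TrustVerify` and `kVerifyTolerance` are hypothetical names, the `evaluate` callback stands in for the one-time benchmark, and the 1.10 ratio mirrors the `VERIFY_TOLERANCE` constant quoted in the review.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical types: they model the TrustVerify decisions, not MIOpen's API.
enum class Source { UserFindDb, SystemDb, Model };
enum class Action { UseUserDb, PromoteSystemResult, Tune };

struct Solution {
    std::string solver; // solver identifier
    float time_ms;      // time reported by the solution
    Source source;      // where the solution came from
};

// Assumed to match the 1.10f tolerance discussed in the review.
constexpr float kVerifyTolerance = 1.10f;

Action TrustVerify(const std::optional<Solution>& found,
                   const std::function<float(const Solution&)>& evaluate)
{
    if(!found)
        return Action::Tune;      // no system db or model result: tune
    if(found->source == Source::UserFindDb)
        return Action::UseUserDb; // user db results are trusted without re-checking
    if(found->time_ms <= 0.0f)
        return Action::Tune;      // no usable reported time to verify against
    const float eval_ms = evaluate(*found); // one-time benchmark of the system result
    if(eval_ms / found->time_ms < kVerifyTolerance)
        return Action::PromoteSystemResult; // add to the user db and return it
    return Action::Tune;                    // much slower than reported: tune
}
```

Passing the benchmark in as a callback keeps the sketch testable without a GPU; a real implementation would run the kernel instead.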

This find mode will ensure that all configurations are tuned for the deployment system. Solutions from system resources are verified once and tuned if markedly different from expectation. Results from user dbs are considered reliable and used without further verification.


BrianHarrisonAMD · 2025-03-11T21:39:16Z

include/miopen/miopen.h

+    miopenConvolutionFindModeTrustVerify =
+        6,


Needs docs like the others.
The PR description is probably pretty close to what is needed!

cderb · 2025-03-12T19:42:16Z

The performance of find mode TrustVerify falls between that of find mode DynamicHybrid and find enforce SearchDbUpdate.
TrustVerify performs like SearchDbUpdate when:

There is no system db or model solution.
The solution returned from system resources is slower than its reported time by more than the tolerance, based on a one-time evaluation.

TrustVerify performs like DynamicHybrid when:

The solution returned from system resources performs as expected, based on a one-time evaluation.
The solution is present in the user db.

The first run with TrustVerify is generally slower than with DynamicHybrid.
After the first run, the user find db is fully populated and the user perf db contains the regenerated entries. All subsequent runs should exhibit ideal runtimes.
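Why only the first run pays the extra cost can be illustrated with a toy simulation. This is a hypothetical sketch: the user find db is modeled as a `std::set` of configuration keys, the system db as a map from key to "does this entry verify?", and `RunStats` and `RunOnce` are illustrative names, not MIOpen functions.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Counters for one run over a list of configurations (hypothetical model).
struct RunStats {
    int tuned = 0;    // configs that triggered tuning
    int verified = 0; // system results benchmarked once and promoted
    int recalled = 0; // plain user db hits
};

// system_db maps a config key to whether its system result passes verification;
// a key absent from system_db has no system db or model result at all.
RunStats RunOnce(const std::vector<std::string>& configs,
                 const std::map<std::string, bool>& system_db,
                 std::set<std::string>& user_db)
{
    RunStats stats;
    for(const auto& cfg : configs)
    {
        if(user_db.count(cfg) != 0)
        {
            ++stats.recalled; // trusted user db entry: simple recall
            continue;
        }
        const auto it = system_db.find(cfg);
        if(it != system_db.end() && it->second)
            ++stats.verified; // system result matched its reported time
        else
            ++stats.tuned;    // miss, or system result too slow: tune
        user_db.insert(cfg);  // either way, the result lands in the user db
    }
    return stats;
}
```

Calling `RunOnce` twice with the same `user_db` shows the second run reduced to recalls. A node that starts each job with an empty user db repeats the first-run cost every time.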

cderb · 2025-03-12T20:25:51Z

docs/conceptual/tuningdb.rst

@@ -29,7 +29,6 @@ Enable this feature using these commands:

 .. code:: bash

-  export MIOPEN_FIND_MODE=3


I found that setting find enforce disables the find mode input, so I removed these unnecessary lines.
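The precedence described here can be sketched as a small helper. This assumes only what the comment states: a non-empty MIOPEN_FIND_ENFORCE takes priority and MIOPEN_FIND_MODE is then ignored. `EffectiveFindControl` is a hypothetical helper, not MIOpen's actual environment parsing.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical: report which environment variable governs find behavior.
std::string EffectiveFindControl()
{
    // Find enforce, when set, overrides the find mode input entirely (assumed).
    if(const char* enforce = std::getenv("MIOPEN_FIND_ENFORCE"); enforce && *enforce)
        return std::string("MIOPEN_FIND_ENFORCE=") + enforce;
    if(const char* mode = std::getenv("MIOPEN_FIND_MODE"); mode && *mode)
        return std::string("MIOPEN_FIND_MODE=") + mode;
    return "default find mode";
}
```

Under this model, exporting both variables in a docs example is redundant, which is why the find mode line was dropped.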

amd-jnovotny

Looks good. I've made a couple of minor editorial suggestions.

docs/how-to/find-and-immediate.rst

Co-authored-by: Jeffrey Novotny <[email protected]>

BrianHarrisonAMD · 2025-03-30T17:57:01Z

src/ocl/convolutionocl.cpp

+            auto ufdb_sols = miopen::GetSolutions<UserFindDb>(ctx, problem, 1, &invoke_ctx);
+            if(ufdb_sols.empty())


Looks like we need to read the DB twice.
I don't think it's a huge deal since it'll be cached, but if we knew which DB the result came from we could avoid the extra read.

Correct, we need to know whether the entry comes from the system db or the user db, because each case requires a different action. GetSolutions does not currently return anything that indicates the source.
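One way to avoid the second read is to have the lookup report which db answered. The sketch below is hypothetical: `Entry`, `TaggedEntry`, `DbSource` and `FindWithSource` are illustrative names, the dbs are plain maps, and checking the user db before the system db is an assumption rather than MIOpen's documented lookup order.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical single-pass lookup that tags each result with its source db.
enum class DbSource { User, System };

struct Entry { std::string solver; float time_ms; };
struct TaggedEntry { Entry entry; DbSource source; };

using Db = std::map<std::string, Entry>; // problem key -> best entry

std::optional<TaggedEntry> FindWithSource(const std::string& key, const Db& user_db, const Db& system_db)
{
    if(auto it = user_db.find(key); it != user_db.end())
        return TaggedEntry{it->second, DbSource::User}; // assumed to take precedence
    if(auto it = system_db.find(key); it != system_db.end())
        return TaggedEntry{it->second, DbSource::System};
    return std::nullopt; // miss in both dbs: the caller would tune
}
```

With the source attached, the caller can branch directly to "use" or "verify" without calling the user db lookup again.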

BrianHarrisonAMD · 2025-03-30T18:03:57Z

src/ocl/convolutionocl.cpp

+                                         false);
+
+                    const float eval_time            = eval_sols.front().GetTime();
+                    constexpr float VERIFY_TOLERANCE = 1.10f;


10% is pretty tight, but maybe we are okay with that?

It's ok to loosen this tolerance. I've also seen much higher degrees of variation even when the selected solver and solver parameters are correct.
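The effect of the tolerance on noisy measurements is easy to see with a worked case. In this hypothetical helper (`PassesVerification` is not an MIOpen function), a solution reporting 100 us that measures 112 us once has a ratio of 1.12: it fails at 1.10 and would trigger tuning, but passes at a looser 1.25.

```cpp
// Hypothetical helper: the evaluated / reported ratio test with a tunable tolerance.
bool PassesVerification(float eval_time, float reported_time, float tolerance)
{
    if(reported_time <= 0.0f)
        return false; // nothing meaningful to compare against
    return eval_time / reported_time < tolerance;
}
```

Since a failed check triggers a full tuning pass, a tolerance that is tighter than normal run-to-run variation turns correct system entries into tuning cost.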

BrianHarrisonAMD

LGTM. Just a couple of comments to consider, but I don't think they are blocking.

BradPepersAMD · 2025-04-14T13:06:07Z

Can we get a list of the current issues this should fix or improve, as well as any situations where it may cause drops compared to existing results? In particular, I'm thinking of distributed runs where each node may start with a blank user db.

cderb · 2025-04-14T17:33:13Z

In the worst case, this find mode behaves like MIOPEN_FIND_ENFORCE=3. The worst case is when there is either no entry or the system entry is much slower than advertised.
In the best case, the entry is either in the user db or the system entry is acceptable (at which point the system entry becomes a user entry); this is a simple db recall.
This strategy is best suited to long-running workloads, as it ensures that every configuration used has been either verified or tuned on the deployment system.
Even if the system entries are perfect, this mode is slower on the first run than the current default, DYNAMIC_HYBRID, because of the overhead of benchmarking the kernels to verify their runtimes.

I had wanted to split the option for tuning individual solvers out into MIOPEN_FIND_ENFORCE, but that variable overrides MIOPEN_FIND_MODE. So at present, TRUST_VERIFY also effectively enforces MIOPEN_FIND_ENFORCE=SEARCH_DB_UPDATE, which can take quite some time. Another option that skips individual solver tuning may give better results for shorter-running applications and would keep the runtime closer to that of find mode DYNAMIC_HYBRID.

cderb added 9 commits February 26, 2025 17:59

add find mode TrustVerify, first draft

097f485

fix build errors

12c9b0f

Merge remote-tracking branch 'pub/develop' into cderb/find_mode_verify

d8990b5

tune when trustverify and performance check fails, add logging statements

c0a9d9c

add api test for new find mode

4721681

stop addition of results when trustverify has regenerated, debug prints

881465c

fetch max workspace for trustverify

ccffe8a

remove log

5d551ad

trustverify triggers tuning on miss

eeda496

cderb requested review from BrianHarrisonAMD, BradPepersAMD, adickin-amd and JonathanLichtnerAMD as code owners March 11, 2025 21:32

BrianHarrisonAMD reviewed Mar 11, 2025


cderb and others added 2 commits March 11, 2025 18:18

Merge branch 'develop' into cderb/find_mode_verify

12c1cff

clang format

4234f4c

cderb added 2 commits March 12, 2025 15:00

clang format

eaba243

update documentation for TrustVerify

b47bc73

cderb requested a review from a team as a code owner March 12, 2025 20:23

cderb commented Mar 12, 2025


amd-jnovotny approved these changes Mar 12, 2025


docs/how-to/find-and-immediate.rst (outdated, resolved)

docs/how-to/find-and-immediate.rst (outdated, resolved)

cderb and others added 2 commits March 12, 2025 15:29

Apply suggestions from code review

759ac80

Co-authored-by: Jeffrey Novotny <[email protected]>

hip tidy

a58f556

cderb added the TESTING_CI_PASSED label Mar 14, 2025

BrianHarrisonAMD reviewed Mar 30, 2025


BrianHarrisonAMD approved these changes Mar 30, 2025


BrianHarrisonAMD mentioned this pull request Apr 11, 2025

BathNorm tune API #3646

Open
