Skip to content

Find Mode TrustVerify #3604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: develop
Choose a base branch
from
Open

Find Mode TrustVerify #3604

wants to merge 15 commits into from

Conversation

cderb
Copy link
Contributor

@cderb cderb commented Mar 11, 2025

Additional MIOPEN_FIND_MODE = 6 (TrustVerify)
This mode extends DynamicHybrid

Running with TrustVerify will first attempt to load tuning results from system resources
If no solution is returned tuning will be triggered
If a solution is retuned the user find db will be checked for the solution
If solution is from the user find db it will be used
If a solution is from the system (db or model), the solution will be evaluated and the new time will be compared to the time reported by the solution
If the evaluated / reported time is less than the tolerance threshold then the system result is added to the user db and the solution is returned
If the evaluated / reported time exceeds the tolerance, then tuning will be triggered

This find mode will ensure that all configurations are tuned for the deployment system. Solutions from system resources are verified once and tuned if markedly different from expectation. Results from user dbs are considered reliable and used without further verification.

Comment on lines 667 to 668
miopenConvolutionFindModeTrustVerify =
6,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs docs like the others.
The PR description is probably pretty close to what is needed!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added doc

@cderb
Copy link
Contributor Author

cderb commented Mar 12, 2025

The performance of find mode TrustVerify is between the performance of find mode DynamicHybrid and find enforce SearchDbUpdate.
TrustVerify will have the performance of SearchDbUpdate when

  • There is no system db or model solution
  • The solution returned from system is a degree slower than reported by the solution itself based on a 1 time evaluation
    TrustVerify will have the performance of DynamicHybrid when
  • The solution returned from system resource performs as expected based on 1 time evaluation
  • Solution is present in user db

First run with TrustVerify is generally slower than DynamicHybrid.
After the first run, the user find db will be fully populated and user perf db populated with regenerated entries. All following runs should exhibit ideal runtimes.

@cderb cderb requested a review from a team as a code owner March 12, 2025 20:23
@@ -29,7 +29,6 @@ Enable this feature using these commands:

.. code:: bash

export MIOPEN_FIND_MODE=3
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found that setting find enforce disables find mode inputs. So removing these unnecessary lines.

Copy link
Contributor

@amd-jnovotny amd-jnovotny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I've made a couple of minor editorial suggestions.

cderb and others added 2 commits March 12, 2025 15:29
Comment on lines +283 to +284
auto ufdb_sols = miopen::GetSolutions<UserFindDb>(ctx, problem, 1, &invoke_ctx);
if(ufdb_sols.empty())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we need to read the DB twice.
I don't think it's a huge deal since it'll be cached, but if we knew which DB the result came from we could avoid the extra read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, we need to know if the entry is from the system or the user db, as there are different actions depending. There doesn't appear to be a return from the GetSolutions function presently which would indicate that.

false);

const float eval_time = eval_sols.front().GetTime();
constexpr float VERIFY_TOLERANCE = 1.10f;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10% is pretty tight, but maybe we are okay with that?

Copy link
Contributor Author

@cderb cderb Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok to loosen this tolerance. I've also seen much higher degrees of variation even when the selected solver and solver parameters are correct.

Copy link
Contributor

@BrianHarrisonAMD BrianHarrisonAMD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just a couple comments to consider, but I don't think they are blocking.

@BradPepersAMD
Copy link
Collaborator

Can we get a list of current issues this should fix or improve? As well as any situations that this may cause drops compared to existing results? In particular, I'm thinking of cases on distributed runs where they may be starting with a blank user db on every node.

@cderb
Copy link
Contributor Author

cderb commented Apr 14, 2025

This find mode option will function like MIOPEN_FIND_ENFORCE=3 in the worst case. Worst case is when either there is no entry or the system entry is much slower than advertised.
In the best case the entry is either in the user db or the system entry is acceptable (at which point the system entry becomes a user entry). This would be a simple db recall.
This strategy is best for long running workloads as it guarantees that configurations used are optimal.
Assuming the system entries are perfect this mode will be slower on first run than the current default DYNAMIC_HYBRID. This is due to the overhead of benchmarking the kernels to verify their runtimes.

I had wanted to split the option for tuning individual solvers to MIOPEN_FIND_ENFORCE, but this env overrides MIOPEN_FIND_MODE. So presently TRUST_VERIFY also effectively enforces MIOPEN_FIND_ENFORCE=SEARCH_DB_UPDATE, which can take quite some time. Having another option that forgoes individual solver tuning may give better results for shorter running applications and would cap the runtime closer to find mode DYNAMIC_HYBRID.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants