ch4: implement dynamic netmod av entry size #7204
Open
Pull Request Description
With support for multiple nics and multiple vcis at build time, each av (address vector) entry is expanded into `[MAX_NICS][MAX_VCIS]`, even though most applications don't use all the nics and all the vcis per process, especially in high-PPN cases. This may cause significant memory pressure when launching jobs with high PPN and a large number of nodes.

To avoid wasting memory and to adapt the av entry size to runtime variables such as `MPIR_CVAR_CH4_OFI_MAX_NICS` and `MPIR_CVAR_CH4_NUM_VCIS`, let the netmod report the needed `av_entry_size` during `MPIDI_NM_init_local` and use pointer arithmetic to look up av entries.

Also, set the default `MPIR_CVAR_CH4_OFI_MAX_NICS` to 1 to avoid initializing extra nics in each process. Apparently, libfabric may use a lot of memory for its av table, more so than MPICH's av table. Avoiding the initialization of extra nics saves a great deal of memory.

Concern
AV lookup is in the hot path. This patch adds a memory load for `MPIDI_global.av_entry_size`.

TODO:
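For illustration, the runtime-sized entry and pointer-arithmetic lookup described above could look roughly like the sketch below. This is not the code in this patch: every `demo_*` type, field, and function is hypothetical, the address slot is a placeholder, and the `av_entry_size` field only stands in for `MPIDI_global.av_entry_size`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-(nic, vci) address slot; the real netmod layout differs. */
typedef struct {
    unsigned long dest;
} demo_addr_t;

/* Hypothetical av entry: a fixed header followed by a flexible array that
 * holds num_nics * num_vcis slots, with the counts chosen at runtime. */
typedef struct {
    int is_local;
    demo_addr_t addrs[];
} demo_av_entry_t;

/* Hypothetical table. av_entry_size stands in for MPIDI_global.av_entry_size. */
typedef struct {
    size_t av_entry_size;       /* bytes per entry, fixed at init time */
    int num_nics;
    int num_vcis;
    int size;                   /* number of entries (ranks) */
    char *table;                /* raw storage, size * av_entry_size bytes */
} demo_av_table_t;

/* Entry size the netmod would report for the runtime nic and vci counts.
 * Assumes the header size is a multiple of the slot alignment, which holds
 * for this layout on common ABIs. */
static size_t demo_av_entry_size(int num_nics, int num_vcis)
{
    return sizeof(demo_av_entry_t) +
        (size_t) num_nics * (size_t) num_vcis * sizeof(demo_addr_t);
}

/* Allocate a zeroed table; returns 0 on success and -1 on allocation failure. */
static int demo_av_table_init(demo_av_table_t * t, int size, int num_nics, int num_vcis)
{
    t->av_entry_size = demo_av_entry_size(num_nics, num_vcis);
    t->num_nics = num_nics;
    t->num_vcis = num_vcis;
    t->size = size;
    t->table = calloc((size_t) size, t->av_entry_size);
    return t->table ? 0 : -1;
}

/* Entries are no longer a fixed-size C array, so index by byte offset.
 * Reading t->av_entry_size here is the extra hot-path load noted under
 * "Concern". */
static demo_av_entry_t *demo_av_lookup(const demo_av_table_t * t, int rank)
{
    assert(rank >= 0 && rank < t->size);
    return (demo_av_entry_t *) (t->table + (size_t) rank * t->av_entry_size);
}

/* Flatten the [nic][vci] index using the runtime vci count. */
static demo_addr_t *demo_av_addr(const demo_av_table_t * t, int rank, int nic, int vci)
{
    assert(nic >= 0 && nic < t->num_nics);
    assert(vci >= 0 && vci < t->num_vcis);
    return &demo_av_lookup(t, rank)->addrs[nic * t->num_vcis + vci];
}
```

A caller would run `demo_av_table_init` once with the runtime nic and vci counts, then use `demo_av_addr(&t, rank, nic, vci)` wherever a fixed `[MAX_NICS][MAX_VCIS]` index was used before. In this sketch the per-entry footprint grows with the configured counts rather than the build-time maximums, at the cost of one extra load and multiply per lookup.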
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.