ch4/ofi: Include NIC information in error messages #7224

Open · wants to merge 2 commits into base: main

Contributor

@raffenet raffenet commented Nov 25, 2024

Pull Request Description

On systems with multiple NICs, it can be helpful to know which NIC an
error was detected on in case there are hardware issues that need
investigating. Add NIC information to the error checking macros. For now, we
report the default NIC used by each process. TODO: extend to support
multi-NIC usage and take the device number as input for more
fine-grained reporting.
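
To illustrate the idea, here is a minimal sketch of an error checking macro that appends the default NIC name to the function and line information. It is not the code in this PR: the `demo_*` types, the `DEMO_CHECK` macro, the NIC names, and the message format are all hypothetical stand-ins; only the `prov_use[nic]->domain_attr->name` access pattern is taken from the diff.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the provider table; the real MPICH
 * structures carry far more state than a name. */
struct demo_domain_attr { const char *name; };
struct demo_prov { struct demo_domain_attr *domain_attr; };

static struct demo_domain_attr demo_attrs[] = { { "demo_nic0" }, { "demo_nic1" } };
static struct demo_prov demo_provs[] = { { &demo_attrs[0] }, { &demo_attrs[1] } };
static struct demo_prov *demo_prov_use[] = { &demo_provs[0], &demo_provs[1] };

/* Same access pattern as the macros added in the diff. */
#define DEMO_NIC_NAME(nic) (demo_prov_use[nic]->domain_attr->name)
#define DEMO_DEFAULT_NIC_NAME (DEMO_NIC_NAME(0))

/* Last recorded error message; empty when no error was recorded. */
static char demo_last_error[256];

/* On a negative return code, record function, line, and default NIC name. */
#define DEMO_CHECK(ret, what)                                          \
    do {                                                               \
        int demo_ret_ = (ret);                                         \
        if (demo_ret_ < 0) {                                           \
            snprintf(demo_last_error, sizeof(demo_last_error),         \
                     "%s failed (%s:%d, nic=%s, ret=%d)", (what),      \
                     __func__, __LINE__, DEMO_DEFAULT_NIC_NAME,        \
                     demo_ret_);                                       \
        }                                                              \
    } while (0)

/* Pretend to issue an operation whose return code is `ret`. */
static const char *demo_send(int ret)
{
    assert(demo_prov_use[0] != NULL);   /* default NIC must be set up */
    demo_last_error[0] = '\0';
    DEMO_CHECK(ret, "demo operation");
    return demo_last_error;
}

/* Returns nonzero if the last recorded message mentions `needle`. */
static int demo_error_names_nic(const char *needle)
{
    return strstr(demo_last_error, needle) != NULL;
}
```

In this sketch, calling `demo_send(-5)` records a message that names the calling function, the line of the check, and the NIC at index 0, regardless of which NIC the operation actually used.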

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@raffenet
Contributor Author

test:mpich/ch4/ofi

The MPIR error checking macros already add __LINE__ and __func__
information when an error is reported. __SHORT_FILE__ is not necessary
since we have the function name, which any decent editor can find
automatically.
@raffenet
Contributor Author

test:mpich/ch4/ofi

@@ -38,6 +38,9 @@ ATTRIBUTE((unused));

#define MPIDI_OFI_WIN(win) ((win)->dev.netmod.ofi)

+#define MPIDI_OFI_NIC_NAME(nic) (MPIDI_OFI_global.prov_use[nic]->domain_attr->name)
+#define MPIDI_OFI_DEFAULT_NIC_NAME (MPIDI_OFI_NIC_NAME(0))
Contributor

I worry that when we are using a non-default NIC, we will be providing misleading information.

Contributor Author

I'll add a patch to use explicit info.
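
One possible shape for reporting with explicit info, sketched under assumptions: the error macro takes the NIC index as an argument instead of always using index 0. The `SKETCH_*` names, the NIC name table, and the message format are hypothetical and are not taken from the follow-up patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical NIC name table standing in for the provider list. */
#define SKETCH_NUM_NICS 2
static const char *sketch_nic_names[SKETCH_NUM_NICS] = { "nic_a", "nic_b" };

#define SKETCH_NIC_NAME(nic) (sketch_nic_names[nic])

/* Format an error message that names the NIC the caller actually used.
 * Evaluates to the snprintf return value. */
#define SKETCH_ERR_NIC(buf, len, nic, what)                       \
    snprintf((buf), (len), "%s failed (%s:%d, nic=%s)", (what),   \
             __func__, __LINE__, SKETCH_NIC_NAME(nic))

/* Report a failure of a hypothetical operation issued on NIC `nic`. */
static int sketch_report(char *buf, size_t len, int nic)
{
    assert(nic >= 0 && nic < SKETCH_NUM_NICS);
    return SKETCH_ERR_NIC(buf, len, nic, "demo operation");
}
```

With this shape, a failure on the second NIC is reported as `nic=nic_b` rather than the default name, which addresses the concern about misleading output at the cost of threading the index through every call site.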
