Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2024.10.17 preset extensions #284

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

dbarry9
Copy link
Contributor

@dbarry9 dbarry9 commented Nov 15, 2024

Pull Request Description

This PR adds support for presets defined using native events from non-CPU components. The only component that has been modified in this PR is the 'cuda' component, which is limited to loading the preset table for 'cuda' presets.

These changes have been tested on the NVIDIA Grace-Hopper architecture.

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
  • Commits
    Commits are self contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

@dbarry9 dbarry9 added the status-work-in-progress PR is still being worked on label Nov 15, 2024
i &= PAPI_PRESET_AND_MASK;

/* Iterate over all or all available presets. */
if ( modifier == PAPI_ENUM_EVENTS || modifier == PAPI_PRESET_ENUM_AVAIL ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested both of these modifiers:

  • PAPI_ENUM_EVENTS, does go over all available CPU Events, but for the Cuda presets I output PAPI_CUDA_HP_FMA (the first listed Cuda preset on hopper1 at Oregon) and the other three events return PAPI_ENOTPRESET.
  • PAPI_PRESET_ENUM_AVAIL, does not output just the available preset events.

Copy link
Contributor Author

@dbarry9 dbarry9 Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Treece. I have pushed changes to address the second bullet. However, I am unable to reproduce the first bullet. Are you able to reproduce this, now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PAPI_enum_event, both modifiers PAPI_ENUM_EVENTS and PAPI_PRESET_ENUM_AVAIL behave as expected. I tested this on a single H100.

The error PAPI_ENOTPRESET does not come from PAPI_enum_event, but comes from PAPI_event_code_to_name (sorry for the confusion). The first shown Cuda preset (PAPI_CUDA_FP16_FMA) does correctly convert, but when I tried with the respective hex codes for PAPI_CUDA_FP32_FMA, PAPI_CUDA_FP64_FMA, PAPI_CUDA_FP_FMA, and PAPI_CUDA_FP8_OPS I am met with PAPI_ENOTPRESET.

Here is a reproducer:

#include "papi.h"

#include <stdio.h>
#include <stdlib.h>

int main() 
{
    int retval;
    char evtName[PAPI_MAX_STR_LEN];

    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        printf("Failed init: %d\n", retval);
        exit(1);
    }   

    //int evtCode = 0x80000080; // PAPI_CUDA_FP16_FMA and this works
    int evtCode = 0x80000082; // PAPI_CUDA_FP32_FMA and this does not work
    retval = PAPI_event_code_to_name(evtCode, evtName);
    if (retval != PAPI_OK) {
        printf("Failed code to name: %d\n", retval);
        exit(1);
    }   
    printf("Event Name is: %s\n", evtName);

    return PAPI_OK;   
}

i &= PAPI_PRESET_AND_MASK;

/* Iterate over all or all available presets. */
if ( modifier == PAPI_ENUM_EVENTS || modifier == PAPI_PRESET_ENUM_AVAIL ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the CPU:

PAPI_PRESET_ENUM_AVAIL is showing 108 available preset events on the machine I am testing on. While the master branch is only showing 47 (we should get 47 back).

For the GPU:
PAPI_ENUM_EVENTS is only showing PAPI_CUDA_HP_FMA when starting at 0x80000080. Likewise for PAPI_PRESET_ENUM_AVAIL.

Copy link
Contributor Author

@dbarry9 dbarry9 Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Treece. I have pushed changes to address the first bullet. However, I am unable to reproduce the second bullet. Are you able to reproduce this, now?

Copy link
Contributor

@Treece-Burgess Treece-Burgess Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first bullet point, I still end up seeing all CPU presets rather than just the available presets. This does not occur in the master branch. Reproducer below:

#include "papi.h"

#include <stdio.h>
#include <stdlib.h>

int main() 
{
    int retval;

    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        printf("Failed init: %d\n", retval);
        exit(1);
    }   

    int evtCode = 0x80000000;
    int modifier = PAPI_PRESET_ENUM_AVAIL;
    int cidx = PAPI_get_component_index("perf_event");
    do {

        printf("Code: %x\n", evtCode);

    } while (PAPI_enum_cmp_event(&evtCode, modifier, cidx) == PAPI_OK); 


    return PAPI_OK;   
}

For the second bullet point, I have a reproducer below for this and you should be able to reproduce this on Gilgamesh at Oregon which has an H100. The current behavior is that only 80000080 (PAPI_CUDA_FP16_FMA) will output.

#include "papi.h"

#include <stdio.h>
#include <stdlib.h>

int main() 
{
    int retval;

    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        printf("Failed init: %d\n", retval);
        exit(1);
    }   

    int evtCode = 0x80000080;
    int modifier = PAPI_ENUM_EVENTS; // Also occurs for PAPI_PRESET_ENUM_AVAIL
    int cidx = PAPI_get_component_index("cuda");
    do {

        printf("Code: %x\n", evtCode);

    } while (PAPI_enum_cmp_event(&evtCode, modifier, cidx) == PAPI_OK); 


    return PAPI_OK;   
}

@dbarry9 dbarry9 force-pushed the 2024.10.17_preset-extensions branch 4 times, most recently from 935ea80 to eaa020e Compare November 25, 2024 18:25
@dbarry9 dbarry9 force-pushed the 2024.10.17_preset-extensions branch 4 times, most recently from b57602c to 6c536d1 Compare November 27, 2024 23:09
@dbarry9 dbarry9 force-pushed the 2024.10.17_preset-extensions branch 2 times, most recently from 1212e11 to 6406ed1 Compare December 5, 2024 20:23
@Treece-Burgess Treece-Burgess added the update-presets PRs related to updating the PAPI presets label Dec 19, 2024
@dbarry9 dbarry9 force-pushed the 2024.10.17_preset-extensions branch 12 times, most recently from 34fb833 to c8727c1 Compare February 11, 2025 17:22
@dbarry9 dbarry9 force-pushed the 2024.10.17_preset-extensions branch 2 times, most recently from de36142 to a4c22a4 Compare February 11, 2025 21:57
@Treece-Burgess
Copy link
Contributor

@dbarry9 On an H100, when I run papi_avail I get the following output:

================================================================================
  PAPI Component Presets
================================================================================
       Name           Code    Avail Deriv Description (Note)
PAPI_CUDA_FP16_FMA 0x80000080  Yes   No   CUDA Half precision (FP16) FMA instructions
    :device=0
        Mandatory device qualifier [0]
--------------------------------------------------------------------------------
PAPI_CUDA_FP32_FMA 0x80000082  Yes   No   CUDA Single precision (FP32) FMA instructions
    :device=0
        Mandatory device qualifier [0]
--------------------------------------------------------------------------------
PAPI_CUDA_FP64_FMA 0x80000083  Yes   No   CUDA Double precision (FP64) FMA instructions
    :device=0
        Mandatory device qualifier [0]
--------------------------------------------------------------------------------
PAPI_CUDA_FP_FMA   0x80000084  Yes   Yes  CUDA floating-point FMA instructions
    :device=0
        Mandatory device qualifier [0]
--------------------------------------------------------------------------------
PAPI_CUDA_FP8_OPS  0x80000085  Yes   No   CUDA 8-bit precision floating-point operations
    :device=0
        Mandatory device qualifier [0]
================================================================================

Should the 0x80000081 GPU preset which I believe is PAPI_CUDA_BF16_FMA not output as well?

@Treece-Burgess
Copy link
Contributor

@dbarry9 For CPU presets, you could take PAPI_TOT_INS and use it within application code. However, if I try to do the same for the Cuda GPU presets, I will run into this error:

test_pr284.c:19:19: error: ‘PAPI_CUDA_FP16_FMA’ undeclared (first use in this function); did you mean ‘PAPI_DP_OPS_idx’?
     int evtCode = PAPI_CUDA_FP16_FMA;
                   ^~~~~~~~~~~~~~~~~~
                   PAPI_DP_OPS_idx

Is this behavior expected?

cuda,GH100
cuda,GA106
cuda,GA100
#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbarry9 The papi_events.csv file is missing PAPI_CUDA_BF16_FMA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not yet done an exhaustive sweep of native events to define this. It might be better to not include PAPI_CUDA_BF16_FMA (in the array located in components/cuda/papi_cuda_presets.h) until we have at least one architecture with verified support for PAPI_CUDA_BF16_FMA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would make sense to do.

One thing I will want to test if PAPI_CUDA_BF16_FMA is removed from the array in papi_cuda_presets.h is if the Cuda presets shown in papi_avail will then not skip the event code 0x80000081.

@Treece-Burgess
Copy link
Contributor

If you do not compile in the Cuda component and then run papi_avail, the section for PAPI Component Presets will show:

================================================================================
  PAPI Component Presets
================================================================================
Name   Code    Avail Deriv Description (Note)
================================================================================

If the CPU presets are not defined on an architecture or I use --disable-cpu, running papi_avail will still output the CPU presets, but just state that they are not available.

Would it be a good idea here to follow suite with how the CPU presets handle this or output a message that states "No components compiled in that support PAPI Component Presets"?

@Treece-Burgess
Copy link
Contributor

I just built PAPI on an A100 with ./configure --prefix=$PWD/test-install --disable-cpu --with-components="cuda". papi_component_avail shows that the Cuda component is active, but when I run ./papi_avail I only see CPU presets. The PAPI Component Presets section is blank:

================================================================================
  PAPI Component Presets
================================================================================
       Name           Code    Avail Deriv Description (Note)
================================================================================
Of 108 possible events, 0 are available, of which 0 are derived.

Along with --disable-cpu, I experience this behavior with --disable-perf-event, but not with --disable-perf-event-uncore.

@@ -1864,6 +2035,9 @@ PAPI_enum_event( int *EventCode, int modifier )
* <li> PAPI_PRESET_ENUM_L3 -- L3 cache related preset events
* <li> PAPI_PRESET_ENUM_TLB -- Translation Lookaside Buffer events
* <li> PAPI_PRESET_ENUM_FP -- Floating Point related preset events
* <li> PAPI_PRESET_ENUM_CPU -- enumerate CPU preset events
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new preset modifiers are mentioned here in the comments for PAPI_enum_cmp_event, but are implemented in PAPI_enum_event, is this a misplacement or should the modifiers work with PAPI_enum_cmp_event?

@@ -503,6 +504,8 @@ enum {
PAPI_PRESET_ENUM_L3, /**< L3 cache related preset events */
PAPI_PRESET_ENUM_TLB, /**< Translation Lookaside Buffer events */
PAPI_PRESET_ENUM_FP, /**< Floating Point related preset events */
PAPI_PRESET_ENUM_CPU, /**< CPU preset events */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would adding modifiers such as PAPI_PRESET_ENUM_COMP and PAPI_PRESET_ENUM_COMP_AVAIL be beneficial too? Similar to what you have done for the CPU presets.

}
}

if ( devIdx > -1 && devIdx < numDevices ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the devIdx < numDevices check needed? I might be overlooking something, but I do not see a case where devIdx is ever greater than numDevices.

if ( devIdx > -1 && devIdx < numDevices ) {
retval = _papi_load_preset_table_component( cname, arch_name, cidx );
if ( retval != PAPI_OK ) {
return PAPI_ENOEVNT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the call to _papi_load_preset_table_component failed, would it fall under PAPI_ENOEVENT? I would think just returning retval would be better in this case.

You could also add a SUBDBG, stating that the preset table failed in the Cuda component.

@@ -98,6 +98,7 @@ extern char **_papi_errlist;
#define DONT_NEED_CONTEXT 0

#define PAPI_EVENTS_IN_DERIVED_EVENT 8
#define PAPI_MAX_COMP_QUALS 8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand, PAPI_MAX_COMP_QUALS is the max number of possible qualifiers a component could have?

new_ntv_name = (char*)malloc(len*sizeof(char));
retval = snprintf(new_ntv_name, len, "%s", _papi_hwi_list[preset_idx].base_name[j]);

if( retval >= len ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if retval < 0, for an encoding error.

* Qualified name and code, however, do get updated. */
len = strlen(_papi_hwi_list[preset_idx].base_name[j]) + 1;
if( qual_str == NULL ) {
new_ntv_name = (char*)malloc(len*sizeof(char));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as below, check if memory was allocated.

new_ntv_name = (char*)malloc(len*sizeof(char));
retval = snprintf(new_ntv_name, len, "%s%s", _papi_hwi_list[preset_idx].base_name[j], qual_str);

if( retval >= len ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if retval < 0, for an encoding error.

papi_return( PAPI_EINVAL );
}

char *qual_str = index(in, ':');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this down to outside the for loop at line 1651, as the first use seems to be inside that for loop.

@@ -6930,6 +6932,87 @@ if test "$with_sysdetect" = "yes"; then
fi
fi

# includes for preset headers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placing here as this is a bug I noticed when running ./configure.

The following line that outputs during configure: checking for components to build... perf_event perf_event_uncore cuda does not show sysdetect, even though it is compiled in. The master branch will show sysdetect.

Create fields in the vector struct for components to define presets.

These changes have been tested on the NVIDIA Hopper architecture.
Add functions to facilitate CUDA presets.

These changes have been tested on the NVIDIA Hopper architecture.
Update configure to track both the number of presets per component and
the arrays of presets belonging to each component.

These changes have been tested on the NVIDIA Hopper architecture.
Updates to framework to facilitate preset events defined by native
events of non-perf_event components.

These changes have been tested on the NVIDIA Hopper architecture.
This is an aesthetic change to improve the development process.

These changes have been tested on the NVIDIA Grace-Hopper architecture.
Replace modifiers with only those that enumerate the CPU preset events.

These changes have been tested on the NVIDIA Grace-Hopper architecture.
Enumerate presets for components as well as the CPU.

These changes have been tested on the NVIDIA Grace-Hopper architecture.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status-work-in-progress PR is still being worked on update-presets PRs related to updating the PAPI presets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants