Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set the right HIP device before creating base event counter #2276

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rafbiels
Copy link
Contributor

@rafbiels rafbiels commented Nov 1, 2024

Without any default device in the current thread, all base events were associated with device 0, causing failures when used on other devices. Fix this by calling hipSetDevice before recording the event.

This issue was reported by a user who was running on a system with two AMD GPUs and tried to do the following:

#include <sycl/sycl.hpp>

int main() {
  auto Devs = sycl::device::get_devices(sycl::info::device_type::gpu);
  std::vector<sycl::queue> Queues;
  for (auto D : Devs) {
    Queues.push_back(sycl::queue{D,sycl::property::queue::enable_profiling{}});
  }
}

Resulting in

UR HIP ERROR:
	Value:           400
	Name:            hipErrorInvalidHandle
	Description:     invalid resource handle
	Function:        getElapsedTime
	Source Location: _deps/unified-runtime-src/source/adapters/hip/device.cpp:31

in the constructor of the second queue.

intel/llvm PR: intel/llvm#15964

Without any default device in the current thread, all base events
were associated with device 0, causing failures when used on other
devices. Fix this by calling hipSetDevice before recording the event.
@rafbiels rafbiels requested a review from a team as a code owner November 1, 2024 16:44
@github-actions github-actions bot added the hip HIP adapter specific issues label Nov 1, 2024
@JackAKirk
Copy link
Contributor

This is OK but a downside is that this sets a process on every GPU at platform creation even if the user doesn't use them. This has a one-off overhead in performance but more importantly may lead to memory leak issues in certain cases.
I think the following should work, and is better since it avoids this and also uses the same base event across devices (which is technically required by the spec).
Something like

              hipEvent_t EvBase;
for (auto i = 0u; i < static_cast<uint32_t>(NumDevices); ++i) {
...
if (i == 0)
              UR_CHECK_ERROR(hipEventCreate(&EvBase));

// then use same EvBase to construct each device

The above does work with cuda (see e.g. #2077) and I imagine it also works with hip.
We don't do this in cuda in the current impl but the hip backend is currently implemented in a way where it is more natural to do this.

@rafbiels
Copy link
Contributor Author

rafbiels commented Nov 4, 2024

Thanks for checking @JackAKirk! Unfortunately, I cannot get your suggestion to work. Here's a pure HIP reproducer with what I think you suggested:

#include <hip/hip_runtime.h>
#include <iostream>
#include <string>

int s_errors{0};

void check(hipError_t res, std::string fname) {
  if (res!=hipSuccess) {
    ++s_errors;
    const char *ErrorString = hipGetErrorString(res);
    const char *ErrorName = hipGetErrorName(res);
    std::cout << "HIP error in function " << fname
              << "\nvalue: " << res
              << "\nname: " << ErrorName
              << "\ndescription: " << ErrorString
              << std::endl;
  }
}

float getTime(hipEvent_t& EvBase, int Device) {
  std::cout << "called getTime for device " << Device << std::endl;
  check(hipSetDevice(Device),"hipSetDevice");
  hipEvent_t Event;
  float Milliseconds{0.0f};
  check(hipEventCreateWithFlags(&Event, hipEventDefault),"hipEventCreateWithFlags");
  check(hipEventRecord(Event),"hipEventRecord");
  check(hipEventSynchronize(EvBase),"hipEventSynchronize");
  check(hipEventSynchronize(Event),"hipEventSynchronize");
  check(hipEventElapsedTime(&Milliseconds, EvBase, Event),"hipEventElapsedTime");
  return Milliseconds;
}

int main() {
  hipEvent_t EvBase;
  check(hipEventCreate(&EvBase),"hipEventCreate");
  check(hipEventRecord(EvBase, 0),"hipEventRecord");

  float time0 = getTime(EvBase, 0);
  float time1 = getTime(EvBase, 1);

  std::cout << "time0: " << time0 << " ms\n"
            << "time1: " << time1 << " ms\n"
            << "errors: " << s_errors << std::endl;

  return 0;
}

This prints:

called getTime for device 0
called getTime for device 1
HIP error in function hipEventElapsedTime
value: 400
name: hipErrorInvalidHandle
description: invalid resource handle
time0: 0.03328 ms
time1: 0 ms
errors: 1

whereas replacing main() with the following which represents the current version of this PR:

int main() {
  hipEvent_t EvBase0;
  check(hipSetDevice(0),"hipSetDevice");
  check(hipEventCreate(&EvBase0),"hipEventCreate");
  check(hipEventRecord(EvBase0, 0),"hipEventRecord");

  hipEvent_t EvBase1;
  check(hipSetDevice(1),"hipSetDevice");
  check(hipEventCreate(&EvBase1),"hipEventCreate");
  check(hipEventRecord(EvBase1, 0),"hipEventRecord");

  float time0 = getTime(EvBase0, 0);
  float time1 = getTime(EvBase1, 1);

  std::cout << "time0: " << time0 << " ms\n"
            << "time1: " << time1 << " ms\n"
            << "errors: " << s_errors << std::endl;

  return 0;
}

works fine and prints:

called getTime for device 0
called getTime for device 1
time0: 208.634 ms
time1: 0.08784 ms
errors: 0

It seems to me like hipEventElapsedTime requires the two events to be recorded with the same device context, even though the documentation doesn't say anything about this.

@JackAKirk
Copy link
Contributor

JackAKirk commented Nov 4, 2024

I think the problem may be that you are not setting the device before calling event create/record in

int main() {
  hipEvent_t EvBase;
  check(hipEventCreate(&EvBase),"hipEventCreate");
  check(hipEventRecord(EvBase, 0),"hipEventRecord");

  float time0 = getTime(EvBase, 0);
  float time1 = getTime(EvBase, 1);

  std::cout << "time0: " << time0 << " ms\n"
            << "time1: " << time1 << " ms\n"
            << "errors: " << s_errors << std::endl;

  return 0;
}

It may be that hipEventCreate implicitly sets device zero, in which case you are probably right that hip doesn't support events across devices like cuda cuContext does. But would be good to check this.

@rafbiels
Copy link
Contributor Author

rafbiels commented Nov 4, 2024

After adding hipSetDevice(0) before hipEventCreate(&EvBase) I still get the error.

int main() {
  hipEvent_t EvBase;
  check(hipSetDevice(0),"hipSetDevice");
  check(hipEventCreate(&EvBase),"hipEventCreate");
  check(hipEventRecord(EvBase, 0),"hipEventRecord");

  float time0 = getTime(EvBase, 0);
  float time1 = getTime(EvBase, 1);

  std::cout << "time0: " << time0 << " ms\n"
            << "time1: " << time1 << " ms\n"
            << "errors: " << s_errors << std::endl;

  return 0;
}
called getTime for device 0
called getTime for device 1
HIP error in function hipEventElapsedTime
value: 400
name: hipErrorInvalidHandle
description: invalid resource handle
time0: 0.03088 ms
time1: 0 ms
errors: 1

@JackAKirk
Copy link
Contributor

After adding hipSetDevice(0) before hipEventCreate(&EvBase) I still get the error.

int main() {
  hipEvent_t EvBase;
  check(hipSetDevice(0),"hipSetDevice");
  check(hipEventCreate(&EvBase),"hipEventCreate");
  check(hipEventRecord(EvBase, 0),"hipEventRecord");

  float time0 = getTime(EvBase, 0);
  float time1 = getTime(EvBase, 1);

  std::cout << "time0: " << time0 << " ms\n"
            << "time1: " << time1 << " ms\n"
            << "errors: " << s_errors << std::endl;

  return 0;
}
called getTime for device 0
called getTime for device 1
HIP error in function hipEventElapsedTime
value: 400
name: hipErrorInvalidHandle
description: invalid resource handle
time0: 0.03088 ms
time1: 0 ms
errors: 1

OK, thanks for checking!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hip HIP adapter specific issues ready to merge Added to PR's which are ready to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants