
[UR][L0] Fix barrier event cleanup on urEventRelease #21340

Open
winstonzhang-intel wants to merge 1 commit into intel:sycl from winstonzhang-intel:urlza-710

Conversation

@winstonzhang-intel
Contributor

Repeated barrier/event-wait submissions can accumulate Level Zero event pools when an event is released before its completion flag is set. In that case, the release path may skip the completed-event cleanup and keep queue-held references alive until later command-list recycling.

Handle UR_COMMAND_EVENTS_WAIT and UR_COMMAND_EVENTS_WAIT_WITH_BARRIER explicitly in urEventRelease: if completion is not yet marked, query the host-visible event status and, when already signaled, mark completed and run the existing cleanup path immediately.

This makes completed barrier/wait events release resources promptly and avoids unbounded zeEventPoolCreate growth in long-running barrier-heavy workloads.


Signed-off-by: Zhang, Winston <winston.zhang@intel.com>

@aravindksg aravindksg left a comment


lgtm

Comment on lines +900 to +915
if (isEventsWaitCommand && !isEventDeleted && !isEventsWaitCompleted) {
  ze_result_t QueryRes = ZE_RESULT_NOT_READY;
  {
    std::shared_lock<ur_shared_mutex> EventLock(Event->Mutex);
    auto HostVisibleEvent = Event->HostVisibleEvent;
    if (checkL0LoaderTeardown() && HostVisibleEvent) {
      QueryRes =
          ZE_CALL_NOCHECK(zeEventQueryStatus, (HostVisibleEvent->ZeEvent));
    }
  }

  if (QueryRes == ZE_RESULT_SUCCESS) {
    isEventsWaitCompleted = true;
    Event->Completed = true;
  }
}
Contributor


Does this fix a memory leak? What happens if the event is not complete by the time zeEventQueryStatus is called? The event leaks?

The way I understand it works, these events are in the command list's EventList, which is then scanned periodically, on subsequent operations. Does the event get stuck there?

I guess I don't understand why this is needed. It seems inherently race'y to me, and all it would do is deallocate the event a little bit earlier.

Contributor Author


This patch fixes a cleanup timing gap: previously, if urEventRelease ran before Event->Completed was set, it skipped the barrier/wait cleanup path, so queue-held references could hang around much longer than intended.
With this change, it now does a quick zeEventQueryStatus; if the event is already done, it marks it complete and cleans up immediately.

In the case that zeEventQueryStatus says not ready, it does not force-delete anything. The event stays owned by queue/command-list tracking and should be reclaimed later by normal recycle/sync paths.

So yes, it can stay in the command list EventList temporarily. The issue this commit addresses is that in barrier-heavy workloads, “temporary” was causing unbounded growth in practice.

In terms of race conditions, this is mostly a best-effort fast path. Worst case it misses completion and cleanup is deferred; it shouldn’t double-free because refcount/cleanup guards still control destruction.
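To make the "best-effort fast path" concrete, here is a minimal self-contained sketch of the release-time check. All names here (ZeResult, queryStatus, tryFastComplete, the Event struct) are illustrative stand-ins for the real adapter types and the zeEventQueryStatus call, not the actual UR API:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Illustrative stand-in for ze_result_t.
enum class ZeResult { Success, NotReady };

struct Event {
  std::shared_mutex Mutex;
  bool Completed = false;
  // Simulated device-side signal state (set asynchronously in reality).
  std::atomic<bool> Signaled{false};
};

// Stand-in for zeEventQueryStatus: a non-blocking status probe.
ZeResult queryStatus(const Event &E) {
  return E.Signaled.load() ? ZeResult::Success : ZeResult::NotReady;
}

// Best-effort fast path, mirroring the patch: if the event already
// signaled, mark it completed so the release path can clean up now;
// otherwise leave it for the normal deferred (command-list recycling)
// cleanup. Missing a just-signaled event is harmless: it only means
// cleanup happens later via the existing path.
bool tryFastComplete(Event &E) {
  ZeResult Res;
  {
    std::shared_lock<std::shared_mutex> Lock(E.Mutex);
    Res = queryStatus(E);
  }
  if (Res == ZeResult::Success) {
    E.Completed = true; // existing cleanup path can now run immediately
    return true;
  }
  return false; // not ready: deferred cleanup still owns the event
}
```

The key property is that the query is purely advisory: a NotReady result changes nothing, so the race the reviewer mentions can only delay cleanup, never corrupt it.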

Contributor


Still not sure if I get it. Let's assume the application does a lot of enqueue event / barrier wait calls. One of the first things any enqueue function does is getAvailableCommandList. Inside of this function, the EventList should be scanned and completed events should get removed. I'd be curious to understand why this doesn't happen for the scenario you are describing. Is this happening with batching or immediate cmd lists?

> In terms of race conditions, this is mostly a best-effort fast path. Worst case it misses completion and cleanup is deferred; it shouldn't double-free because refcount/cleanup guards still control destruction.

Yeah, I agree that there won't be any functional issues from doing this, but there might be a performance overhead of doing the query unnecessarily. What I meant by race'y is that the code is doing an optimistic query about an asynchronous state that can change immediately after the check.

Anyway, if this really solves the problem, I won't insist. But it looks kinda hacky. Maybe I just don't understand the root cause of the issue.

Contributor Author


getAvailableCommandList does cleanup, but it does not always do so immediately.

Immediate cmd lists: cleanup is threshold/prefix-based and can stop at first incomplete event, so completed ones may stay longer.
Regular cmd lists: cleanup mainly happens on fence-signaled list reuse or explicit reset/sync paths; batching can delay that.

So the issue is deferred reclamation under heavy barrier/event-wait traffic, not that cleanup doesn't happen. That's why the event pool grows in the bug I'm trying to address.
The query in urEventRelease is a best-effort shortcut: if the event is already complete, clean up now; otherwise, fall back to the normal deferred cleanup.
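The "can stop at first incomplete event" behavior above can be sketched with a toy model of an EventList scan. TrackedEvent and cleanupPrefix are hypothetical names for illustration, assuming a prefix-based scan in submission order rather than the adapter's exact implementation:

```cpp
#include <cassert>
#include <deque>

// Toy model of a command list's EventList: cleanup walks the list in
// submission order and stops at the first incomplete event, so completed
// events queued behind an incomplete one stay resident until a later scan.
struct TrackedEvent {
  bool Completed;
};

// Reclaims the completed prefix of the list; returns how many events
// were released.
int cleanupPrefix(std::deque<TrackedEvent> &EventList) {
  int Reclaimed = 0;
  while (!EventList.empty() && EventList.front().Completed) {
    EventList.pop_front(); // release the queue-held reference
    ++Reclaimed;
  }
  // The first incomplete event blocks everything after it, even events
  // that have already completed -- which is how completed barrier/wait
  // events can accumulate between scans.
  return Reclaimed;
}
```

Under this model, a steady stream of barriers with one slow event near the front keeps an arbitrarily long tail of completed events alive, matching the unbounded pool growth described above.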

Contributor


I'm still not convinced this addresses the root cause of the issue. The adapter caps the number of pending events at 1024. That is not a big enough number to be noticeable as a memory leak.
However, I agree that this won't break things, except for a potential performance regression. So I'd be OK with merging this, if it fixes the immediate problem.
