[UR][L0] Fix barrier event cleanup on urEventRelease#21340
[UR][L0] Fix barrier event cleanup on urEventRelease#21340winstonzhang-intel wants to merge 1 commit intointel:syclfrom
Conversation
Repeated barrier/event-wait submissions can accumulate Level Zero event pools when an event is released before its completion flag is set. In that case, the release path may skip the completed-event cleanup and keep queue-held references alive until later command-list recycling. Handle UR_COMMAND_EVENTS_WAIT and UR_COMMAND_EVENTS_WAIT_WITH_BARRIER explicitly in urEventRelease: if completion is not yet marked, query the host-visible event status and, when already signaled, mark completed and run the existing cleanup path immediately. This makes completed barrier/wait events release resources promptly and avoids unbounded zeEventPoolCreate growth in long-running barrier-heavy workloads. Signed-off-by: Zhang, Winston <winston.zhang@intel.com>
1eac543 to
0eb6d24
Compare
| if (isEventsWaitCommand && !isEventDeleted && !isEventsWaitCompleted) { | ||
| ze_result_t QueryRes = ZE_RESULT_NOT_READY; | ||
| { | ||
| std::shared_lock<ur_shared_mutex> EventLock(Event->Mutex); | ||
| auto HostVisibleEvent = Event->HostVisibleEvent; | ||
| if (checkL0LoaderTeardown() && HostVisibleEvent) { | ||
| QueryRes = | ||
| ZE_CALL_NOCHECK(zeEventQueryStatus, (HostVisibleEvent->ZeEvent)); | ||
| } | ||
| } | ||
|
|
||
| if (QueryRes == ZE_RESULT_SUCCESS) { | ||
| isEventsWaitCompleted = true; | ||
| Event->Completed = true; | ||
| } | ||
| } |
There was a problem hiding this comment.
Does this fix a memory leak? What happens if the event is not complete by the time zeEventQueryStatus is called? The event leaks?
The way I understand it works, these events are in the command list's EventList, which is then scanned periodically, on subsequent operations. Does the event get stuck there?
I guess I don't understand why this is needed. It seems inherently race'y to me, and all it would do is deallocate the event a little bit earlier.
There was a problem hiding this comment.
This patch fixes a cleanup timing gap. Namely, before if urEventRelease ran before Event->Completed was set, it skipped the barrier/wait cleanup path, so queue-held refs could hang around much longer than intended.
With my change, now it does a quick zeEventQueryStatus; if the event is already done, it marks complete and cleans up immediately.
In the case that zeEventQueryStatus says not ready, it does not force-delete anything. The event stays owned by queue/command-list tracking and should be reclaimed later by normal recycle/sync paths.
So yes, it can stay in the command list EventList temporarily. The issue this commit addresses is that in barrier-heavy workloads, “temporary” was causing unbounded growth in practice.
In terms of race conditions, this is mostly a best-effort fast path. Worst case it misses completion and cleanup is deferred; it shouldn’t double-free because refcount/cleanup guards still control destruction.
There was a problem hiding this comment.
Still not sure if I get it. Let's assume the application does a lot of enqueue event / barrier wait calls. One of the first things any enqueue function does is getAvailableCommandList. Inside of this function, the EventList should be scanned and completed events should get removed. I'd be curious to understand why this doesn't happen for the scenario you are describing. Is this happening with batching or immediate cmd lists?
In terms of race conditions, this is mostly a best-effort fast path. Worst case it misses completion and cleanup is deferred; it shouldn’t double-free because refcount/cleanup guards still control destruction.
Yeah, I agree that there won't be any functional issues from doing this, but there might be a performance overhead of doing the query unnecessarily. What I meant by race'y is that the code is doing an optimistic query about an asynchronous state that can change immediately after the check.
Anyway, if this really solves the problem, I won't insist. But it looks kinda hacky. Maybe I just don't understand the root cause of the issue.
There was a problem hiding this comment.
getAvailableCommandList does cleanup, but it does not always do so immediately.
Immediate cmd lists: cleanup is threshold/prefix-based and can stop at first incomplete event, so completed ones may stay longer.
Regular cmd lists: cleanup mainly happens on fence-signaled list reuse or explicit reset/sync paths; batching can delay that.
So the issue is deferred reclamation under heavy barrier/event-wait traffic, not that cleanup doesnt happen. Thats why we see the event pool grow and differ in the bug im trying to address.
The query in urEventRelease is a best-effort shortcut. ie, if already complete, clean now, otherwise, fall back to normal deferred cleanup.
There was a problem hiding this comment.
I'm still not convinced this addresses the root cause of the issue. The adapter caps the number of pending events at 1024. That is not a big enough number to be noticeable as a memory leak.
However, I agree that this won't break things, except for a potential performance regression. So I'd be OK with merging this, if it fixes the immediate problem.
Repeated barrier/event-wait submissions can accumulate Level Zero event pools when an event is released before its completion flag is set. In that case, the release path may skip the completed-event cleanup and keep queue-held references alive until later command-list recycling.
Handle UR_COMMAND_EVENTS_WAIT and UR_COMMAND_EVENTS_WAIT_WITH_BARRIER explicitly in urEventRelease: if completion is not yet marked, query the host-visible event status and, when already signaled, mark completed and run the existing cleanup path immediately.
This makes completed barrier/wait events release resources promptly and avoids unbounded zeEventPoolCreate growth in long-running barrier-heavy workloads.