
[UR][L0] Fix barrier event cleanup on urEventRelease #21340

Open
winstonzhang-intel wants to merge 1 commit into intel:sycl from winstonzhang-intel:urlza-710

Conversation

@winstonzhang-intel
Contributor

Repeated barrier/event-wait submissions can accumulate Level Zero event pools when an event is released before its completion flag is set. In that case, the release path may skip the completed-event cleanup and keep queue-held references alive until later command-list recycling.

Handle UR_COMMAND_EVENTS_WAIT and UR_COMMAND_EVENTS_WAIT_WITH_BARRIER explicitly in urEventRelease: if completion is not yet marked, query the host-visible event status and, when already signaled, mark completed and run the existing cleanup path immediately.

This makes completed barrier/wait events release resources promptly and avoids unbounded zeEventPoolCreate growth in long-running barrier-heavy workloads.


Signed-off-by: Zhang, Winston <winston.zhang@intel.com>

@aravindksg aravindksg left a comment


lgtm

Comment on lines +900 to +915
if (isEventsWaitCommand && !isEventDeleted && !isEventsWaitCompleted) {
  ze_result_t QueryRes = ZE_RESULT_NOT_READY;
  {
    std::shared_lock<ur_shared_mutex> EventLock(Event->Mutex);
    auto HostVisibleEvent = Event->HostVisibleEvent;
    if (checkL0LoaderTeardown() && HostVisibleEvent) {
      QueryRes =
          ZE_CALL_NOCHECK(zeEventQueryStatus, (HostVisibleEvent->ZeEvent));
    }
  }

  if (QueryRes == ZE_RESULT_SUCCESS) {
    isEventsWaitCompleted = true;
    Event->Completed = true;
  }
}
Contributor


Does this fix a memory leak? What happens if the event is not complete by the time zeEventQueryStatus is called? The event leaks?

The way I understand it works, these events are in the command list's EventList, which is then scanned periodically, on subsequent operations. Does the event get stuck there?

I guess I don't understand why this is needed. It seems inherently race'y to me, and all it would do is deallocate the event a little bit earlier.

Contributor Author


This patch fixes a cleanup timing gap: previously, if urEventRelease ran before Event->Completed was set, it skipped the barrier/wait cleanup path, so queue-held references could hang around much longer than intended.
With this change, it now does a quick zeEventQueryStatus; if the event is already done, it marks it complete and cleans up immediately.

In the case that zeEventQueryStatus says not ready, it does not force-delete anything. The event stays owned by queue/command-list tracking and should be reclaimed later by normal recycle/sync paths.

So yes, it can stay in the command list EventList temporarily. The issue this commit addresses is that in barrier-heavy workloads, “temporary” was causing unbounded growth in practice.

In terms of race conditions, this is mostly a best-effort fast path. Worst case it misses completion and cleanup is deferred; it shouldn’t double-free because refcount/cleanup guards still control destruction.
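To make the "best-effort fast path" concrete, here is a minimal self-contained sketch of the release-time check. All names here (ZeResult, queryStatus, tryFastComplete, the Event struct) are illustrative stand-ins for the real adapter types and the zeEventQueryStatus call, not the actual UR API:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Illustrative stand-in for ze_result_t.
enum class ZeResult { Success, NotReady };

struct Event {
  std::shared_mutex Mutex;
  bool Completed = false;
  // Simulated device-side signal state (set asynchronously in reality).
  std::atomic<bool> Signaled{false};
};

// Stand-in for zeEventQueryStatus: a non-blocking status probe.
ZeResult queryStatus(const Event &E) {
  return E.Signaled.load() ? ZeResult::Success : ZeResult::NotReady;
}

// Best-effort fast path, mirroring the patch: if the event already
// signaled, mark it completed so the release path can clean up now;
// otherwise leave it for the normal deferred (command-list recycling)
// cleanup. Missing a just-signaled event is harmless: it only means
// cleanup happens later via the existing path.
bool tryFastComplete(Event &E) {
  ZeResult Res;
  {
    std::shared_lock<std::shared_mutex> Lock(E.Mutex);
    Res = queryStatus(E);
  }
  if (Res == ZeResult::Success) {
    E.Completed = true; // existing cleanup path can now run immediately
    return true;
  }
  return false; // not ready: deferred cleanup still owns the event
}
```

The key property is that the query is purely advisory: a NotReady result changes nothing, so the race the reviewer mentions can only delay cleanup, never corrupt it.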

Contributor


Still not sure if I get it. Let's assume the application does a lot of enqueue event / barrier wait calls. One of the first things any enqueue function does is getAvailableCommandList. Inside of this function, the EventList should be scanned and completed events should get removed. I'd be curious to understand why this doesn't happen for the scenario you are describing. Is this happening with batching or immediate cmd lists?

> In terms of race conditions, this is mostly a best-effort fast path. Worst case it misses completion and cleanup is deferred; it shouldn't double-free because refcount/cleanup guards still control destruction.

Yeah, I agree that there won't be any functional issues from doing this, but there might be a performance overhead of doing the query unnecessarily. What I meant by race'y is that the code is doing an optimistic query about an asynchronous state that can change immediately after the check.

Anyway, if this really solves the problem, I won't insist. But it looks kinda hacky. Maybe I just don't understand the root cause of the issue.

Contributor Author


getAvailableCommandList does cleanup, but it does not always do so immediately.

Immediate cmd lists: cleanup is threshold/prefix-based and can stop at first incomplete event, so completed ones may stay longer.
Regular cmd lists: cleanup mainly happens on fence-signaled list reuse or explicit reset/sync paths; batching can delay that.

So the issue is deferred reclamation under heavy barrier/event-wait traffic, not that cleanup doesn't happen. That's why the event pool grows in the bug I'm trying to address.
The query in urEventRelease is a best-effort shortcut: if the event is already complete, clean up now; otherwise, fall back to the normal deferred cleanup.
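The "can stop at first incomplete event" behavior above can be sketched with a toy model of an EventList scan. TrackedEvent and cleanupPrefix are hypothetical names for illustration, assuming a prefix-based scan in submission order rather than the adapter's exact implementation:

```cpp
#include <cassert>
#include <deque>

// Toy model of a command list's EventList: cleanup walks the list in
// submission order and stops at the first incomplete event, so completed
// events queued behind an incomplete one stay resident until a later scan.
struct TrackedEvent {
  bool Completed;
};

// Reclaims the completed prefix of the list; returns how many events
// were released.
int cleanupPrefix(std::deque<TrackedEvent> &EventList) {
  int Reclaimed = 0;
  while (!EventList.empty() && EventList.front().Completed) {
    EventList.pop_front(); // release the queue-held reference
    ++Reclaimed;
  }
  // The first incomplete event blocks everything after it, even events
  // that have already completed -- which is how completed barrier/wait
  // events can accumulate between scans.
  return Reclaimed;
}
```

Under this model, a steady stream of barriers with one slow event near the front keeps an arbitrarily long tail of completed events alive, matching the unbounded pool growth described above.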

Contributor


I'm still not convinced this addresses the root cause of the issue. The adapter caps the number of pending events at 1024. That is not a big enough number to be noticeable as a memory leak.
However, I agree that this won't break things, except for a potential performance regression. So I'd be OK with merging this, if it fixes the immediate problem.
