
Commit 7a28b7a

[UR][L0v2] add support for batched queue submissions (#19769)
Adds a new feature: batched queue submissions. Batched queues submit operations to the driver in batches, reducing the overhead of submitting every operation individually. Like command buffers in L0v2, they use regular command lists (referred to below as 'batches'). Operations enqueued on a regular command list are not executed immediately, but only after the regular command list is enqueued on an immediate command list. In contrast to command buffers, however, batched queues also handle submission of the batches (regular command lists) themselves, using an internal immediate command list, instead of only collecting enqueued operations.

Batched queues introduce the following:

- batch_manager stores the current batch, a command list manager with an immediate command list for batch submissions, the vector of submitted batches, and the generation number of the current batch.
- The current batch is a command list manager with a regular command list; operations requested by users are enqueued on the current batch. The current batch may be submitted for execution on the immediate command list, replaced by a new regular command list, and stored in the vector of submitted batches until its execution completes.
- The number of regular command lists stored for execution is limited.
- The generation number of the current batch is assigned to events associated with operations enqueued on that batch, and is incremented on every replacement of the current batch. When an event created by a batched queue appears in an eventWaitList, the batch assigned to that event might not have been executed yet, so the event might never be signalled. Comparing generation numbers determines whether the current batch must be submitted for execution: if the generation number of the current batch is higher than the number assigned to the event, the batch associated with the event has already been submitted and no additional submission of the current batch is needed (see the sketch after this list).
- Regular command lists use the regular pool cache type, whereas immediate command lists use the immediate pool cache type. Since user-requested operations are enqueued on regular command lists and immediate command lists are only used internally by the batched queue implementation, events are not created for immediate command lists (in most cases; see below).
- When a user requests the command list manager to enqueue a command buffer, the regular command list from the command buffer is appended to the command list of the given command list manager. Since regular command lists cannot be enqueued on other regular command lists, only on immediate command lists, enqueueing command buffers must be performed on an immediate command list. Therefore, an additional event pool with the immediate cache type is introduced to provide events for operations requested by users and enqueued directly on an immediate command list.
- wait_list_view is modified. Previously, it only stored the waitlist (a ze_event_handle buffer created from events) and the corresponding event count in a single container that could be passed as an argument to the driver API. Now the constructor also ensures that all associated operations will eventually be executed: since regular command lists are not executed immediately, but only after being enqueued on immediate lists, the regular command list associated with a given event must be enqueued, otherwise the event would never be signalled.
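A short sketch of the batch/generation bookkeeping described above. This is illustrative only: the type and member names below are hypothetical and are not taken from the adapter sources.

```cpp
// Illustrative sketch only; names are hypothetical, not the adapter's real API.
#include <cstdint>
#include <memory>
#include <vector>

struct command_list_manager {}; // stand-in: wraps a Level Zero command list

struct batch_manager {
  // Current batch: a command list manager holding a regular command list on
  // which user-requested operations are recorded.
  std::unique_ptr<command_list_manager> currentBatch =
      std::make_unique<command_list_manager>();
  // Internal command list manager with an immediate command list, used only
  // to submit closed batches to the driver.
  command_list_manager immediateSubmitter;
  // Batches already submitted, kept alive until their execution completes.
  // The number of stored batches is limited.
  std::vector<std::unique_ptr<command_list_manager>> submittedBatches;
  // Generation number stamped on events recorded on the current batch;
  // incremented every time the current batch is replaced.
  uint64_t currentGeneration = 0;

  // Close the current batch: enqueue its regular command list on the internal
  // immediate command list, remember it, and start a fresh batch.
  void submitCurrentBatch() {
    // (real code would append currentBatch's regular list to immediateSubmitter)
    submittedBatches.push_back(std::move(currentBatch));
    currentBatch = std::make_unique<command_list_manager>();
    ++currentGeneration;
  }

  // Called when an event produced by this queue appears in an eventWaitList.
  // If the event carries the current generation, its batch has not been
  // submitted yet, so it must be submitted now or the event may never signal.
  void ensureBatchSubmittedFor(uint64_t eventGeneration) {
    if (eventGeneration >= currentGeneration)
      submitCurrentBatch();
    // eventGeneration < currentGeneration: already submitted, nothing to do.
  }
};
```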
Additionally, support for UR_QUEUE_INFO_FLAGS in urQueueGetInfo has been added for Native CPU, which is required by the enqueueTimestampRecording tests. Currently, enqueueTimestampRecording is not supported by batched queues.

Batched queues can be enabled per queue by setting UR_QUEUE_FLAG_SUBMISSION_BATCHED in ur_queue_flags_t, or globally through the environment variable UR_L0_V2_FORCE_BATCHED=1 (see the sketch below). Batched queues are intended to improve performance on platforms where eager submission is not efficient due to driver limitations; such hardware includes Xe (and older) GPUs on Windows. There are also workloads that benefit from batched submissions (e.g., dl-cifar). SYCL graphs should be preferred for new software, since they allow better control over grouped command submissions.

Benchmark results for default in-order queues (sycl branch, commit hash: b76f12e) and batched queues:

- api_overhead_benchmark_ur SubmitKernel, in order: 20.839 μs
- api_overhead_benchmark_ur SubmitKernel, batched: 12.183 μs
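A minimal sketch of how the global override could fold the batched flag into the caller's flags. The helper name is hypothetical; only the flag behaviour (OR-ing the flag in without clearing others) is taken from the description above and from the documentation change below.

```cpp
// Hypothetical helper; the real adapter applies this in its own
// queue-creation path.
#include <cstdlib>
#include <cstring>
#include <ur_api.h>

static ur_queue_flags_t applyForceBatchedOverride(ur_queue_flags_t flags) {
  const char *env = std::getenv("UR_L0_V2_FORCE_BATCHED");
  // "Any": effective when set to any non-null (non-empty) value.
  if (env != nullptr && std::strlen(env) != 0) {
    // The flag is added on top of the user's flags; a caller passing
    // UR_QUEUE_FLAG_SUBMISSION_IMMEDIATE would end up with both bits set,
    // so invalid combinations remain possible, as the documentation warns.
    flags |= UR_QUEUE_FLAG_SUBMISSION_BATCHED;
  }
  return flags;
}
```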
1 parent 5a8910d, commit 7a28b7a

24 files changed: +2379, -471 lines

sycl/doc/EnvironmentVariables.md

Lines changed: 3 additions & 0 deletions
@@ -273,6 +273,7 @@ older hardware or when SYCL_UR_USE_LEVEL_ZERO_V2=0 is set.</span>
 | -------------------- | ------ | ----------- | --------------- |
 | `UR_L0_V2_FORCE_DISABLE_COPY_OFFLOAD` | Integer | By default, copy operations submitted to any queue can be offloaded to dedicated copy engines. Setting this variable instructs the driver to keep all copy operations on the engine behind the original queue. The default value is 0. | V2 |
 | `UR_L0_V2_DISABLE_ZE_LAUNCH_KERNEL_WITH_ARGS` | Integer | By default, `ZeCommandListAppendLaunchKernelWithArguments()` will be called. Setting this variable instructs the adapter to not call `ZeCommandListAppendLaunchKernelWithArguments()` and use the old path using `ZeCommandListAppendLaunchKernel()`. The default value is 0. | V2 |
+| `UR_L0_V2_FORCE_BATCHED` | Any(\*) | Adds UR_QUEUE_FLAG_SUBMISSION_BATCHED flag to the flags passed to urQueueCreate as arguments. The variable does not overwrite other passed flags, therefore invalid combinations (such as setting both UR_QUEUE_FLAG_SUBMISSION_IMMEDIATE and UR_QUEUE_FLAG_SUBMISSION_BATCHED) are possible. | V2 |
 | `SYCL_PI_LEVEL_ZERO_SINGLE_THREAD_MODE` | Integer | A single-threaded app has an opportunity to enable this mode to avoid overhead from mutex locking in the Level Zero adapter. A value greater than 0 enables single thread mode. A value of 0 disables single thread mode. The default is 0. | Legacy |
 | `SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR` | [EnableBuffers][;[MaxPoolSize][;[host\|device\|shared:][MaxPoolableSize][,[Capacity][,SlabMinSize]]]...] | EnableBuffers enables pooling for SYCL buffers, default 1, set to 0 to disable. MaxPoolSize is the maximum size of the pool, by default there is no size limit. MemType is host, device, shared or read_only_shared. Other parameters are values specified as positive integers with optional K, M or G suffix. MaxPoolableSize is the maximum allocation size that may be pooled, default 0 for shared, 2MB for host, 4MB for device and read_only_shared. Capacity is the number of allocations in each size range freed by the program but retained in the pool for reallocation, default 4. Size ranges follow this pattern: 64, 96, 128, 192, and so on, i.e., powers of 2, with one range in between. SlabMinSize is the minimum allocation size, 64KB for host and device, 2MB for shared and read_only_shared. Example: SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR=1;32M;host:1M,4,64K;device:1M,4,64K;shared:0,0,2M| Legacy and V2 |
 | `SYCL_PI_LEVEL_ZERO_BATCH_SIZE` | Integer | Sets a preferred number of compute commands to batch into a command list before executing the command list. A value of 0 causes the batch size to be adjusted dynamically. A value greater than 0 specifies fixed size batching, with the batch size set to the specified value. The default is 0. | Legacy |
@@ -293,6 +294,8 @@ older hardware or when SYCL_UR_USE_LEVEL_ZERO_V2=0 is set.</span>
 | `SYCL_PI_LEVEL_ZERO_USM_RESIDENT` | Integer | Bit-mask controls if/where to make USM allocations resident at the time of allocation. Input value is of the form 0xHSD, where 4-bits of D control device allocations, 4-bits of S control shared allocations, and 4-bits of H control host allocations. Each 4-bit component is holding one of the following values: "0" - then no special residency is forced, "1" - then allocation is made resident at the device of allocation, or "2" - then allocation is made resident on all devices in the context of allocation that have P2P access to the device of allocation. Default is 0x002, i.e. force full residency for device allocations only. | Legacy |
 | `SYCL_PI_LEVEL_ZERO_USE_NATIVE_USM_MEMCPY2D` | Integer | When set to a positive value enables the use of Level Zero USM 2D memory copy operations. Default is 0. | Legacy |
 
+`(*) Note: Any means this environment variable is effective when set to any non-null value.`
+
 ## Debugging variables for CUDA Adapter
 
 :warning: **Warning:** <span style="color:red">the environment variables

unified-runtime/scripts/templates/queue_api.hpp.mako

Lines changed: 2 additions & 1 deletion
@@ -25,8 +25,9 @@ from templates import helper as th
 #pragma once
 
 #include <ur_api.h>
+#include "queue_extensions.hpp"
 
-struct ur_queue_t_ {
+struct ur_queue_t_ : ur_queue_extensions {
   virtual ~ur_queue_t_();
 
 %for obj in th.get_queue_related_functions(specs, n, tags):
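The generated queue interface now derives from ur_queue_extensions. The contents of queue_extensions.hpp are not part of this diff view; the snippet below is a purely hypothetical illustration of the kind of mixin such a base class could provide, not the actual header.

```cpp
// Hypothetical illustration; the real queue_extensions.hpp may differ entirely.
#include <ur_api.h>

struct ur_queue_extensions {
  virtual ~ur_queue_extensions() = default;
  // Example extension point: a hook a batched queue implementation could
  // override to flush pending work, while immediate queues keep the no-op.
  virtual ur_result_t flushPending() { return UR_RESULT_SUCCESS; }
};
```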

unified-runtime/source/adapters/level_zero/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -171,6 +171,7 @@ if(UR_BUILD_ADAPTER_L0_V2)
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/memory.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/lockable.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_api.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_batched.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_immediate_in_order.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_immediate_out_of_order.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/usm.hpp
@@ -187,6 +188,7 @@ if(UR_BUILD_ADAPTER_L0_V2)
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/kernel.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/memory.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_api.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_batched.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_create.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_immediate_in_order.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_immediate_out_of_order.cpp

unified-runtime/source/adapters/level_zero/v2/command_buffer.cpp

Lines changed: 62 additions & 23 deletions
@@ -12,6 +12,7 @@
 #include "../command_buffer_command.hpp"
 #include "../helpers/kernel_helpers.hpp"
 #include "../ur_interface_loader.hpp"
+#include "command_list_manager.hpp"
 #include "logger/ur_logger.hpp"
 #include "queue_handle.hpp"
 
@@ -328,9 +329,12 @@ ur_result_t urCommandBufferAppendKernelLaunchExp(
   auto eventsWaitList = commandBuffer->getWaitListFromSyncPoints(
       syncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendKernelLaunch(
       hKernel, workDim, pGlobalWorkOffset, pGlobalWorkSize, pLocalWorkSize,
-      nullptr, numSyncPointsInWaitList, eventsWaitList,
+      nullptr, waitListView,
       commandBuffer->createEventIfRequested(retSyncPoint)));
 
   return UR_RESULT_SUCCESS;
@@ -353,8 +357,11 @@ ur_result_t urCommandBufferAppendUSMMemcpyExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendUSMMemcpy(
-      false, pDst, pSrc, size, numSyncPointsInWaitList, eventsWaitList,
+      false, pDst, pSrc, size, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
@@ -380,9 +387,12 @@ ur_result_t urCommandBufferAppendMemBufferCopyExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferCopy(
-      hSrcMem, hDstMem, srcOffset, dstOffset, size, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      hSrcMem, hDstMem, srcOffset, dstOffset, size, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -407,9 +417,12 @@ ur_result_t urCommandBufferAppendMemBufferWriteExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferWrite(
-      hBuffer, false, offset, size, pSrc, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      hBuffer, false, offset, size, pSrc, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -432,9 +445,12 @@ ur_result_t urCommandBufferAppendMemBufferReadExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferRead(
-      hBuffer, false, offset, size, pDst, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      hBuffer, false, offset, size, pDst, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -461,10 +477,13 @@ ur_result_t urCommandBufferAppendMemBufferCopyRectExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferCopyRect(
       hSrcMem, hDstMem, srcOrigin, dstOrigin, region, srcRowPitch,
-      srcSlicePitch, dstRowPitch, dstSlicePitch, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      srcSlicePitch, dstRowPitch, dstSlicePitch, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -491,10 +510,12 @@ ur_result_t urCommandBufferAppendMemBufferWriteRectExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferWriteRect(
       hBuffer, false, bufferOffset, hostOffset, region, bufferRowPitch,
-      bufferSlicePitch, hostRowPitch, hostSlicePitch, pSrc,
-      numSyncPointsInWaitList, eventsWaitList,
+      bufferSlicePitch, hostRowPitch, hostSlicePitch, pSrc, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
@@ -522,10 +543,12 @@ ur_result_t urCommandBufferAppendMemBufferReadRectExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferReadRect(
       hBuffer, false, bufferOffset, hostOffset, region, bufferRowPitch,
-      bufferSlicePitch, hostRowPitch, hostSlicePitch, pDst,
-      numSyncPointsInWaitList, eventsWaitList,
+      bufferSlicePitch, hostRowPitch, hostSlicePitch, pDst, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
@@ -548,9 +571,12 @@ ur_result_t urCommandBufferAppendUSMFillExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendUSMFill(
-      pMemory, patternSize, pPattern, size, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      pMemory, patternSize, pPattern, size, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));
   return UR_RESULT_SUCCESS;
 } catch (...) {
   return exceptionToResult(std::current_exception());
@@ -572,9 +598,12 @@ ur_result_t urCommandBufferAppendMemBufferFillExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferFill(
-      hBuffer, pPattern, patternSize, offset, size, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      hBuffer, pPattern, patternSize, offset, size, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -598,8 +627,11 @@ ur_result_t urCommandBufferAppendUSMPrefetchExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendUSMPrefetch(
-      pMemory, size, flags, numSyncPointsInWaitList, eventsWaitList,
+      pMemory, size, flags, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
@@ -622,8 +654,11 @@ ur_result_t urCommandBufferAppendUSMAdviseExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendUSMAdvise(
-      pMemory, size, advice, numSyncPointsInWaitList, eventsWaitList,
+      pMemory, size, advice, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
@@ -672,15 +707,19 @@ ur_result_t urCommandBufferAppendNativeCommandExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);
 
-  UR_CALL(commandListLocked->appendEventsWaitWithBarrier(
-      numSyncPointsInWaitList, eventsWaitList, nullptr));
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
+  UR_CALL(
+      commandListLocked->appendEventsWaitWithBarrier(waitListView, nullptr));
 
   // Call user-defined function immediately
   pfnNativeCommand(pData);
 
+  wait_list_view emptyWaitList = wait_list_view(nullptr, 0);
   // Barrier on all commands after user defined commands.
   UR_CALL(commandListLocked->appendEventsWaitWithBarrier(
-      0, nullptr, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      emptyWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
 
   return UR_RESULT_SUCCESS;
 }
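Every append call above now takes a wait_list_view built from the UR events and their count. A rough sketch of the behaviour described in the commit message follows; member and helper names are assumptions, not the adapter's real definitions. The key point is that constructing the view is no longer a pure repackaging step: it must also make sure the regular command list behind each wait event gets submitted, otherwise the driver would wait on an event that can never signal.

```cpp
// Sketch under stated assumptions; getZeEvent() and ensureBatchSubmitted()
// are hypothetical stand-ins for whatever the adapter really does.
#include <cstdint>
#include <vector>
#include <level_zero/ze_api.h>

struct ur_event_handle_t_ {
  ze_event_handle_t zeEvent = nullptr;
  bool producingBatchSubmitted = false;
};

static ze_event_handle_t getZeEvent(ur_event_handle_t_ *event) {
  return event->zeEvent;
}

static void ensureBatchSubmitted(ur_event_handle_t_ *event) {
  // In the real adapter this would enqueue the regular command list that
  // signals this event on the queue's internal immediate command list,
  // if it has not been submitted yet.
  event->producingBatchSubmitted = true;
}

struct wait_list_view {
  std::vector<ze_event_handle_t> handles; // buffer passed to the driver API

  wait_list_view(ur_event_handle_t_ *const *events, uint32_t count) {
    handles.reserve(count);
    for (uint32_t i = 0; i < count; ++i) {
      ensureBatchSubmitted(events[i]); // new responsibility of the constructor
      handles.push_back(getZeEvent(events[i]));
    }
  }

  uint32_t numWaitEvents() const {
    return static_cast<uint32_t>(handles.size());
  }
  ze_event_handle_t *waitList() {
    return handles.empty() ? nullptr : handles.data();
  }
};
```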
