Commit bf71ea5

add more capture replay controls (#337)
* minor changes to kernel arg maps
* add more capture replay controls
* simplify capture replay controls
* move image metadata capturing
* fix capture replay scripts
* fix CL_PROGRAM_BINARIES query
* verified image capture and playback is working
* fix copyright date after rebase
* fix docs and tidy up a few more things
* remove stale comment
* disable logging in several cases when capture is skipped
  These were a little too verbose in common cases.
* move buffer and image dumping for replay back into replay directory
1 parent ed72581 commit bf71ea5

8 files changed: +895 -682 lines changed

docs/capture_single_kernels.md

Lines changed: 33 additions & 24 deletions
@@ -17,52 +17,61 @@ To replay the captured kernels, you will need the following Python packages:

 ## Step by Step for Automatic Capturing

-* Set one of the two controls:
-    * `DumpReplayKernelName`, if you want to capture a kernel by its name.
-    * `DumpReplayKernelEnqueue`, if you want to capture a kernel by its enqueue number.
-* Then, simply run the program as usual!
-    * Example on Linux: `CLI_DumpReplayKernelName=${NameOfKernel} cliloader /path/to/executable`
+1. Set the top-level control to enable kernel capturing and replay: `CaptureReplay`
+2. Set any additional controls to capture a specific range of kernels, or specific kernel names. For example:
+    * `CaptureReplayMinEnqueue` and `CaptureReplayMaxEnqueue`, to capture a specific range of kernel enqueues.
+    * `CaptureReplayKernelName`, to capture a specific kernel name.
+    * `CaptureReplayUniqueKernels`, to capture only unique kernel and dispatch parameter combinations.
+    * `CaptureReplayNumKernelEnqueuesSkip`, to skip initial captures.
+    * `CaptureReplayNumKernelEnqueuesCapture`, to capture a limited number of kernel enqueues.
+3. Then, simply run the program as usual!
+
+For more details, please see the Capture and Replay Controls section in the [controls](controls.md) documentation.

 ## Step by Step for Automatic Capturing and Validation

-* Copy the [capture_and_validate.py](../scripts/capture_and_validate.py) script to the place where you run the app from.
-    * Not strictly necessary, but makes life easier.
-* Run this script with the following arguments:
-    - One of `--num EnqueueNumberToBeCaptured` or `--name NameOfKernelToBeCaptured`
-    - `-cli "/path/to/cliloader"`
-    - `--p "/path/to/program"`
-    - `--a ArgsForProgram`
+Use the [capture_and_validate.py](../scripts/capture_and_validate.py) script to capture a workload and validate that the replayed results match.
+
+Arguments for the capture and validate script are:

-Please make sure to follow this order of arguments!
+* `-c` or `--cliloader`: Path to `cliloader`. This can be a full path, or a relative path, or just `cliloader` if `cliloader` is already in the system path.
+* `-p` or `--program`: The command to execute the program to capture.
+* `-a` or `--args`: Any optional arguments to pass to the program to capture.
+* Either one of:
+    * `-k` or `--kernel_name`: The kernel name to capture.
+    * `-n` or `--enqueue_number`: The enqueue number that should be captured.

-This will then run the program using `cliloader` with the given arguments, capture the the specified kernel, and verify that the buffers calculated by the standalone replay agree with the buffers calculated by the original program.
+The capture and validate script will then run the program using `cliloader` with the given arguments to capture the the specified kernel or enqueue number.
+The script will then verify that the buffers calculated by the standalone replay agree with the buffers calculated by the original program.
 If the buffers don't agree, it will show a message in the terminal.

 ## Supported Features

 * OpenCL Buffers
     * These may be aliased, then only one buffer is used.
         * Only true if the buffers use the same memory address, so not when using sub-buffers and having offsets.
-* `__local` kernel arguments, i.e. those set by `clSetKernelArg(kernel, arg_index, local_size, nullptr)`.
+* `__local` kernel arguments, i.e. those set by `clSetKernelArg(kernel, arg_index, local_size, NULL)`.
 * Device only buffers, i.e. those with `CL_MEM_HOST_NO_ACCESS`. When kernel capture is enabled, any device-only access flags are removed.
 * OpenCL Images
+    * 2D, and 3D images are supported.
 * OpenCL Samplers
-* Build/replay from source
-* Build/replay from a device binary
+* OpenCL Kernels from source or IL
+* OpenCL Kernels from device binary

 ## Limitations (incomplete)

-* Does not work with OpenCL pipes
-* Untested for out-of-order queues
-* Sub-buffers are not dealt with explicitly, this may affect the results for both debugging and performance
-* The capture and validate script doesn't work with GUI apps
+* Does not work with OpenCL SVM or USM.
+* Does not work with OpenCL pipes.
+* Untested for out-of-order queues.
+* Sub-buffers are not dealt with explicitly, this may affect the results for both debugging and performance.
+* The capture and validate script may not work with some GUI apps.

 ## Advice

-* Use the following environment variables for `pyopencl`: `PYOPENCL_NO_CACHE=1` and `PYOPENCL_COMPILER_OUTPUT=1`
-* Minimize usage of other controls, to prevent unexpected behavior.
+* Use the following environment variables for `pyopencl`: `PYOPENCL_NO_CACHE=1` and `PYOPENCL_COMPILER_OUTPUT=1`.
+* Minimize usage of other controls, to prevent unexpected behavior, however:
     * Consider enabling `InitializeBuffers` for more predictable results between runs.
-    * Only set one of `DumpReplayKernelName` and `DumpReplayKernelEnqueue`.
+    * When executing the capture and validate script consider removing any other kernel captures, or verifying that the validate script is using the correct capture.
 * Always make sure to check if your results make sense.
 * For some apps using `cliloader` doesn't work properly. If this happens for your application, please try other [install](install.md) options.
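
The capture steps in the updated documentation could be scripted roughly as follows. This is a minimal sketch, not part of the commit: it assumes the new controls are set through environment variables with the same `CLI_` prefix that the removed Linux example used for `CLI_DumpReplayKernelName`, and the helper names `capture_env` and `run_with_capture` are hypothetical.

```python
import os
import subprocess


def capture_env(kernel_name=None, min_enqueue=None, max_enqueue=None, base_env=None):
    """Return a copy of the environment with the capture controls set.

    Assumption: controls are read from CLI_-prefixed environment variables,
    as in the removed `CLI_DumpReplayKernelName=...` example.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CLI_CaptureReplay"] = "1"  # top-level control
    if kernel_name is not None:
        env["CLI_CaptureReplayKernelName"] = kernel_name
    if min_enqueue is not None:
        env["CLI_CaptureReplayMinEnqueue"] = str(min_enqueue)
    if max_enqueue is not None:
        env["CLI_CaptureReplayMaxEnqueue"] = str(max_enqueue)
    return env


def run_with_capture(command, **controls):
    """Run a command such as ["cliloader", "/path/to/executable"] with capture enabled."""
    return subprocess.run(command, env=capture_env(**controls), check=False)
```

For example, `run_with_capture(["cliloader", "/path/to/executable"], kernel_name="my_kernel")` would launch the program with only the top-level control and the kernel-name filter set.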

docs/controls.md

Lines changed: 30 additions & 8 deletions
@@ -477,14 +477,6 @@ If set to a nonzero value, the Intercept Layer for OpenCL Applications will dump

 If set to a nonzero value, the Intercept Layer for OpenCL Applications will dump kernel ISA binaries for every kernel, if supported. Currently, kernel ISA binaries are only supported for Intel GPU devices. Kernel ISA binaries can be decoded into ISA text with a disassembler. The filename will have the form "CLI\_\<Program Number\>\_\<Unique Program Hash Code\>\_\<Compile Count\>\_\<Unique Build Options Hash Code\>\_\<Device Type\>\_\<Kernel Name\>.isabin".

-##### `DumpReplayKernelEnqueue` (int)
-
-If set to a positive value, the Intercept Layer for OpenCL Applications will dump in /Replay/Enqueue\_*/ a standalone (i.e. runs completely independent from the original program from which is was captured) playable set of files for the specified enqueue number which can be used for debugging or profiling. When a program was build from source code, it will dump that one, otherwise it will dump the device binary. It is advised to not use this setting directly, but use /scripts/capture\_and\_validate.py.
-
-##### `DumpReplayKernelName` (string)
-
-If set, the Intercept Layer for OpenCL Applications for dump the specified kernel the first time it is encountered so that it can be replayed independently. It is advised to not use this setting directly, but use /scripts/capture\_and\_validate.py
-
 ### Controls for Emulating Features

 ##### `Emulate_cl_khr_extended_versioning` (bool)
@@ -613,6 +605,36 @@ If set to a nonzero value, the Intercept Layer for OpenCL Applications will try

 If set to a nonzero value, the Intercept Layer for OpenCL Applications will try to automatically partition parent devices into sub-devices with the specified number of compute units.

+### Capture and Replay Controls
+
+##### `CaptureReplay` (bool)
+
+This is the top-level control for kernel capture and replay.
+
+##### `CaptureReplayMinEnqueue` (cl_uint)
+
+The Intercept Layer for OpenCL Applications will only enable kernel capture and replay when the enqueue counter is greater than this value, inclusive.
+
+##### `CaptureReplayMaxEnqueue` (cl_uint)
+
+The Intercept Layer for OpenCL Applications will stop kernel capture and replay when the encounter is greater than this value, meaning that only enqueues less than this value, inclusive, will be captured.
+
+##### `CaptureReplayKernelName` (string)
+
+If set, the Intercept Layer for OpenCL Applications will only enable kernel capture and replay when the kernel name equals this name.
+
+##### `CaptureReplayUniqueKernels` (bool)
+
+If set, the Intercept Layer for OpenCL Applications will only enable kernel capture and replay if the kernel signature (i.e. hash + kernelname) has not been seen already.
+
+##### `CaptureReplayNumKernelEnqueuesSkip` (cl_uint)
+
+The Intercept Layer for OpenCL Applications will skip this many kernel enqueues before enabling kernel capture and replay.
+
+##### `CaptureReplayNumKernelEnqueuesCapture` (cl_uint)
+
+The Intercept Layer for OpenCL Applications will only capture this many kernel enqueues.
+
 ### AubCapture Controls

 ##### `AubCapture` (bool)
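
One way to read the capture and replay controls added above is as a chain of filters applied to each kernel enqueue. The sketch below is an illustration only: the class `CaptureFilter`, its defaults, and above all the order in which the filters are combined are assumptions, not taken from the Intercept Layer sources. Only the individual rules (inclusive enqueue range, exact name match, unseen hash plus name signature, skip count, capture limit) come from the control descriptions.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class CaptureFilter:
    """Hypothetical model of the CaptureReplay* controls."""
    enabled: bool = False                 # CaptureReplay
    min_enqueue: int = 0                  # CaptureReplayMinEnqueue (inclusive)
    max_enqueue: int = 2**32 - 1          # CaptureReplayMaxEnqueue (inclusive)
    kernel_name: Optional[str] = None     # CaptureReplayKernelName
    unique_kernels: bool = False          # CaptureReplayUniqueKernels
    num_skip: int = 0                     # CaptureReplayNumKernelEnqueuesSkip
    num_capture: int = 2**32 - 1          # CaptureReplayNumKernelEnqueuesCapture
    _seen: Set[Tuple[str, str]] = field(default_factory=set, init=False)
    _skipped: int = field(default=0, init=False)
    _captured: int = field(default=0, init=False)

    def should_capture(self, enqueue_counter: int, kernel_name: str, kernel_hash: str) -> bool:
        if not self.enabled:
            return False
        if not (self.min_enqueue <= enqueue_counter <= self.max_enqueue):
            return False
        if self.kernel_name is not None and kernel_name != self.kernel_name:
            return False
        if self.unique_kernels:
            signature = (kernel_hash, kernel_name)
            if signature in self._seen:
                return False
            self._seen.add(signature)
        # Assumed ordering: the skip count applies to enqueues that passed the filters above.
        if self._skipped < self.num_skip:
            self._skipped += 1
            return False
        if self._captured >= self.num_capture:
            return False
        self._captured += 1
        return True
```

With `enabled=True, min_enqueue=2, max_enqueue=4, num_skip=1, num_capture=1`, this model would capture only enqueue 3: enqueue 2 is consumed by the skip count and enqueue 4 exceeds the capture limit.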

intercept/scripts/run.py

Lines changed: 55 additions & 35 deletions
@@ -1,7 +1,8 @@
-
+#
 # Copyright (c) 2023-2024 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
+#

 import numpy as np
 import pyopencl as cl
@@ -18,9 +19,16 @@ def get_image_metadata(idx: int):
     with open(filename) as metadata:
         lines = metadata.readlines()

-    shape = [int(lines[0]),
-             int(lines[1]),
-             int(lines[2])]
+    image_type = int(lines[8])
+    if image_type in [cl.mem_object_type.IMAGE1D]:
+        shape = [int(lines[0])]
+    elif image_type in [cl.mem_object_type.IMAGE2D]:
+        shape = [int(lines[0]), int(lines[1])]
+    elif image_type in [cl.mem_object_type.IMAGE3D]:
+        shape = [int(lines[0]), int(lines[1]), int(lines[2])]
+    else:
+        print('Unsupported image type for playback!')
+        shape = [int(lines[0]), int(lines[1]), int(lines[2])]

     format = cl.ImageFormat(int(lines[7]), int(lines[6]))
     return format, shape
@@ -42,6 +50,12 @@ def sampler_from_string(ctx, sampler_descr):
                     help='How often the kernel should be enqueued')
 args = parser.parse_args()

+# Read the enqueue number from the file
+with open('./enqueueNumber.txt') as file:
+    enqueue_number = file.read().splitlines()[0]
+
+padded_enqueue_num = str(enqueue_number).rjust(4, "0")
+
 arguments = {}
 argument_files = gl.glob("./Argument*.bin")
 for argument in argument_files:
@@ -51,10 +65,11 @@ def sampler_from_string(ctx, sampler_descr):
 buffer_idx = []
 input_buffers = {}
 output_buffers = {}
-buffer_files = gl.glob("./Buffer*.bin")
+buffer_files = gl.glob("./Pre/Enqueue_" + padded_enqueue_num + "*.bin")
 input_buffer_ptrs = defaultdict(list)
 for buffer in buffer_files:
-    idx = int(re.findall(r'\d+', buffer)[0])
+    start = buffer.find("_Arg_")
+    idx = int(re.findall(r'\d+', buffer[start:])[0])
     buffer_idx.append(idx)
     input_buffers[idx] = np.fromfile(buffer, dtype='uint8').tobytes()
     input_buffer_ptrs[arguments[idx]].append(idx)
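
The switch to searching from `_Arg_` matters because the new file names begin with the zero-padded enqueue number and may contain digits in the kernel name, so the first number in the path is no longer the argument index. The standalone sketch below illustrates this. The helper `arg_index_from_path` and the example path are hypothetical; the path is patterned on the `Enqueue_<padded number>_Kernel_<name>_Arg_<index>_Buffer.bin` names this script writes for its outputs.

```python
import re


def arg_index_from_path(path):
    """Return the kernel argument index encoded after "_Arg_" in a dump file name."""
    start = path.find("_Arg_")
    if start < 0:
        raise ValueError(f"no '_Arg_' marker in {path!r}")
    return int(re.findall(r'\d+', path[start:])[0])


# Hypothetical dump file for enqueue 12, kernel "vec4_add", argument 3.
padded = str(12).rjust(4, "0")
path = "./Pre/Enqueue_" + padded + "_Kernel_vec4_add_Arg_3_Buffer.bin"

first_number = int(re.findall(r'\d+', path)[0])  # old approach: picks up the enqueue number
arg_index = arg_index_from_path(path)            # new approach: picks up the argument index
print(first_number, arg_index)
```

Unlike the script, this sketch raises an error when the `_Arg_` marker is missing rather than searching from index -1.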
@@ -63,10 +78,11 @@ def sampler_from_string(ctx, sampler_descr):
 image_idx = []
 input_images = {}
 output_images = {}
-image_files = gl.glob("./Image*.raw")
+image_files = gl.glob("./Pre/Enqueue_" + padded_enqueue_num + "*.raw")
 input_images_ptrs = defaultdict(list)
 for image in image_files:
-    idx = int(re.findall(r'\d+', image)[0])
+    start = image.find("_Arg_")
+    idx = int(re.findall(r'\d+', image[start:])[0])
     image_idx.append(idx)
     input_images[idx] = np.fromfile(image, dtype='uint8').tobytes()
     input_images_ptrs[arguments[idx]].append(idx)
@@ -86,13 +102,12 @@ def sampler_from_string(ctx, sampler_descr):

 # Check if all input pointer addresses are unique
 if len(tmp_args) != len(set(tmp_args)):
-    print("Some of the buffers are aliasing, we will replicate this behavior")
+    print("Some of the buffers are aliasing, we will replicate this behavior.")

 ctx = cl.create_some_context()
 queue = cl.CommandQueue(ctx)
 devices = ctx.get_info(cl.context_info.DEVICES)

-# TODO Samplers
 samplers = {}
 sampler_files = gl.glob("./Sampler*.txt")
 for sampler in sampler_files:
@@ -120,19 +135,19 @@ def sampler_from_string(ctx, sampler_descr):
     gpu_images[idx] = cl.Image(ctx, mf.COPY_HOST_PTR, format, shape, hostbuf=input_images[idx])

 with open("buildOptions.txt", 'r') as file:
-    flags = [line.rstrip() for line in file]
-    print(f"Using flags: {flags}")
+    options = [line.rstrip() for line in file]
+    print(f"Using build options: {options}")

-with open('knlName.txt') as file:
-    knl_name = file.read()
+with open('kernelName.txt') as file:
+    kernel_name = file.read()

 if os.path.isfile("kernel.cl"):
-    print("Using kernel source code")
+    print("Using kernel source")
     with open("kernel.cl", 'r') as file:
         kernel = file.read()
-    prg = cl.Program(ctx, kernel).build(flags)
+    prg = cl.Program(ctx, kernel).build(options)
 else:
-    print("Using device binary")
+    print("Using kernel device binary")
     binary_files = gl.glob("./DeviceBinary*.bin")
     binaries = []
     for file in binary_files:
@@ -141,50 +156,49 @@ def sampler_from_string(ctx, sampler_descr):
     # Try the binaries to find one that works
     for idx in range(len(binaries)):
         try:
-            prg = cl.Program(ctx, [devices[0]], [binaries[idx]]).build(flags)
-            getattr(prg, knl_name)
+            prg = cl.Program(ctx, [devices[0]], [binaries[idx]]).build(options)
+            getattr(prg, kernel_name)
             break
         except Exception as e:
             pass

-knl = getattr(prg, knl_name)
+kernel = getattr(prg, kernel_name)
 for pos, argument in arguments.items():
-    knl.set_arg(pos, argument)
+    kernel.set_arg(pos, argument)

 for pos, buffer in gpu_buffers.items():
     for idx in pos:
-        knl.set_arg(idx, buffer)
+        kernel.set_arg(idx, buffer)

 for pos, image in gpu_images.items():
-    knl.set_arg(pos, image)
+    kernel.set_arg(pos, image)

 for pos, size in local_sizes.items():
-    knl.set_arg(pos, cl.LocalMemory(size))
+    kernel.set_arg(pos, cl.LocalMemory(size))

 for pos, sampler in samplers.items():
-    knl.set_arg(pos, sampler)
+    kernel.set_arg(pos, sampler)

 gws = []
 lws = []
-gws_offset = []
+gwo = []

 with open("worksizes.txt", 'r') as file:
     lines = file.read().splitlines()

 gws.extend([int(value) for value in lines[0].split()])
 lws.extend([int(value) for value in lines[1].split()])
-gws_offset.extend([int(value) for value in lines[2].split()])
+gwo.extend([int(value) for value in lines[2].split()])

-print(f"Global Worksize: {gws}")
-print(f"Local Worksize: {lws}")
-print(f"Global Worksize Offsets: {gws_offset}")
+print(f"Global Work Size: {gws}")
+print(f"Local Work Size: {lws}")
+print(f"Global Work Offsets: {gwo}")

 if lws == [0] or lws == [0, 0] or lws == [0, 0, 0]:
     lws = None

 for _ in range(args.repetitions):
-    cl.enqueue_nd_range_kernel(queue, knl, gws, lws, gws_offset)
-
+    cl.enqueue_nd_range_kernel(queue, kernel, gws, lws, gwo)

 for pos in gpu_buffers.keys():
     if len(pos) == 1:
@@ -196,9 +210,15 @@ def sampler_from_string(ctx, sampler_descr):
 for pos in gpu_images.keys():
     cl.enqueue_copy(queue, output_images[pos], gpu_images[pos], region=shape, origin=(0,0,0))

+if not os.path.exists("./Test"):
+    os.makedirs("./Test")
+
 for pos, cpu_buffer in output_buffers.items():
-    cpu_buffer.tofile("output_buffer" + str(pos) + ".bin")
+    outbuf = "./Test/Enqueue_" + padded_enqueue_num + "_Kernel_" + kernel_name + "_Arg_" + str(pos) + "_Buffer.bin"
+    print(f"Writing buffer output to file: {outbuf}")
+    cpu_buffer.tofile(outbuf)

 for pos, cpu_image in output_images.items():
-    cpu_image.tofile("output_image" + str(pos) + ".raw")
-
+    outimg = "./Test/Enqueue_" + padded_enqueue_num + "_Kernel_" + kernel_name + "_Arg_" + str(pos) + "_Image.raw"
+    print(f"Writing image output to file: {outimg}")
+    cpu_image.tofile(outimg)
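
Since the replay now writes its results into `./Test` under predictable file names, validating a replay can come down to a byte-for-byte comparison against the dumps from the original run. The sketch below is one possible way to do that and is not the logic of `capture_and_validate.py`. The function `compare_outputs` is hypothetical, and the reference directory is whatever location holds the original program's output dumps, which this diff does not name.

```python
from pathlib import Path


def compare_outputs(test_dir, reference_dir, patterns=("*.bin", "*.raw")):
    """Return the names of replay outputs that are missing from, or differ in, reference_dir."""
    mismatches = []
    reference_dir = Path(reference_dir)
    for pattern in patterns:
        for test_file in sorted(Path(test_dir).glob(pattern)):
            reference_file = reference_dir / test_file.name
            if not reference_file.is_file():
                mismatches.append(test_file.name)
            elif test_file.read_bytes() != reference_file.read_bytes():
                mismatches.append(test_file.name)
    return mismatches
```

An empty result means every replayed buffer and image matched its reference. A non-empty list names the files to inspect first.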
