This is an exercise that showcases NPP capabilities by applying a Canny border filter to one or more input images in PGM format. For each input image, the filter is applied by calling the NPP function `nppiFilterCannyBorder_8u_C3C1R_Ctx`, and the output is stored into a PGM image with the same name and a `boxed_` prefix.
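As a rough illustration of the NPP call pattern (not the project's actual code), the sketch below uses the single-channel `nppiFilterCannyBorder_8u_C1R_Ctx` variant together with `nppiFilterCannyBorderGetBufferSize`; the thresholds, derivative kernel and border mode are assumptions, not the values used by this project.

```cpp
#include <cuda_runtime.h>
#include <npp.h>

// Minimal sketch: run the NPP Canny border filter on a grayscale image that is
// already in device memory. Enum choices and thresholds below are assumptions.
NppStatus canny_edges(const Npp8u *d_src, int src_step, Npp8u *d_dst, int dst_step,
                      NppiSize size, NppStreamContext ctx) {
  // NPP needs a scratch buffer whose size depends on the ROI.
  int buffer_size = 0;
  nppiFilterCannyBorderGetBufferSize(size, &buffer_size);

  Npp8u *d_buffer = nullptr;
  cudaMalloc(&d_buffer, buffer_size);

  NppiPoint src_offset = {0, 0};  // filter the whole image
  NppStatus status = nppiFilterCannyBorder_8u_C1R_Ctx(
      d_src, src_step, size, src_offset,      // source image and its layout
      d_dst, dst_step, size,                  // destination ROI
      NPP_FILTER_SOBEL, NPP_MASK_SIZE_3_X_3,  // derivative kernel and mask size
      72, 256,                                // low/high hysteresis thresholds (assumed)
      nppiNormL2, NPP_BORDER_REPLICATE,       // gradient norm and border handling
      d_buffer, ctx);                         // scratch buffer and stream context

  cudaFree(d_buffer);
  return status;
}
```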
This implementation exploits the available concurrency on a CUDA device by double-buffering the execution of the various operations. For more details about the double-buffering implementation, see the dedicated section.
For the basic functionality, a Makefile is available. It automatically downloads the dependencies (see the list of dependencies) and builds the executable when needed, requiring no manual intervention or extra arguments.
To compile and run on a single image, you may run
make test
To compile and run on an entire set of images, you may run
make test_dir
This command:
- builds the executable `edge_detect.x`
- creates the directory `images/input`
- extracts the PGM images inside `images/images.tar.gz` into `images/input`
- creates the directory `images/output`
- runs `edge_detect.x` on the directory `images/input`, with output into `images/output`
To just compile
make build
To clean everything
make distclean
To clean only the built binaries
make clean
To see the synopsis
$ ./edge_detect.x --help
Usage: edge_detect [--help] [-o VAR] [--batch VAR] [--dir] input

Positional arguments:
  input       input file or directory (see '--dir' option)

Optional arguments:
  -h, --help  shows help message and exits
  -o          output file or directory (depending on input) [nargs=0..1] [default: "."]
  --batch     batch size (none means decide from hardware)
  --dir       input is a directory
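For example, to process a whole directory with an explicit batch size, you could run (the paths and the batch value here are only illustrative)
$ ./edge_detect.x --dir --batch 4 -o images/output images/input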
To generate the code documentation (via Doxygen), run
make doxygen
which will produce it in HTML format, with its home page at `docs/html/index.html`.
If you run `make test_dir`, the output shows the compilation and run commands at the beginning and, at the end, the output images in `images/output`, whose names correspond to the input images (listed by the intermediate `ls images/input` command) with a `boxed_` prefix.
The structure of the repository follows the C++ Canonical Project Structure.
The code is formatted via `make clang-format` and checked via `make clang-tidy`.
Processing a single image essentially consists of 3 steps, run in the following order:
- `LOAD`: loading the image from file and initializing the related NPP objects (memory allocations, data copies, ...);
- `PROCESS`: sending the data to the device, running the computation, and loading the results into main memory;
- `STORE`: storing the result image into a file.
The current implementation runs these steps in a double-buffered way, so as to overlap the `LOAD` and `PROCESS` operations (run together) with `STORE`. Furthermore, multiple operations are sent concurrently to the CUDA device in an asynchronous fashion via CUDA streams, thus leveraging the available parallelism. In particular, the user can control the number of parallel operations via the command-line argument `--batch <BATCH_SIZE>`, where `<BATCH_SIZE>` is the desired batch size, i.e., the number of `LOAD`-and-`PROCESS` or `STORE` operations issued to the device. The algorithm is summarised by the following Python-like pseudocode:
```
# user's inputs
IMAGES := [0, N)  list of N images to be processed
BATCH_SIZE := size of a single batch

BATCHES = split IMAGES in groups of at most BATCH_SIZE images

# prologue: send first batch to device
BATCH = BATCHES[0]
for IMAGE in BATCH:
    LOAD(IMAGE)
    PROCESS(IMAGE)

# main loop: process following batches
for BATCH_NUM in [1, len(BATCHES)):
    # LOAD-PROCESS phase: send new batch
    BATCH = BATCHES[BATCH_NUM]
    for IMAGE in BATCH:
        LOAD(IMAGE)
        PROCESS(IMAGE)

    # STORE phase: gather results from previous batch and store them
    PREV_BATCH = BATCHES[BATCH_NUM-1]
    for IMAGE in PREV_BATCH:
        STORE(IMAGE)

# epilogue: gather results of last sent batch and store them
PREV_BATCH = BATCHES[len(BATCHES)-1]
for IMAGE in PREV_BATCH:
    STORE(IMAGE)
```
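At the C++ level, one way to realise the asynchronous `LOAD`-`PROCESS` and `STORE` operations is to give each in-flight image its own CUDA stream, as in the sketch below; the record type and helper names are hypothetical and do not come from the repository.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical bookkeeping record for one in-flight image.
struct InFlightImage {
  cudaStream_t stream{};             // stream all of this image's work is queued on
  unsigned char *h_src{}, *h_dst{};  // pinned host buffers (filled by LOAD, read by STORE)
  unsigned char *d_src{}, *d_dst{};  // device buffers
  std::size_t bytes{};               // image size in bytes
};

// LOAD-PROCESS: queue upload, filtering and download without blocking the host.
void issue(InFlightImage &img) {
  cudaMemcpyAsync(img.d_src, img.h_src, img.bytes, cudaMemcpyHostToDevice, img.stream);
  // ... enqueue the NPP Canny filter on the same stream via its NppStreamContext ...
  cudaMemcpyAsync(img.h_dst, img.d_dst, img.bytes, cudaMemcpyDeviceToHost, img.stream);
}

// STORE: wait only for this image's stream, then write h_dst to disk.
void store(InFlightImage &img) {
  cudaStreamSynchronize(img.stream);
  // ... write img.h_dst out as a PGM file ...
}
```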
Keeping track of in-flight operations requires intermediate data structures: `<BATCH_SIZE>` data structures are written in the LOAD-PROCESS phase (iteration `BATCH_NUM`) to keep track of the images sent to the device for processing, and `<BATCH_SIZE>` data structures are read during the STORE phase to gather the results of the previous LOAD-PROCESS phase (iteration `BATCH_NUM-1`) and store them on disk. At the end of each `BATCH_NUM` iteration, these two groups are swapped, so that the first group (written in the LOAD-PROCESS phase) is read during the following STORE phase, while the second group (read in the STORE phase and thus no longer needed) is re-used in the following LOAD-PROCESS phase.
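A sketch of how this swap could look in C++, continuing the hypothetical record and helpers from the previous sketch (none of these names come from the repository):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct InFlightImage { /* buffers and CUDA stream, as in the sketch above */ };

// Assumed helpers: store() from the sketch above, plus a function that reuses
// the records in `out` to LOAD and PROCESS every image of a batch.
void store(InFlightImage &img);
void load_batch_into(const std::vector<std::string> &batch, std::vector<InFlightImage> &out);

// Hypothetical double-buffered driver mirroring the pseudocode above.
void run(const std::vector<std::vector<std::string>> &batches, std::size_t batch_size) {
  std::vector<InFlightImage> writing(batch_size);  // filled by LOAD-PROCESS this iteration
  std::vector<InFlightImage> storing(batch_size);  // holds the previous batch for STORE

  load_batch_into(batches[0], writing);            // prologue: send the first batch
  std::swap(writing, storing);

  for (std::size_t b = 1; b < batches.size(); ++b) {
    load_batch_into(batches[b], writing);          // new batch goes to the device...
    for (auto &img : storing) store(img);          // ...while the previous one is stored
    std::swap(writing, storing);                   // roles flip for the next iteration
  }

  for (auto &img : storing) store(img);            // epilogue: last issued batch
}
```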
To inspect the output, you may want to convert it from PGM to PNG using ImageMagick
magick <file name>.pgm <file name>.png
The images inside `images/images.tar.gz` (used as input) were downloaded from https://sipi.usc.edu/database/database.php?volume=sequences (where they are freely available) and converted to PGM in order to avoid a flood of warnings.
Note
To convert each image, the following command was used
magick <file name>.tiff -depth 8 <file name>.pgm
The `argparse` library is available under the MIT License.
The `cuda-samples` (v13.0) are required to compile the executable and are available under their own license.