
thesis

Introduction, Justification

  • The organization of computations and data for a given algorithm is constrained by a fundamental tension between parallelism, locality, and redundant computation of shared values.

    • This is addressed by a systematic model of schedules.
  • Stencil computation - repeated application of a kernel across a dataset.

  • loop fusion

  • loop fusion + tiling slides

  • This is also why libraries of optimized code cannot deliver efficient performance when building real image processing pipelines: individually optimized subroutines do not compose into an optimized whole, since they cannot reorganize computation for locality or parallel execution across function boundaries.

  • I believe the right way to program image processing pipelines is to separate the intrinsic algorithm—what is computed—from the concerns of efficiently organizing it for machine execution—decisions about storage and the ordering of computation. (A sketch of this separation follows this list.)

  • In particular, recursion is only allowed within a single function, using update definitions, and the recursion is bounded to a fixed depth before it begins by an explicit reduction domain. As a result, Halide’s language of algorithms is not Turing-complete, but is amenable to extensive analysis and transformation.

  • Today, relative to the energy cost of doing some arithmetic operation on a piece of data, loading or storing that data in a small local SRAM like a cache can be several times more expensive; moving the result 10 millimeters across the chip is an order of magnitude more expensive; and moving it to or from on-chip RAM is three to four orders of magnitude more expensive than computing the value in the first place. This disparity is only growing over time.

    • CN (Chris Nuernberger) - This precisely matches my understanding of the energy costs of NVIDIA hardware. NVIDIA studied this extensively because the one thing cellphones and supercomputers have in common is that both are power limited. It also opens up other fields of optimization (power optimization).
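As a concrete anchor for the algorithm/schedule separation above, here is a hedged sketch in Halide's C++ embedding of the well-known two-stage blur; the tile sizes and vector widths are illustrative choices, not recommendations.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Algorithm: *what* is computed -- a two-stage blur, written as
    // pure functions over an infinite integer domain.
    ImageParam input(UInt(16), 2);
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: *how* it is organized -- ordering, storage, parallelism.
    // Changing these lines never changes the result, only the performance.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);

    return 0;
}
```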

Language

Note that the language consists of two parts: algorithm definitions and schedule definitions.

Algorithm Definition

  • Func - a pure function, defining image values over some domain.
  • Var - a free variable in the domain of a function.
  • IterDom - a multi-dimensional iteration domain, effectively an ordered list of bounded variables. [In the current implementation, what I here call an IterDom is instead named RDom.]
  • Expr - an expression defining the value of a function in terms of the free variables which make up its domain, as well as constants, iteration domains, parameters, and the application of other functions.
  • Image - an immutable reference to an external memory buffer, visible to the algorithm as a function which may be applied only over the finite domain given by the image’s dimensions.
  • Param - a runtime variable parameter, providing a scalar argument to the algorithm.
  • update-definitions - Beyond their initial pure definition, functions may have one or more recursive update definitions, which redefine the value at points given by an output coordinate expression in terms of prior values of the function. (See the sketch after this list.)
  • guard-bands - Since functions are defined over an infinite domain, boundary conditions can be handled safely and efficiently by computing arbitrary guard bands of extra values as needed. Guard bands are a common pattern in image processing code, both for performance concerns like alignment, and for safety.
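As a hedged sketch of how these constructs compose (using the current implementation's name RDom for what the text calls IterDom), here is a small histogram pipeline; the offset parameter is my own illustrative addition, not from the thesis.

```cpp
#include "Halide.h"
using namespace Halide;

ImageParam input(UInt(8), 2);   // Image: external buffer, applied only over
                                // the finite domain of its dimensions
Param<int> offset("offset");    // Param: runtime scalar argument
Var x("x");                     // Var: free variable in the domain
Func hist("hist");

hist(x) = 0;                    // pure initial definition (an Expr in x)

// Update definition: recursion bounded before it begins by an explicit
// reduction domain (RDom) spanning the input image.
RDom r(input);
hist(clamp(cast<int>(input(r.x, r.y)) + offset, 0, 255)) += 1;
```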

Language Concepts

  • When the four-output, two-stage blur is split into two tiles, we introduce five pixels of redundant excess computation. In general, the amount of excess computation is proportional both to the amount of reuse in the original graph—the number of consumer tasks which depend on each producer, given by the size of the stencil—and also to the granularity at which the graph is split.

    Together, parallelism, locality, and the total amount of computation are fundamentally in tension in the organization of computation. (This is not a limitation of the series-parallel structure; it applies similarly to any parallel organization.) In practice, the ideal organization for a given algorithm on a given architecture must balance all three; the possible choices for how to do so are combinatorially complex, and the best choice is subtle and unpredictable. The schedule sketch below makes the trade-off concrete.
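In Halide terms (a hedged sketch reusing blur_x and blur_y from the earlier example; the tile size is illustrative), the same algorithm can sit at different points in this trade-off purely by choice of schedule:

```cpp
// 1. Breadth-first: all of blur_x, then all of blur_y.
//    No redundant work, but a full-size temporary and poor locality:
//      blur_x.compute_root();
//
// 2. Fully inlined (Halide's default for blur_x): perfect locality and no
//    storage, but each blur_x value is recomputed for every consumer --
//    3x redundant work for this 3-wide stencil.
//
// 3. Tiled: compute blur_x per tile of blur_y. Only the stencil overlap at
//    tile edges is recomputed; per-pixel redundancy shrinks as tiles grow,
//    while locality degrades -- exactly the tension described above.
blur_y.tile(x, y, xi, yi, 256, 32).parallel(y);
blur_x.compute_at(blur_y, x);
```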

  1. The work, defined as the number of leaf tasks, gives the total size of the problem to be computed.
  2. Conversely, the span (or critical path) of the whole graph, the length of the longest chain of dependent tasks that must execute sequentially, gives a lower bound on the execution time, even assuming infinitely many parallel processors.

Given these two terms, the amount of parallelism in a given organization can be understood as the work/span ratio.
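As a concrete illustration (my own example, not from the thesis): summing n numbers with a balanced binary reduction tree performs work of n − 1 additions, while its span is the ⌈log₂ n⌉ levels of the tree, so its parallelism is work/span ≈ n/log₂ n, which is ample for large n even though every addition depends on two earlier results.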

  1. The intrinsic algorithm defines a basic graph of tasks connected by dependence edges.
  2. The organization optionally duplicates nodes to split shared dependence edges; hierarchically groups tasks for nested parallel execution; overlays this graph with ordering edges; and finally overlays the graph with reuse edges, mapping the output of tasks to shared memory locations.
  • Locality relates to the reuse distance between dependent tasks along the order of computation.
  • Parallelism is summarized by the work/span ratio in the scheduled graph.
  • Task granularity is given by the work within a single parallel task.
  • Redundant work corresponds to the number of nodes executed relative to the minimum number required by the intrinsic algorithm.
  • Storage footprint corresponds to the number of nodes without incoming reuse edges.
  • Allocation granularity relates to the ratio of storage footprint to the number of unique allocation groups.
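As a hedged gloss (my own mapping, not spelled out in this section), Halide's scheduling directives can be read as choices of these graph quantities, again reusing blur_x and blur_y from the earlier sketch:

```cpp
// Sizes are illustrative, not recommendations.
blur_y.tile(x, y, xi, yi, 64, 64);   // hierarchical grouping of tasks:
                                     // sets task granularity

blur_x.store_at(blur_y, x)           // one allocation per tile: sets the
                                     // allocation granularity and bounds
                                     // the storage footprint
      .compute_at(blur_y, yi);       // compute a few rows at a time inside
                                     // the tile: short reuse distance
                                     // (locality) between the producer rows
                                     // and the consumer rows that read them
```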