New Emitters for XLA:GPU
Note: this is implemented but not launched. You can try it out using --xla_gpu_enable_mlir_emitters.
Context
Objective
Replace XLA:GPU's aging legacy emitters with a new baseline codegen that is maintainable, at least as performant, and more composable. This codegen should:
implement 100% of XLA:GPU's feature set (i.e., allow a complete removal of the legacy emitters)
be composable with other codegen strategies (in particular block-level codegen, e.g. Triton) and work on any GPU [1].
Do this with a reasonably-sized investment.
Background
XLA:GPU's legacy emitters are a constant maintenance burden. We estimate that during the priority fusion project, at least one person was permanently chasing down bugs and regressions caused by them. The major problems are:
Maintainability and understandability: the codegen is hard to understand, since it goes straight from HLO to LLVM IR. Changes take much longer than they should. Almost no one understands how codegen works, and almost everyone who does forgets soon after working on it.
Brittleness: it is easy to run off the "happy path", causing various problems at both compile time and runtime (crashes, slowness, incorrect results).
Modeling: as mentioned above, the codegen is hard to understand and therefore also very difficult to model. Modeling the codegen accurately is crucial for cost modeling.
Long term, the hope is that these problems will be solved by new codegen strategies. However, there is no near-term solution in sight, and there is substantial skepticism about a full replacement being feasible in 2024. Meanwhile, the problems persist, slowing down other projects.
We propose to replace the legacy emitters completely and quickly. To address all of the problems above, we need to:
Integrate codegen with the indexing map mechanism that we use in modeling. If modeling and codegen are based on the same abstractions, there is no risk of them going out of sync.
Have a more gradual lowering from HLO to LLVM. The abstraction jump is too high to do in one step.
Have a reimplementation of all of XLA:GPU's elemental ops.
The obvious choice for this is an MLIR pipeline. MLIR already has many lowerings for simple instructions (e.g. all of the math ops). Indexing maps are based on MLIR's affine maps, so we can reuse some existing infrastructure. Block-level codegen uses MLIR as well, so this also makes XLA:GPU less special.
Design
Overview
Today, XLA:GPU goes from HLO to LLVM IR in a single abstraction step.
While there are some C++ classes to aid in this conversion, they are fairly limited (mostly helping slightly with indexing) and they do not compose at all. This abstraction step is too big, slowing us down and reducing the quality of the codebase overall.
The core of this proposal is to introduce a more gradual lowering. We propose the following steps:
Partitioning the HLO execution graph into functions.
Lowering to a tensor-level program. Note: this is not part of the initial project; initially we lower directly from HLO to scalars. The tensor-level step will be needed in the future for block-level codegen.
Tiling the program (tensor to tensor).
Tiling the program (tensor to scalars) and emitting scalar MLIR (SCF + tensors).
Gradually lowering to LLVM
Indexing optimization
Function inlining
Math approximation
Lowering tensors to memory instructions
Detailed design
Graph partitioning
Consider a fusion like the one depicted above. Each pad and slice op changes the indexing of its input operand, so operands cannot be reused if they have multiple users. In this particular example, this is the case for add.16889 and add.16885. Since the latter is a transitive user of the former, the former's operands will be emitted at least four times using the approach employed by the current emitters.
There are many ops that trigger this behavior: slice, pad, gather, reshape and more.
This is the major cause of exponential code-size and compile-time issues in the legacy emitter infrastructure. Even though all the shapes here are tiny (so the code would run fast enough), we cannot emit fusions like this, because compilation would be unbearably slow.
The new emitters employ a different strategy: instead of recursively generating code for each node, we first partition the graph into individual functions that can be emitted together without emitting any node more than once. These functions are then connected by function calls.
In our graph partitioning, we place each HLO node into exactly one subgraph (or function). The subgraphs are built starting from the root of the fusion. The rules are simple:
If all users of an instruction are in the same subgraph (= use the same indexing), the instruction is placed in the same subgraph.
Otherwise, the instruction starts a new subgraph.
Some instructions change the indexing (e.g. transpose has a different output than input indexing). We do not currently track the exact indexing in this step; all indexings are assumed to be unique.
Some users can use the same operand more than once, in which case the uses usually have different indexing. For example, gather can use the index operand more than once if there's more than one index dimension, but the uses will be at different coordinates.
In this example, the lower function will contain two calls to the upper function, as sketched below.
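For illustration, a minimal MLIR sketch of such a partition, under purely hypothetical assumptions (the shapes, the exp/add ops, and the use of func.call where the real emitters use xla_gpu.pure_call): the lower subgraph needs its producer at two different indices, so it calls the upper subgraph's function twice instead of duplicating its code.

func.func @subgraph_upper(%p0: tensor<64xf32>, %i: index) -> f32 {
  // The upper subgraph is emitted exactly once as its own function.
  %v = tensor.extract %p0[%i] : tensor<64xf32>
  %e = math.exp %v : f32
  return %e : f32
}

func.func @subgraph_lower(%p0: tensor<64xf32>, %i: index) -> f32 {
  // The lower subgraph uses its producer at two different indices
  // (e.g. because of a pad or slice), so it calls @subgraph_upper twice.
  %c1 = arith.constant 1 : index
  %i1 = arith.addi %i, %c1 : index
  %a = func.call @subgraph_upper(%p0, %i) : (tensor<64xf32>, index) -> f32
  %b = func.call @subgraph_upper(%p0, %i1) : (tensor<64xf32>, index) -> f32
  %sum = arith.addf %a, %b : f32
  return %sum : f32
}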
Tiling to points / initial codegen
Subgraphs map straightforwardly to MLIR functions. We use the following signature:
That is, there is one tensor parameter per fusion parameter (regardless of whether or not the parameter is actually used by the function) and one index parameter per output dimension. The semantics are: compute the value of the root node(s) at the given indices.
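A hypothetical example of such a signature, assuming a fusion with two f32 parameters and a 2D root (the names, shapes, and the tanh op are all illustrative): both fusion parameters appear as tensor arguments even though only one is used, and there is one index argument per output dimension.

// Returns the value of the fusion root at (%y, %x). %p1 is passed even though
// this particular subgraph does not read it.
func.func @fused_computation_root(%p0: tensor<64x128xf32>, %p1: tensor<64x128xf32>,
                                  %y: index, %x: index) -> f32 {
  %v = tensor.extract %p0[%y, %x] : tensor<64x128xf32>
  %r = math.tanh %v : f32
  return %r : f32
}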
The codegen for function bodies is straightforward and will not be described here. Instead we refer to the implementation (look for SubgraphToMlir).
Epilogue fusion
The signatures above do not work as-is for the epilogues of input fusions: fusions where the iteration space is defined not by the root, but by the inputs of the hero (the reduction in this case). We cannot simply invoke a function for the root: in this example it only contains 8192 elements, limiting the achievable parallelism and resulting in poor memory access patterns. Instead, we isolate and inject the reduction - a fancy name for "it becomes a function parameter of the epilogue subgraph". The signature for the negation subgraph therefore becomes:
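A hypothetical sketch of such a signature, assuming the fusion reduces a tensor<8192x4096xf32> parameter to 8192 elements and then negates the result (all names and shapes here are illustrative, not the document's concrete example): the fusion parameter and the output index are passed as before, and the hero's already-computed value at that index is injected as an extra scalar argument.

// Epilogue subgraph for the negate root: %reduced is the reduction's result at
// output index %i, injected by the caller instead of being recomputed here.
func.func @fused_negate_epilogue(%p0: tensor<8192x4096xf32>, %i: index,
                                 %reduced: f32) -> f32 {
  %r = arith.negf %reduced : f32
  return %r : f32
}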
This approach is used for all fusions that customize the iteration order and support epilogues (transpose, reduce, concatenate).
Because it is possible for instructions to be both part of the epilogue and the prologue of a hero (e.g. reduction -> broadcast -> reduction, with both reductions being roots), we duplicate the instructions in the epilogue and emit them twice.
Inlining
The approach above sometimes results in functions that are too small. For example, sequences of transposes, broadcasts, and some reshapes result in functions that merely permute the indexing. The partitioning code never duplicates HLO nodes into multiple subgraphs, since it knows nothing about the implementation of instructions. Instead, we rely on function inlining to later merge these functions into their callers. To avoid code bloat, we only merge functions that are small or are called only once. Inlining is done in post-order, so other functions are inlined into a callee before we decide whether to inline that callee itself.
Indexing optimization
The optimization of index computations is a core component of the emitters, not only because of the direct cost of the computations, but also because clean index computations are crucial for optimizations like load-store vectorization and loop-invariant code motion to work effectively, in particular in the presence of bounds checks.
We have identified the following core optimizations:
Arithmetic simplification. Of particular importance here is the simplification of modulus and division expressions, extracting symbols that represent loop variables from them. For example, a typical index computation for the minor dimension looks like (thread_id * 4 + s0) % 1024, where s0 is a loop variable with domain [0; 4). Here, s0 can be extracted from the modulus, yielding (thread_id * 4) % 1024 + s0, which enables LLVM's load-store vectorizer to optimize the loads (see the affine-map sketch after this list).
Range-based simplification. We emit many bounds checks that can later be optimized away, because the range of indices is known.
Choosing the correct index type. Using i64 indices only when strictly necessary can result in significantly simpler code.
Merging index computations: this is mostly done automatically.
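As an illustration of the modulus simplification above, written in MLIR's affine-map notation (the maps themselves are hypothetical): because thread_id * 4 is a multiple of 4 and s0 < 4, the symbol can be pulled out of the modulus, making the stride-1 access explicit.

// Before simplification: s0 (range [0, 4)) is trapped inside the mod.
#before = affine_map<(thread_id)[s0] -> ((thread_id * 4 + s0) mod 1024)>
// After extracting s0: consecutive s0 values now provably access consecutive
// elements, which is what the load-store vectorizer needs to see.
#after = affine_map<(thread_id)[s0] -> ((thread_id * 4) mod 1024 + s0)>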
Tensor lowerings
To keep as many optimization opportunities open as possible, we keep the program functional for as long as possible: it uses tensors with value semantics until they are lowered to LLVM. We only use a handful of instructions:
tensor.insert and tensor.extract (and our custom versions of them: xla_gpu.predicated_insert and xla_gpu.predicated_extract).
xla_gpu.allocate_shared. This instruction allocates a shared memory tile, which semantically is a distributed tensor.
xla_gpu.sync_threads. This instruction lowers to a barrier and synchronization point. Logically, it also synchronizes all instructions that happen on distributed tensors.
xla_gpu.atomic_rmw. This instruction lowers to a CAS loop, or to a simpler atomic operation if one is supported.
We lower tensor.insert and tensor.extract directly to llvm.store and llvm.load ops. allocate_shared lowers to a global buffer in shared memory.
Emitters
This section gives a high-level overview of each of the new emitters. We only describe in detail the parts that differ from the legacy emitters in interesting ways. Many components are shared between the emitters; these are only described once.
Loop
The new loop emitter (like all new emitters) uses the existing thread-to-output indexing maps (ComputeThreadIdToOutputIndexing) for codegen. The core of the emitter looks like this:
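A minimal sketch of that core, under illustrative assumptions (a 1D output of 1024 elements, four elements per thread, a simple linear indexing map, and a hypothetical subgraph function @loop_fusion_root); the real emitter derives the loop bounds, indexing expressions, and bounds checks from the indexing map, and xla_gpu.pure_call is written as in the transpose example further below.

func.func @loop_fusion(%p0: tensor<1024xf32>, %output: tensor<1024xf32>)
    -> tensor<1024xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %thread_id = gpu.thread_id x
  // The output tensor is threaded through the loop as an iter_arg
  // (destination-passing style); each iteration inserts one element.
  %result = scf.for %i = %c0 to %c4 step %c1 iter_args(%acc = %output)
      -> (tensor<1024xf32>) {
    // Thread-to-output indexing map, here simply thread_id * 4 + i.
    %base = arith.muli %thread_id, %c4 : index
    %out_index = arith.addi %base, %i : index
    // Compute the root's value at %out_index via the subgraph function.
    %value = xla_gpu.pure_call @loop_fusion_root(%p0, %out_index)
    %updated = tensor.insert %value into %acc[%out_index] : tensor<1024xf32>
    scf.yield %updated : tensor<1024xf32>
  }
  return %result : tensor<1024xf32>
}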
The loop is generated from the output indexing map. In general, it also includes a bounds check for the output. Output tensors are passed to the function in destination-passing style and threaded through the loops as iter_args, with each iteration updating one particular element of the tensor. This is not a limitation of the lowerings: there can be multiple insertions per iteration and multiple output tensors; the loop emitter simply does not make use of this.
Vectorization works similarly to the code shown above: an inner loop that accesses consecutive elements of the input/output is generated. Thanks to our indexing simplifications, nothing else is needed: the loops will automatically be vectorized.
All subgraph functions are generated automatically using the mechanism described above.
Transpose
The transpose emitter is similar to the loop emitter. The only difference is that there is a shared memory step between the read and write.
In the read step, consecutive threads process physically consecutive elements of the input and write the results to shared memory. In the write step, threads evaluate the epilogue and write physically consecutive elements of the output. The core of the emitter looks like this:
func.func @transpose_fusion(%input: tensor<128x64xf32>, %output: tensor<64x128xf32>)
    -> tensor<64x128xf32> {
  %thread_id = gpu.thread_id x
  %shared_tile = xla_gpu.allocate_shared : tensor<32x32xf32>
  ...
  %filled_shared = scf.for %i = %c0 to %c8 step %c1 iter_args(%arg0 = %shared_tile)
      -> tensor<32x32xf32> {
    // Compute the indexing and the input element, as in the loop emitter.
    %updated_tile = tensor.insert %result into %arg0[%shared_y, %shared_x]
        : tensor<32x32xf32>
    scf.yield %updated_tile : tensor<32x32xf32>
  }
  // Synchronize all threads; conceptually this merges the writes that happened
  // before into a single tensor. It simply lowers to a barrier and sync; no
  // data is transferred.
  %synced = xla_gpu.sync_threads %filled_shared : tensor<32x32xf32>
  %written = scf.for %i = %c0 to %c8 step %c1 iter_args(%arg0 = %output)
      -> tensor<64x128xf32> {
    // Compute the indexing, as in the loop emitter.
    %shared_element = tensor.extract %synced[%shared_x, %shared_y] : tensor<32x32xf32>
    // Evaluate the epilogue, as described under epilogue fusion.
    %result = xla_gpu.pure_call @epilogue(%input, %output_y, %output_x, %shared_element)
    %updated_output = tensor.insert %result into %arg0[%out_y, %out_x]
        : tensor<64x128xf32>
    scf.yield %updated_output : tensor<64x128xf32>
  }
  return %written : tensor<64x128xf32>
}
This is very close to the legacy transpose emitter, but it supports vectorizing loads and stores. Later, we will move towards direct tiled codegen for this.
Reduce
There are three different reduce emitters: row, multi-row and column. They differ in which threads combine elements and how shared memory is used.
The current algorithm works as follows:
Each thread loops over a subshape of the input and reduces elements within the subshape.
Row-reduction only: the elements in the warp are reduced to one (or in the case of multi-row reductions, n)
The result is written to shared memory (row, column reduction) or global memory, after applying the epilogue (multi-row reduction)
Partial results are loaded from shared memory (transposed in the case of column reductions)
Elements in the warp are again reduced to one
The epilogue is evaluated
The result is written to global memory, using atomics if more than one block works on an element of the output.
This way, reads are always coalesced; writes may be coalesced in the case of multi-row reductions. So in summary, the new reduction emitters work as follows:
Row reduction:
Each thread processes n elements of the input and produces one output element. The result is kept in registers.
A warp shuffle reduces the 32 per-thread values to one (see the warp-shuffle sketch after this list). Lane 0 writes the result to shared memory.
The first warp reads up to 32 elements from shared memory. Another warp shuffle reduces these elements to one. Lane 0 writes the result to global memory.
Multi-row reduction:
Like above, but the first warp shuffle reduces 32 elements to 2, 4, 8, or 16. The result is written to global memory instead of shared memory (since there is no second shuffle step).
Column reduction:
Each thread processes n elements of the input and produces one output element. The result is written to shared memory, but transposed.
A warp shuffle reduces up to 32 elements from shared memory. Lane 0 writes the result to global memory.
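For illustration, a sketch of the warp-level step referenced above for an f32 add-reduction (this is not the emitters' actual output; it only shows the shuffle pattern): butterfly shuffles with offsets 16, 8, 4, 2, and 1 leave the warp-wide sum in lane 0. For multi-row reductions, the same pattern simply stops after fewer shuffle steps.

// Reduce the 32 per-lane partial sums of a warp into one value (in lane 0).
func.func @warp_reduce_add(%partial: f32) -> f32 {
  %width = arith.constant 32 : i32
  %o16 = arith.constant 16 : i32
  %o8 = arith.constant 8 : i32
  %o4 = arith.constant 4 : i32
  %o2 = arith.constant 2 : i32
  %o1 = arith.constant 1 : i32
  %v16, %m16 = gpu.shuffle xor %partial, %o16, %width : f32
  %r16 = arith.addf %partial, %v16 : f32
  %v8, %m8 = gpu.shuffle xor %r16, %o8, %width : f32
  %r8 = arith.addf %r16, %v8 : f32
  %v4, %m4 = gpu.shuffle xor %r8, %o4, %width : f32
  %r4 = arith.addf %r8, %v4 : f32
  %v2, %m2 = gpu.shuffle xor %r4, %o2, %width : f32
  %r2 = arith.addf %r4, %v2 : f32
  %v1, %m1 = gpu.shuffle xor %r2, %o1, %width : f32
  %r1 = arith.addf %r2, %v1 : f32
  return %r1 : f32
}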
There are some limitations in the current codegen: shared memory is used even when a row reduction is handled by a single warp, and there is no multi-column reduction (analogous to the multi-row reduction). The runtime impact of these is small but measurable (1-2%). In the new emitters, we fixed the first issue, since that is much more straightforward thanks to the more high-level lowering.
Scatter
The scatter emitter maps every element of the updates to a single thread. For every element we compute the corresponding scatter index and then combine the update element with the element of the output. If the indices are not unique, then we emit an xla::gpu::AtomicRMWOp, otherwise a non-concurrent update is feasible.
Here is an example of the actual IR that is generated by the scatter emitter.
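As a rough illustration (not the emitter's actual output), a simplified scatter for a 1D f32 operand with unique indices and an add combiner might look like the sketch below; with non-unique indices, the read-modify-write would be an xla_gpu.atomic_rmw instead. All names and shapes are hypothetical.

// One thread handles update element %update_id: read its scatter index,
// bounds-check it, and combine the update into the output functionally.
func.func @scatter_update(%operand: tensor<1024xf32>, %indices: tensor<64xi32>,
                          %updates: tensor<64xf32>, %update_id: index)
    -> tensor<1024xf32> {
  %c0 = arith.constant 0 : index
  %c1024 = arith.constant 1024 : index
  %idx_i32 = tensor.extract %indices[%update_id] : tensor<64xi32>
  %idx = arith.index_cast %idx_i32 : i32 to index
  %ge = arith.cmpi sge, %idx, %c0 : index
  %lt = arith.cmpi slt, %idx, %c1024 : index
  %in_bounds = arith.andi %ge, %lt : i1
  %result = scf.if %in_bounds -> (tensor<1024xf32>) {
    %update = tensor.extract %updates[%update_id] : tensor<64xf32>
    %old = tensor.extract %operand[%idx] : tensor<1024xf32>
    %new = arith.addf %old, %update : f32
    %updated = tensor.insert %new into %operand[%idx] : tensor<1024xf32>
    scf.yield %updated : tensor<1024xf32>
  } else {
    scf.yield %operand : tensor<1024xf32>
  }
  return %result : tensor<1024xf32>
}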
Others
Concatenate, DUS, and input slices are all manifestations of the same idea: they basically use the loop emitter, but with a different iteration space.
Future evolution
The concepts described above straightforwardly extend to tensor lowerings and tiling.
Lowering to tensors
Lowering to tensors is a necessary step for the further simplification of the transpose and reduce emitters and for integration with block-level codegen backends. The graph partitioning can be kept as-is; instead of emitting scalar code, we emit MHLO. The elemental MLIR emitter will evolve into logic that specializes a tensor function to a scalar function.
Tiling
Once we have a tensor program, tiling is just a matter of specializing functions for particular tensor sizes and propagating indexing maps. Instead of indices we can outline functions with offsets-sizes-strides arguments. Some operations cannot be tiled arbitrarily with reasonable indexing maps (e.g. certain reshapes). When the fusion clusters are formed, we can compute the constraints on the slice parameters.
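As a rough sketch of that direction (everything here is hypothetical: the names, the shapes, and the choice to bake tile sizes and strides into the types), a tiled subgraph function could take tile offsets instead of element indices and return a whole tile:

func.func @tiled_subgraph(%p0: tensor<1024x1024xf32>, %off_y: index, %off_x: index)
    -> tensor<32x64xf32> {
  // Extract the 32x64 tile at the given offsets (unit strides) and apply the
  // subgraph's computation to the whole tile.
  %tile = tensor.extract_slice %p0[%off_y, %off_x] [32, 64] [1, 1]
      : tensor<1024x1024xf32> to tensor<32x64xf32>
  %exp = math.exp %tile : tensor<32x64xf32>
  return %exp : tensor<32x64xf32>
}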
Dependencies
This project takes on no new dependencies. We already depend on MLIR via Triton emitters. The total code size will be less than before.
CPU
Once the GPU emitters are migrated, XLA:CPU will be the only user of elemental_ir_emitter. While elemental_ir_emitter is not a significant maintenance burden (less than one non-rollback, non-cleanup CL per month over the last year), it might be a good idea to migrate CPU to the new infrastructure that we build here. We expect this to just work, but no significant time has been spent on it so far, since it is outside the team's responsibility. However, we keep this constraint in mind and avoid making any decisions that would make it harder to adopt the new emitters on CPU.
Quality attributes
Runtime performance
The goal of this project is performance parity. We expect some improvements nevertheless, because of:
More efficient index computations, better CSE
Better vectorization (through optimizations, but also explicit vectorization in the reduce and transpose emitters)
More efficient reductions by avoiding shared memory
Speculative: More efficient launch grids
Compile time performance
The current fusion passes carefully avoid pathological compile time cases. Therefore, there will be no compile time improvement right off the bat. Nevertheless, there are opportunities for compile time improvements in the future:
Simplified analyses in fusion: no need to estimate codegen cost anymore - runtime cost is the limiting factor in all cases.
Shared indexing analyses for fusion and codegen.
Testability
The new codegen is more testable than the old one, since the initial lowering is more high-level. We use standard FileCheck tests.
Launch plans
We plan a gradual launch (emitter by emitter), starting with simple fusions (loop emitter), followed by more complex ones (tiling emitters). For each phase, we will test the changes against a large set of Google-internal benchmarks. We will also give clients sufficient time between these stages to test their workloads.
Footnotes
[1] In principle, it works on the CPU as well.