New Emitters for XLA:GPU
Note: this is implemented but not launched. You can try it out using --xla_gpu_enable_mlir_emitters.
Context
Objective
Replace XLA:GPU's aging legacy emitters with a new baseline codegen that is maintainable, at least as performant, and more composable. This codegen should:
implement 100% of XLA:GPU's feature set (i.e., allow a complete removal of the legacy emitters)
be composable with other codegen strategies (in particular block-level codegen, e.g. Triton) and work on any GPU [1].
Do this with a reasonably-sized investment.
Background
XLA:GPU's legacy emitters are a constant maintenance burden. We estimate that during the priority fusion project, at least one person was permanently chasing down bugs and regressions caused by them. The major problems are:
Maintainability and understandability: the codegen is hard to understand, since it goes straight from HLO to LLVM IR. Changes take much longer than they should. Almost no one understands how codegen works, and almost everyone who does forgets soon after working on it.
Brittleness: it is easy to run off the "happy path", causing various problems at both compile time and runtime (crashes, slowness, incorrect results).
Modeling: as mentioned above, the codegen is hard to understand and therefore also very difficult to model. Modeling the codegen accurately is crucial for cost modeling.
Long term, the hope is that these problems will be solved by new codegen strategies. However, there is no near-term solution in sight, and there is substantial skepticism about a full replacement being feasible in 2024. Meanwhile, the problems persist, slowing down other projects.
We propose to replace the legacy emitters completely and quickly. To address all of the problems above, we need to:
Integrate codegen with the indexing map mechanism that we use in modeling. If modeling and codegen are based on the same abstractions, there is no risk of them going out of sync.
Have a more gradual lowering from HLO to LLVM. The abstraction jump is too high to do in one step.
Have a reimplementation of all of XLA:GPU's elemental ops.
The obvious choice for this is an MLIR pipeline. MLIR already has many lowerings for simple instructions (e.g. all of the math ops). Indexing maps are based on MLIR's affine maps, so we can reuse some existing infrastructure. Block-level codegen uses MLIR as well, so this also makes XLA:GPU less special.
Design
Overview
Today, XLA:GPU goes from HLO to LLVM IR in a single abstraction step.
While there are some C++ classes to aid in this conversion, they are fairly limited (mostly helping slightly with indexing) and they do not compose at all. This abstraction step is too big, slowing us down and reducing the quality of the codebase overall.
The core of this proposal is to introduce a more gradual lowering. We propose the following steps:
Partitioning the HLO execution graph into functions.
Lowering to a tensor-level program. Note: this is not part of the initial project; initially we lower directly from HLO to scalars. The tensor-level step will be needed in the future for block-level codegen.
Tiling the program (tensor to tensor).
Tiling the program (tensor to scalars) and emitting scalar MLIR (SCF + tensors).
Gradually lowering to LLVM
Indexing optimization
Function inlining
Math approximation
Lowering tensors to memory instructions
Detailed design
Graph partitioning
Consider a fusion like the one depicted above. Each pad and slice op changes the indexing of its input operand, so operands cannot be reused if they have multiple users. In this particular example, this is the case for add.16889 and add.16885. Since the latter is a transitive user of the former, the former's operands will be emitted at least four times using the approach employed by the current emitters.
There are many ops that trigger this behavior: slice, pad, gather, reshape and more.
This is the major cause of exponential code-size and compile-time issues in the legacy emitter infrastructure. Even though all the shapes here are tiny (so the code would run fast enough), we cannot emit fusions like this, because compilation would be unbearably slow.
The new emitters employ a different strategy: instead of recursively generating code for each node, we first partition the graph into individual functions that can be emitted together without emitting any node more than once. These functions are then connected by function calls.
In our graph partitioning, we place each HLO node into exactly one subgraph (or function). The subgraphs are built starting from the root of the fusion. The rules are simple:
If all users of an instruction are in the same subgraph (= use the same indexing), the instruction is placed in the same subgraph.
Otherwise, the instruction starts a new subgraph.
Some instructions change the indexing (e.g. transpose has a different output than input indexing). We do not currently track the exact indexing in this step; all indexings are assumed to be unique.
Some users can use the same operand more than once, in which case the uses usually have different indexing. For example, gather can use the index operand more than once if there's more than one index dimension, but the uses will be at different coordinates.
In this example, the lower function will contain two calls to the upper function, as sketched below.
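For illustration, a minimal MLIR sketch of such a partition, under purely hypothetical assumptions (the shapes, the exp/add ops, and the use of func.call where the real emitters use xla_gpu.pure_call): the lower subgraph needs its producer at two different indices, so it calls the upper subgraph's function twice instead of duplicating its code.

func.func @subgraph_upper(%p0: tensor<64xf32>, %i: index) -> f32 {
  // The upper subgraph is emitted exactly once as its own function.
  %v = tensor.extract %p0[%i] : tensor<64xf32>
  %e = math.exp %v : f32
  return %e : f32
}

func.func @subgraph_lower(%p0: tensor<64xf32>, %i: index) -> f32 {
  // The lower subgraph uses its producer at two different indices
  // (e.g. because of a pad or slice), so it calls @subgraph_upper twice.
  %c1 = arith.constant 1 : index
  %i1 = arith.addi %i, %c1 : index
  %a = func.call @subgraph_upper(%p0, %i) : (tensor<64xf32>, index) -> f32
  %b = func.call @subgraph_upper(%p0, %i1) : (tensor<64xf32>, index) -> f32
  %sum = arith.addf %a, %b : f32
  return %sum : f32
}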
Tiling to points / initial codegen
Subgraphs map straightforwardly to MLIR functions. We use the following signature:
That is, there is one tensor parameter per fusion parameter (regardless of whether or not the parameter is actually used by the function) and one index parameter per output dimension. The semantics are: compute the value of the root node(s) at the given indices.
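A hypothetical example of such a signature, assuming a fusion with two f32 parameters and a 2D root (the names, shapes, and the tanh op are all illustrative): both fusion parameters appear as tensor arguments even though only one is used, and there is one index argument per output dimension.

// Returns the value of the fusion root at (%y, %x). %p1 is passed even though
// this particular subgraph does not read it.
func.func @fused_computation_root(%p0: tensor<64x128xf32>, %p1: tensor<64x128xf32>,
                                  %y: index, %x: index) -> f32 {
  %v = tensor.extract %p0[%y, %x] : tensor<64x128xf32>
  %r = math.tanh %v : f32
  return %r : f32
}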
The codegen for function bodies is straightforward and will not be described here. Instead we refer to the implementation (look for SubgraphToMlir).
Epilogue fusion
The signatures above do not work as-is for the epilogues of input fusions: fusions where the iteration space is defined not by the root, but by the inputs of the hero (the reduction in this case). We cannot simply invoke a function for the root: in this example it only contains 8192 elements, limiting the achievable parallelism and resulting in poor memory access patterns. Instead, we isolate and inject the reduction - a fancy name for "it becomes a function parameter of the epilogue subgraph". The signature for the negation subgraph therefore becomes:
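A hypothetical sketch of such a signature, assuming the fusion reduces a tensor<8192x4096xf32> parameter to 8192 elements and then negates the result (all names and shapes here are illustrative, not the document's concrete example): the fusion parameter and the output index are passed as before, and the hero's already-computed value at that index is injected as an extra scalar argument.

// Epilogue subgraph for the negate root: %reduced is the reduction's result at
// output index %i, injected by the caller instead of being recomputed here.
func.func @fused_negate_epilogue(%p0: tensor<8192x4096xf32>, %i: index,
                                 %reduced: f32) -> f32 {
  %r = arith.negf %reduced : f32
  return %r : f32
}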
This approach is used for all fusions that customize the iteration order and support epilogues (transpose, reduce, concatenate).
Because it is possible for instructions to be both part of the epilogue and the prologue of a hero (e.g. reduction -> broadcast -> reduction, with both reductions being roots), we duplicate the instructions in the epilogue and emit them twice.
Inlining
The approach above sometimes results in functions that are too small. For example, sequences of transposes, broadcasts, and some reshapes result in functions that merely permute the indexing. The partitioning code never duplicates HLO nodes into multiple subgraphs, since it knows nothing about the implementation of instructions. Instead, we rely on function inlining to later merge these functions into their callers. To avoid code bloat, we only merge functions that are small or are called only once. Inlining is done in post-order, so other functions are inlined into a callee before we decide whether to inline that callee itself.
Indexing optimization
The optimization of index computations is a core component of the emitters, not only because of the direct cost of the computations, but also because clean index computations are crucial for optimizations like load-store vectorization and loop-invariant code motion to work effectively, in particular in the presence of bounds checks.
We have identified the following core optimizations:
Arithmetic simplification. Of particular importance here is the simplification of modulus and division expressions, extracting symbols that represent loop variables from them. For example, a typical index computation for the minor dimension looks like (thread_id * 4 + s0) % 1024, where s0 is a loop variable with domain [0; 4). Here, s0 can be extracted from the modulus, yielding (thread_id * 4) % 1024 + s0, which enables LLVM's load-store vectorizer to optimize the loads (see the affine-map sketch after this list).
Range-based simplification. We emit many bounds checks that can later be optimized away, because the range of indices is known.
Choosing the correct index type. Using i64 indices only when strictly necessary can result in significantly simpler code.
Merging index computations: this is mostly done automatically.
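As an illustration of the modulus simplification above, written in MLIR's affine-map notation (the maps themselves are hypothetical): because thread_id * 4 is a multiple of 4 and s0 < 4, the symbol can be pulled out of the modulus, making the stride-1 access explicit.

// Before simplification: s0 (range [0, 4)) is trapped inside the mod.
#before = affine_map<(thread_id)[s0] -> ((thread_id * 4 + s0) mod 1024)>
// After extracting s0: consecutive s0 values now provably access consecutive
// elements, which is what the load-store vectorizer needs to see.
#after = affine_map<(thread_id)[s0] -> ((thread_id * 4) mod 1024 + s0)>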
Tensor lowerings
To keep as many optimization opportunities open as possible, we keep the program functional for as long as possible: it uses tensors with value semantics until they are lowered to LLVM. We only use a handful of instructions:
tensor.insert and tensor.extract (and our custom versions of them: xla_gpu.predicated_insert and xla_gpu.predicated_extract).
xla_gpu.allocate_shared. This instruction allocates a shared memory tile, which semantically is a distributed tensor.
xla_gpu.sync_threads. This instruction lowers to a barrier and synchronization point. Logically, it also synchronizes all instructions that happen on distributed tensors.
xla_gpu.atomic_rmw. This instruction lowers to a CAS loop, or to a simpler atomic operation if one is supported.
We lower tensor.insert and tensor.extract directly to llvm.store and llvm.load ops. allocate_shared lowers to a global buffer in shared memory.
Emitters
This section gives a high-level overview of each of the new emitters. We only describe in detail the parts that differ from the legacy emitters in interesting ways. Many components are shared between the emitters; these are only described once.
Loop
The new loop emitter (like all new emitters) uses the existing thread-to-output indexing maps (ComputeThreadIdToOutputIndexing) for codegen. The core of the emitter looks like this:
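A minimal sketch of that core, under illustrative assumptions (a 1D output of 1024 elements, four elements per thread, a simple linear indexing map, and a hypothetical subgraph function @loop_fusion_root); the real emitter derives the loop bounds, indexing expressions, and bounds checks from the indexing map, and xla_gpu.pure_call is written as in the transpose example further below.

func.func @loop_fusion(%p0: tensor<1024xf32>, %output: tensor<1024xf32>)
    -> tensor<1024xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %thread_id = gpu.thread_id x
  // The output tensor is threaded through the loop as an iter_arg
  // (destination-passing style); each iteration inserts one element.
  %result = scf.for %i = %c0 to %c4 step %c1 iter_args(%acc = %output)
      -> (tensor<1024xf32>) {
    // Thread-to-output indexing map, here simply thread_id * 4 + i.
    %base = arith.muli %thread_id, %c4 : index
    %out_index = arith.addi %base, %i : index
    // Compute the root's value at %out_index via the subgraph function.
    %value = xla_gpu.pure_call @loop_fusion_root(%p0, %out_index)
    %updated = tensor.insert %value into %acc[%out_index] : tensor<1024xf32>
    scf.yield %updated : tensor<1024xf32>
  }
  return %result : tensor<1024xf32>
}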
The loop is generated from the output indexing map. In general, it also includes a bounds check for the output. Output tensors are passed to the function in destination-passing style and threaded through the loops as iter_args, with each iteration updating one particular element of the tensor. This is not a limitation of the lowerings: there can be multiple insertions per iteration and multiple output tensors; the loop emitter simply does not make use of this.
Vectorization works similarly to the code shown above: an inner loop that accesses consecutive elements of the input/output is generated. Thanks to our indexing simplifications, nothing else is needed: the loops will automatically be vectorized.
All subgraph functions are generated automatically using the mechanism described above.
Transpose
The transpose emitter is similar to the loop emitter. The only difference is that there is a shared memory step between the read and write.
In the read step, consecutive threads process physically consecutive elements of the input and write the results to shared memory. In the write step, threads evaluate the epilogue and write physically consecutive elements of the output. The core of the emitter looks like this:
func.func @transpose_fusion(%input: tensor<128x64xf32>, %output: tensor<64x128xf32>)
    -> tensor<64x128xf32> {
  %thread_id = gpu.thread_id x
  %shared_tile = xla_gpu.allocate_shared : tensor<32x32xf32>
  ...
  %filled_shared = scf.for %i = %c0 to %c8 step %c1 iter_args(%arg0 = %shared_tile)
      -> tensor<32x32xf32> {
    // Compute the indexing and the input element, as in the loop emitter.
    %updated_tile = tensor.insert %result into %arg0[%shared_y, %shared_x]
        : tensor<32x32xf32>
    scf.yield %updated_tile : tensor<32x32xf32>
  }
  // Synchronize all threads; conceptually this merges the writes that happened
  // before into a single tensor. It simply lowers to a barrier and sync; no
  // data is transferred.
  %synced = xla_gpu.sync_threads %filled_shared : tensor<32x32xf32>
  %written = scf.for %i = %c0 to %c8 step %c1 iter_args(%arg0 = %output)
      -> tensor<64x128xf32> {
    // Compute the indexing, as in the loop emitter.
    %shared_element = tensor.extract %synced[%shared_x, %shared_y] : tensor<32x32xf32>
    // Evaluate the epilogue, as described under epilogue fusion.
    %result = xla_gpu.pure_call @epilogue(%input, %output_y, %output_x, %shared_element)
    %updated_output = tensor.insert %result into %arg0[%out_y, %out_x]
        : tensor<64x128xf32>
    scf.yield %updated_output : tensor<64x128xf32>
  }
  return %written : tensor<64x128xf32>
}
This is very close to the legacy transpose emitter, but it supports vectorizing loads and stores. Later, we will move towards direct tiled codegen for this.
Reduce
There are three different reduce emitters: row, multi-row and column. They differ in which threads combine elements and how shared memory is used.
The current algorithm works as follows:
Each thread loops over a subshape of the input and reduces elements within the subshape.
Row-reduction only: the elements in the warp are reduced to one (or in the case of multi-row reductions, n)
The result is written to shared memory (row, column reduction) or global memory, after applying the epilogue (multi-row reduction)
Partial results are loaded from shared memory (transposed in the case of column reductions)
Elements in the warp are again reduced to one
The epilogue is evaluated
The result is written to global memory, using atomics if more than one block works on an element of the output.
This way, reads are always coalesced; writes may be coalesced in the case of multi-row reductions. So in summary, the new reduction emitters work as follows:
Row reduction:
Each thread processes n elements of the input and produces one output element. The result is kept in registers.
A warp shuffle reduces the 32 per-thread values to one (see the warp-shuffle sketch after this list). Lane 0 writes the result to shared memory.
The first warp reads up to 32 elements from shared memory. Another warp shuffle reduces these elements to one. Lane 0 writes the result to global memory.
Multi-row reduction:
Like above, but the first warp shuffle reduces 32 elements to 2, 4, 8, or 16. The result is written to global memory instead of shared memory (since there is no second shuffle step).
Column reduction:
Each thread processes n elements of the input and produces one output element. The result is written to shared memory, but transposed.
A warp shuffle reduces up to 32 elements from shared memory. Lane 0 writes the result to global memory.
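For illustration, a sketch of the warp-level step referenced above for an f32 add-reduction (this is not the emitters' actual output; it only shows the shuffle pattern): butterfly shuffles with offsets 16, 8, 4, 2, and 1 leave the warp-wide sum in lane 0. For multi-row reductions, the same pattern simply stops after fewer shuffle steps.

// Reduce the 32 per-lane partial sums of a warp into one value (in lane 0).
func.func @warp_reduce_add(%partial: f32) -> f32 {
  %width = arith.constant 32 : i32
  %o16 = arith.constant 16 : i32
  %o8 = arith.constant 8 : i32
  %o4 = arith.constant 4 : i32
  %o2 = arith.constant 2 : i32
  %o1 = arith.constant 1 : i32
  %v16, %m16 = gpu.shuffle xor %partial, %o16, %width : f32
  %r16 = arith.addf %partial, %v16 : f32
  %v8, %m8 = gpu.shuffle xor %r16, %o8, %width : f32
  %r8 = arith.addf %r16, %v8 : f32
  %v4, %m4 = gpu.shuffle xor %r8, %o4, %width : f32
  %r4 = arith.addf %r8, %v4 : f32
  %v2, %m2 = gpu.shuffle xor %r4, %o2, %width : f32
  %r2 = arith.addf %r4, %v2 : f32
  %v1, %m1 = gpu.shuffle xor %r2, %o1, %width : f32
  %r1 = arith.addf %r2, %v1 : f32
  return %r1 : f32
}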
There are some limitations in the current codegen: shared memory is used even when a row reduction is handled by a single warp, and there is no multi-column reduction (analogous to the multi-row reduction). The runtime impact of these is small but measurable (1-2%). In the new emitters, we fixed the first issue, since that is much more straightforward thanks to the more high-level lowering.
Scatter
The scatter emitter maps every element of the updates to a single thread. For every element we compute the corresponding scatter index and then combine the update element with the element of the output. If the indices are not unique, then we emit an xla::gpu::AtomicRMWOp, otherwise a non-concurrent update is feasible.
Here is an example of the actual IR that is generated by the scatter emitter.
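As a rough illustration (not the emitter's actual output), a simplified scatter for a 1D f32 operand with unique indices and an add combiner might look like the sketch below; with non-unique indices, the read-modify-write would be an xla_gpu.atomic_rmw instead. All names and shapes are hypothetical.

// One thread handles update element %update_id: read its scatter index,
// bounds-check it, and combine the update into the output functionally.
func.func @scatter_update(%operand: tensor<1024xf32>, %indices: tensor<64xi32>,
                          %updates: tensor<64xf32>, %update_id: index)
    -> tensor<1024xf32> {
  %c0 = arith.constant 0 : index
  %c1024 = arith.constant 1024 : index
  %idx_i32 = tensor.extract %indices[%update_id] : tensor<64xi32>
  %idx = arith.index_cast %idx_i32 : i32 to index
  %ge = arith.cmpi sge, %idx, %c0 : index
  %lt = arith.cmpi slt, %idx, %c1024 : index
  %in_bounds = arith.andi %ge, %lt : i1
  %result = scf.if %in_bounds -> (tensor<1024xf32>) {
    %update = tensor.extract %updates[%update_id] : tensor<64xf32>
    %old = tensor.extract %operand[%idx] : tensor<1024xf32>
    %new = arith.addf %old, %update : f32
    %updated = tensor.insert %new into %operand[%idx] : tensor<1024xf32>
    scf.yield %updated : tensor<1024xf32>
  } else {
    scf.yield %operand : tensor<1024xf32>
  }
  return %result : tensor<1024xf32>
}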
Others
Concatenate, DUS, and input slices are all manifestations of the same idea: they basically use the loop emitter, but with a different iteration space.
Future evolution
The concepts described above straightforwardly extend to tensor lowerings and tiling.
Lowering to tensors
Lowering to tensors is a necessary step for the further simplification of the transpose and reduce emitters and for integration with block-level codegen backends. The graph partitioning can be kept as-is; instead of emitting scalar code, we emit MHLO. The elemental MLIR emitter will evolve into logic that specializes a tensor function to a scalar function.
Tiling
Once we have a tensor program, tiling is just a matter of specializing functions for particular tensor sizes and propagating indexing maps. Instead of indices we can outline functions with offsets-sizes-strides arguments. Some operations cannot be tiled arbitrarily with reasonable indexing maps (e.g. certain reshapes). When the fusion clusters are formed, we can compute the constraints on the slice parameters.
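As a rough sketch of that direction (everything here is hypothetical: the names, the shapes, and the choice to bake tile sizes and strides into the types), a tiled subgraph function could take tile offsets instead of element indices and return a whole tile:

func.func @tiled_subgraph(%p0: tensor<1024x1024xf32>, %off_y: index, %off_x: index)
    -> tensor<32x64xf32> {
  // Extract the 32x64 tile at the given offsets (unit strides) and apply the
  // subgraph's computation to the whole tile.
  %tile = tensor.extract_slice %p0[%off_y, %off_x] [32, 64] [1, 1]
      : tensor<1024x1024xf32> to tensor<32x64xf32>
  %exp = math.exp %tile : tensor<32x64xf32>
  return %exp : tensor<32x64xf32>
}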
Dependencies
This project takes on no new dependencies. We already depend on MLIR via Triton emitters. The total code size will be less than before.
CPU
Once the GPU emitters are migrated, XLA:CPU will be the only user of elemental_ir_emitter. While elemental_ir_emitter is not a significant maintenance burden (less than one non-rollback, non-cleanup CL per month over the last year), it might be a good idea to migrate CPU to the new infrastructure that we build here. We expect this to just work, but no significant time has been spent on it so far, since it is outside the team's responsibility. However, we keep this constraint in mind and avoid making any decisions that would make it harder to adopt the new emitters on CPU.
Quality attributes
Runtime performance
The goal of this project is performance parity. We expect some improvements nevertheless, because of:
More efficient index computations, better CSE
Better vectorization (through optimizations, but also explicit vectorization in the reduce and transpose emitters)
More efficient reductions by avoiding shared memory
Speculative: More efficient launch grids
Compile time performance
The current fusion passes carefully avoid pathological compile time cases. Therefore, there will be no compile time improvement right off the bat. Nevertheless, there are opportunities for compile time improvements in the future:
Simplified analyses in fusion: no need to estimate codegen cost anymore - runtime cost is the limiting factor in all cases.
Shared indexing analyses for fusion and codegen.
Testability
The new codegen is more testable than the old one, since the initial lowering is more high-level. We use standard FileCheck tests.
Launch plans
We plan a gradual launch (emitter by emitter), starting with simple fusions (loop emitter), followed by more complex ones (tiling emitters). For each phase, we will test the changes against a large set of Google-internal benchmarks. We will also give clients sufficient time between these stages to test their workloads.
Footnotes
[1] In principle, it works on the CPU as well.