# RFC: Switching data-tiling to late materialization path

HackMD version: https://hackmd.io/H8zHCb7NSJOtQlteSiy9lg

## Background
Data-tiling is enabled by default in IREE. Only the CPU and VMVX backends implement data-tiling; the other backends do not. Today, when there is a single executable target, the encodings are materialized much earlier than dispatch formation: the MaterializeHomogeneousEncodings pass transforms the virtual encodings into physical ops (e.g., pack/unpack/mmt4d) during the GlobalOptimization phase (see the sketch after the list below). If the executable target does not implement data-tiling, the pass drops the encodings. Early materialization enables several features in IREE:
- Improve the input program with pack/unpack foldings.
- Enable data layout propagation.
- Enable const-evaluation for encoding ops (i.e., pack ops).
- High-quality code generation for CPU backends.
- The host can allocate the storage buffers based on the requirements of the target devices.
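A minimal sketch of both stages, assuming a CPU target (encoding attributes, tile sizes, and some types are elided or illustrative, not exact IR):

```mlir
// Before materialization: the matmul operands carry virtual encodings.
%lhs = iree_encoding.set_encoding %a : tensor<?x?xf32> -> tensor<?x?xf32, #lhs_enc>
%rhs = iree_encoding.set_encoding %b : tensor<?x?xf32> -> tensor<?x?xf32, #rhs_enc>
%acc = iree_encoding.set_encoding %c : tensor<?x?xf32> -> tensor<?x?xf32, #acc_enc>
%mm = linalg.matmul ins(%lhs, %rhs : ...) outs(%acc : ...)
%res = iree_encoding.unset_encoding %mm : tensor<?x?xf32, #acc_enc> -> tensor<?x?xf32>

// After MaterializeHomogeneousEncodings (roughly): physical ops with
// target-chosen tile sizes, e.g. 8x1 / 8x8 tiles on some CPUs.
%plhs = tensor.pack %a inner_dims_pos = [0, 1] inner_tiles = [8, 1]
    into %lhs_dest : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
%prhs = tensor.pack %b outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [8, 1]
    into %rhs_dest : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
%pacc = tensor.pack %c inner_dims_pos = [0, 1] inner_tiles = [8, 8]
    into %acc_dest : tensor<?x?xf32> -> tensor<?x?x8x8xf32>
%pmm = linalg.mmt4d ins(%plhs, %prhs : ...) outs(%pacc : ...)
%res = tensor.unpack %pmm inner_dims_pos = [0, 1] inner_tiles = [8, 8]
    into %out : tensor<?x?x8x8xf32> -> tensor<?x?xf32>
```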
However, there are at least two issues, IMO:
- The program differs depending on the executable target, i.e., we could end up with different dispatch formation for different backends.
- It only works for a single device, i.e., it does not fit the multi-device project; we would not be able to do data-tiling with multiple devices.
Thus, I think it is time for data-tiling to evolve.
## Proposal
The proposal is to switch data-tiling to a late materialization path (i.e., materialize the encodings after dispatch formation), while also keeping the early materialization path.
With my recent encoding specialization work, we are able to address the pain point of storage buffer allocation, since the specialization can propagate the storage size requests from device to host (i.e., the Stream dialect). See the writeup for how it works. It also makes multi-device work; see the writeup for how that works.
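To illustrate the allocation piece (a minimal sketch; the #encoding attribute and op details are simplified), the host-side size query keeps the encoding so that the device-resolved layout determines the storage size:

```mlir
// The host queries the size of an encoded tensor; encoding specialization
// later resolves #encoding to the target's physical layout, so the computed
// size matches what the device actually needs.
%sz = stream.tensor.sizeof tensor<?x?xf32, #encoding>{%d0, %d1} : index
%buf = stream.resource.alloc uninitialized : !stream.resource<*>{%sz}
```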
On the late materialization path, we still get good code quality for GEMMs. The performance issues are:
(a.) Not able to const-evaluate the encoding ops.
(b.) Not able to fold pack/unpack ops. This hurts performance because some pack/unpack ops are just reshapes; on the early materialization path, we do not form such cases into their own dispatches (see the sketch after this list).
(c.) Not able to do data layout propagation.
Furthermore, (d.) it impacts all the backends that do not use data-tiling if we cannot cancel the encodings properly, because some set/unset_encoding ops could be formed into their own dispatches, which results in additional kernel launches and redundant copies.
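For (b.), a concrete instance of a reshape-like pack (shapes are my own example; syntax abbreviated): when every outer dimension of the packed tensor is 1, the pack only introduces unit dimensions and moves no data:

```mlir
// The whole tensor fits in one 8x4 tile, so the outer dims are 1x1...
%p = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [8, 4]
    into %dst : tensor<8x4xf32> -> tensor<1x1x8x4xf32>
// ...making the pack equivalent to a metadata-only reshape:
%e = tensor.expand_shape %src [[0, 1, 2], [3]]
    : tensor<8x4xf32> into tensor<1x1x8x4xf32>
```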
(a.) is fixable, and it could become a project of its own because we might need to reorganize the pipeline. Technically, the encodings can be resolved to physical operations, like what we've done in encoding specialization.
(b.) and (c.) fall into the same category: how we propagate the encodings without materializing them. I plan to start the work once encoding specialization is ready to be the default.
(d.) is a little tricky. If the encoding ops are fused with other operations, they become no-ops. If they live in their own dispatches, we need to teach IREE to recognize the case in the Flow phase and turn it into flow ops; encoding specialization will then resolve the layout later in the Stream phase. We might be able to fold it away if it becomes something like `stream.copy from (%src: tensor<?x?xf32>) to (%dest: tensor<?x?xf32>)`; ideally, the encoding is dropped because the specialization can query the resolved layout. I'll brainstorm with @benvanik (and also welcome other contributors!) about this case.
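As a rough sketch of where that fold could land (my illustration, not a settled design), after specialization resolves both sides to the same layout, the encode-only dispatch degenerates into a whole-resource transfer:

```mlir
// Hypothetical post-specialization form: source and destination share the
// resolved layout, so the transfer moves bytes unchanged and can be elided.
%dst = stream.async.transfer %src : !stream.resource<*>{%size} -> !stream.resource<*>{%size}
```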
While there are tough issues here, I propose adding early materialization support to the preprocessing phase, so users can still use that path while we develop the next evolution of data-tiling.
The hard requirements for the switch are (d.) and encoding specialization. After the flip, I will look at encoding propagation ((b.) + (c.)), and then const-evaluation (a.).
This will unblock data-tiling + multi-device while keeping the current path available in the preprocessing phase.
## Execution Plan

In the near future:

- Add early materialization support to the preprocessing phase.
- Fix (d.) and land encoding specialization, the hard requirements for the switch.
- Flip data-tiling to the late materialization path by default.

Larger projects afterward:

- Encoding propagation ((b.) + (c.)).
- Const-evaluation of encoding ops (a.).