# RFC: Switching data-tiling to late materialization path

HackMD version: https://hackmd.io/H8zHCb7NSJOtQlteSiy9lg

## Background
Data-tiling is enabled by default in IREE. Only the CPU and VMVX backends implement data-tiling; the other backends do not. Today, when there is a single executable target, the encodings are materialized much earlier than dispatch formation: the MaterializeHomogeneousEncodings pass transforms the virtual encodings into physical ops (e.g., pack/unpack/mmt4d) during the GlobalOptimization phase (see the sketch after the list below). If the executable target does not implement data-tiling, the pass drops the encodings. Early materialization enables several features in IREE:
- Improve the input program with pack/unpack foldings.
- Enable data layout propagation.
- Enable const-evaluation for encoding ops (i.e., pack ops).
- High-quality code generation for CPU backends.
- The host can allocate the storage buffers based on the requirements of the target devices.
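A minimal sketch of both stages, assuming a CPU target (encoding attributes, tile sizes, and some types are elided or illustrative, not exact IR):

```mlir
// Before materialization: the matmul operands carry virtual encodings.
%lhs = iree_encoding.set_encoding %a : tensor<?x?xf32> -> tensor<?x?xf32, #lhs_enc>
%rhs = iree_encoding.set_encoding %b : tensor<?x?xf32> -> tensor<?x?xf32, #rhs_enc>
%acc = iree_encoding.set_encoding %c : tensor<?x?xf32> -> tensor<?x?xf32, #acc_enc>
%mm = linalg.matmul ins(%lhs, %rhs : ...) outs(%acc : ...)
%res = iree_encoding.unset_encoding %mm : tensor<?x?xf32, #acc_enc> -> tensor<?x?xf32>

// After MaterializeHomogeneousEncodings (roughly): physical ops with
// target-chosen tile sizes, e.g. 8x1 / 8x8 tiles on some CPUs.
%plhs = tensor.pack %a inner_dims_pos = [0, 1] inner_tiles = [8, 1]
    into %lhs_dest : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
%prhs = tensor.pack %b outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [8, 1]
    into %rhs_dest : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
%pacc = tensor.pack %c inner_dims_pos = [0, 1] inner_tiles = [8, 8]
    into %acc_dest : tensor<?x?xf32> -> tensor<?x?x8x8xf32>
%pmm = linalg.mmt4d ins(%plhs, %prhs : ...) outs(%pacc : ...)
%res = tensor.unpack %pmm inner_dims_pos = [0, 1] inner_tiles = [8, 8]
    into %out : tensor<?x?x8x8xf32> -> tensor<?x?xf32>
```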
However, there are at least two issues, IMO:
- The program differs depending on the executable target, i.e., we could end up with different dispatch formation for different backends.
- It only works for a single device, i.e., it does not fit the multi-device project; we would not be able to do data-tiling with multiple devices.
Thus, I think it is time for data-tiling to evolve.
## Proposal
The proposal is to switch data-tiling to a late materialization path (i.e., materialize the encodings after dispatch formation), while also keeping the early materialization path.
With my recent encoding specialization work, we are able to address the pain point of storage buffer allocation, since the specialization can propagate the storage size requests from device to host (i.e., the Stream dialect). See the writeup for how it works. It also makes multi-device work; see the writeup for how that works.
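To illustrate the allocation piece (a minimal sketch; the #encoding attribute and op details are simplified), the host-side size query keeps the encoding so that the device-resolved layout determines the storage size:

```mlir
// The host queries the size of an encoded tensor; encoding specialization
// later resolves #encoding to the target's physical layout, so the computed
// size matches what the device actually needs.
%sz = stream.tensor.sizeof tensor<?x?xf32, #encoding>{%d0, %d1} : index
%buf = stream.resource.alloc uninitialized : !stream.resource<*>{%sz}
```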
On the late materialization path, we still get good code quality for GEMMs. The performance issues are:
(a.) Not able to const-evaluate the encoding ops.
(b.) Not able to fold pack/unpack ops. This hurts performance because some pack/unpack ops are just reshapes; on the early materialization path, we do not form such cases into their own dispatches (see the sketch after this list).
(c.) Not able to do data layout propagation.
Furthermore, (d.) it impacts all the backends that do not use data-tiling if we cannot cancel the encodings properly, because some set/unset_encoding ops could be formed into their own dispatches, which results in additional kernel launches and redundant copies.
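For (b.), a concrete instance of a reshape-like pack (shapes are my own example; syntax abbreviated): when every outer dimension of the packed tensor is 1, the pack only introduces unit dimensions and moves no data:

```mlir
// The whole tensor fits in one 8x4 tile, so the outer dims are 1x1...
%p = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [8, 4]
    into %dst : tensor<8x4xf32> -> tensor<1x1x8x4xf32>
// ...making the pack equivalent to a metadata-only reshape:
%e = tensor.expand_shape %src [[0, 1, 2], [3]]
    : tensor<8x4xf32> into tensor<1x1x8x4xf32>
```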
(a.) is fixable, and it could become a project of its own because we might need to reorganize the pipeline. Technically, the encodings can be resolved to physical operations, like what we've done in encoding specialization.
(b.) and (c.) fall into the same category: how we propagate the encodings without materializing them. I plan to start the work once encoding specialization is ready to be the default.
(d.) is a little tricky. If the encoding ops are fused with other operations, they become no-ops. If they live in their own dispatches, we need to teach IREE to recognize the case in the Flow phase and turn it into flow ops; encoding specialization will then resolve the layout later in the Stream phase. We might be able to fold it away if it becomes something like `stream.copy from (%src: tensor<?x?xf32>) to (%dest: tensor<?x?xf32>)`; ideally, the encoding is dropped because the specialization can query the resolved layout. I'll brainstorm with @benvanik (and also welcome other contributors!) about this case.
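As a rough sketch of where that fold could land (my illustration, not a settled design), after specialization resolves both sides to the same layout, the encode-only dispatch degenerates into a whole-resource transfer:

```mlir
// Hypothetical post-specialization form: source and destination share the
// resolved layout, so the transfer moves bytes unchanged and can be elided.
%dst = stream.async.transfer %src : !stream.resource<*>{%size} -> !stream.resource<*>{%size}
```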
While there are tough issues here, I propose adding early materialization support to the preprocessing phase, so users can still use that path while we develop the next evolution of data-tiling.
The hard requirements for the switch are (d.) and encoding specialization. After the flip, I will look at encoding propagation ((b.) + (c.)), and then const-evaluation (a.).
This will unblock data-tiling + multi-device while keeping the current path available in the preprocessing phase.
## Execution Plan

In the near future:

- Add early materialization support to the preprocessing phase.
- Fix (d.) and land encoding specialization, the hard requirements for the switch.
- Flip data-tiling to the late materialization path by default.

Larger projects afterward:

- Encoding propagation ((b.) + (c.)).
- Const-evaluation of encoding ops (a.).