-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Hardware][Xilinx] explicitly specify acc dep distance to avoid hidden pitfall #8
base: main
Are you sure you want to change the base?
Conversation
@tqchen @tmoreau89 Could you kindly help review this PR? Thanks. |
Thanks @zhanghaohit, for catching this error. I agree with the fact that the dependence should not be hardcoded. However in order to not to add too many parameters in the config file, can we derive the value from the VTA target (i.e. FPGA type) in Also from what I understand is that the dependence distance is generally a property of the memory part that the FPGA is using. For instance that distance will be different if one uses BRAM vs. Ultra-RAM? And on a different FPGA family this number will change. Correct? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments above
I think the dependence distance here is defined by the user, to tell the compiler (e.g., XILINX vivado) that, to what extent, the software layer (i.e., TVM code generator) can guarantee there is no acc memory conflict. Say, if the distance = 4, meaning that the software will guarantee there is no conflict within 4 iterations. So the compiler can generate hardware code with smaller Thus, I also add the check in VTA runtime to do the VerifyDep. |
Thanks @tmoreau89 for the advice. We also thought of putting the value into the pkg_config. However, later we found out the dependence distance is not tightly related to the device family/type. In my opinion, the compiler will decide the II based on multiple factors. But eventually, it is based on the number of cycles on the datapath from "Read Mem" -> "Perform Ops" -> "Write Back", as we need to avoid RAW data hazard. In the "Read Mem" and "Write Back" stage, it is definitely related to the properties of the memory (eg. device family, uram vs bram, etc). But it is also related to the H/W circuit of accessing the data. For example, multiplexers as there are multiple accesses. The compile will judge based on the complexity of the overall circuit, and it could add registers (increase II) if the desired frequency target could not be satisfied. In "Perform Ops" stage, the cycles needed is mainly related to the op itself. For example, a 32-bit multiplication may require more cycles than a 8-bit multiplication. |
Thank you for the explanation. Do you mind breaking down what happens in the 32x32 configuration that causes correctness to fail (since the dependence analysis waive is unbounded)? I agree that there should be a constant set for hardware and software to agree on, but I'm not convinced that it should be set by a user, and instead be derived automatically by the hardware target we choose in pkg_config. This is to ensure that we can achieve an iteration interval of The example I can use is that if we are say we want to accumulate SRAM at address A and B for 3 consecutive cycles. We could do it (1) by updating each address consecutively, or (2) by interlacing the updates.
Now in the case where these back to back writes are detected in the runtime, we would have an invalid schedule. So assuming the TVM compiler produces a valid schedule, we would end up with interlaced writes to A and B:
So therefore the point I'm hoping to make is that this 2 cycle wait time in enforced by the SRAM, and should not be exposed to the user to be set as a parameter in VTA_config. |
@remotego I just read your comment that must have been posted as I wrote my reply. Your explanation makes sense, ultimately the critical path in the accumulation is READ -> ADD to GEMM result -> WRITE, which can be more than 2 cycles if we account for the addition logic, and various multiplexers etc. |
I assume that if you have an FPGA that needs to be clocked at a high frequency, you'll likely end up with more cycles. |
So to conclude, I agree with you that we may want this parameter to be larger than 2 cycles if the hardware pipeline becomes more complicated. So the big question here is: how can we ensure that it is set correctly moving forward. For instance a DISTANCE=2 works for 16x16, but might break for 32x32. Is there a way that the user as she performs design space exploration does not have to think about updating this parameter? Or at least receives some guidance to update it to the right value? |
Perhaps one way to look at this is to start with DISTANCE=3 by default. And if the II>1 for the GEMM, we issue a warning telling the user to increase the distance to 4. The process can repeat until the II of 1 is achieved, and the runtime will be informed of the actual distance so proper checks can be inserted. |
I'll think some more about this problem overnight, thanks for bringing attention to this issue @zhanghaohit @remotego |
I love this idea. Our ultimate goal is to achieve ii = 1, thus it would be great if we could achieve it by some intelligent scripting... only problem is that we need to introduce a way to feedback the results to S/W side. |
Thanks @tmoreau89 for the advice. For this issue, do you suggest to do it manually? For example, we have a config (e.g., |
I would agree with
, and avoid requiring a user to specify the |
@zhanghaohit I think that the warning approach is a reasonable compromise between user-defined flexibility and making sure that they set the value correctly. To echo @liangfu , it would be ideal to derive the value in pkg_config based on simple rules of FPGA target, and matrix multiplication size. We can tweak those rules if say we discover that the dependence distance needs to be increased. If you agree I suggest we go ahead with these changes. |
This PR add support for cross compiling dylib to x86 macos Co-authored-by: tqchen <[email protected]>
Bug
Unrestrained #pragma HLS DEPENDENCE in VTA core waiving all loop-carried dependencies, causing wrong results in 32x32 configuration
Solution
make acc dependence configurable, and also do corresponding checking in VTA runtime (see here).