I have a question regarding how Timeloop calculates the overall execution latency of a layer. In the tutorial videos, it is mentioned that the latency is calculated based on a pipelined concept, though this is not very clear to me.
I would appreciate it if you could explain the latency estimation process for a simple architecture (DRAM + Global Buffer + (1 PE + RF)).
So far, I have noticed that there are latencies for data movement between levels of the memory hierarchy and latencies for computing MACs in each cycle, but I have no clear idea how the overall latency is estimated from these different values, or why this estimate is reasonable.
Thanks.
Timeloop first computes the amount of data (or compute) that must move through each level of the hierarchy for that specific mapping.
As a cartoon example, let's say that for a given mapping we need to move 100 bytes from DRAM->GB, 1000 bytes from GB->RF and perform 10000 computations.
Now let's say our cartoon architecture's DRAM->GB bandwidth is 5 bytes per clock, GB->RF is 20 bytes per clock, and we have 1000 parallel MACs in the PE.
Timeloop will compute the DRAM->GB link as needing 100/5 = 20 clock cycles total to move all the data, 1000/20 = 50 clock cycles for the GB->RF link, and 10000/1000 = 10 clock cycles for the MACs to do their work.
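These stages are modeled as running concurrently in a pipeline, so the slowest one sets the overall execution time. Here the GB->RF link is the bottleneck, so the overall execution time will be reported as 50 clock cycles.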
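The cartoon arithmetic above can be sketched in a few lines of Python. This is only an illustration of the bottleneck calculation, not Timeloop's actual code: the function name `estimate_cycles` and the dictionary layout are hypothetical, and the numbers are the ones from the example.

```python
def estimate_cycles(work, throughput):
    """Return per-stage cycles, the bottleneck stage, and the overall cycle count.

    work:       amount each stage must process (bytes moved or MACs computed)
    throughput: amount each stage can process per clock cycle
    Assumes all stages overlap perfectly, so the slowest stage sets the total.
    """
    per_stage = {name: work[name] / throughput[name] for name in work}
    bottleneck = max(per_stage, key=per_stage.get)
    return per_stage, bottleneck, per_stage[bottleneck]


# Numbers from the cartoon example (hypothetical mapping and architecture).
work = {"DRAM->GB": 100, "GB->RF": 1000, "MAC": 10000}
throughput = {"DRAM->GB": 5, "GB->RF": 20, "MAC": 1000}

per_stage, bottleneck, total = estimate_cycles(work, throughput)
print(per_stage)          # cycles needed by each stage on its own
print(bottleneck, total)  # slowest stage and the reported latency
```

Taking the maximum rather than the sum is what reflects the pipelined assumption: if the stages ran strictly one after another, the total would instead be 20 + 50 + 10 = 80 cycles.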