Following bug #60, we are moving to power-of-2 scaling factors. For most operations, this is a very simple change (powers of 2 being stable under `mul`, `max`, `min`, ...).
Nevertheless, `scalify` rules for other operations are not directly compatible with powers of 2. Typically, `add` / `sub` / `dot_general` can lead to non-power-of-2 scaling factors. The simplest way to keep a power-of-2 scaling is to round down.
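
As a rough illustration of rounding down, here is a minimal sketch in plain Python. The `add_scale` rule (`sqrt(a_scale**2 + b_scale**2)`) is an assumption made for the example, not necessarily the exact `scalify` rule, and both helper names are hypothetical.

```python
import math


def pow2_round_down(scale: float) -> float:
    """Largest power of 2 that is <= scale (scale must be finite and > 0)."""
    # frexp returns (m, e) with scale == m * 2**e and 0.5 <= m < 1,
    # so 2**(e - 1) is the largest power of 2 not exceeding scale.
    _, exponent = math.frexp(scale)
    return math.ldexp(1.0, exponent - 1)


def add_scale(a_scale: float, b_scale: float) -> float:
    """Assumed output scale of `add`: sqrt(a_scale**2 + b_scale**2)."""
    return math.hypot(a_scale, b_scale)


raw = add_scale(2.0, 2.0)       # 2 * sqrt(2), not a power of 2
rounded = pow2_round_down(raw)  # rounds back down to 2.0
```

With two inputs of equal scale, the rounded output scale is the same as the input scale, even though the unrounded scale grew by a factor of `sqrt(2)`.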
In the case of `add`, rounding down would in some situations leave the scaling unchanged, potentially creeping towards overflow if `add` ops are serialized (e.g. gradient accumulation). To prevent that, we should experiment with stochastic rounding (with the typical example of gradient accumulation in mind).
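
To make the concern concrete, here is a small sketch comparing rounding down with one possible stochastic rounding scheme on a toy accumulation loop. Everything here is an assumption for illustration: the `sqrt(a**2 + b**2)` scale rule for `add`, the choice of rounding up with probability linear in the scale (so the expected rounded scale equals the unrounded one), and all function names.

```python
import math
import random


def pow2_round_down(scale: float) -> float:
    """Largest power of 2 that is <= scale (scale must be finite and > 0)."""
    _, exponent = math.frexp(scale)
    return math.ldexp(1.0, exponent - 1)


def pow2_round_stochastic(scale: float, rng: random.Random) -> float:
    """Round to the power of 2 below or above, unbiased in expectation."""
    lower = pow2_round_down(scale)
    # Assumed scheme: probability of rounding up grows linearly from 0
    # (scale == lower) towards 1 (scale approaching 2 * lower).
    p_up = (scale - lower) / lower
    return 2.0 * lower if rng.random() < p_up else lower


def accumulate_scale(num_steps: int, round_fn) -> float:
    """Scale of an accumulator after repeatedly adding unit-scale terms."""
    acc_scale = 1.0
    for _ in range(num_steps):
        # Assumed `add` rule: sqrt(acc_scale**2 + 1**2), then rounding.
        acc_scale = round_fn(math.hypot(acc_scale, 1.0))
    return acc_scale


rng = random.Random(0)
stuck = accumulate_scale(1024, pow2_round_down)  # stays at 1.0
tracked = accumulate_scale(1024, lambda s: pow2_round_stochastic(s, rng))
```

Under these assumptions, rounding down never moves the accumulator scale off 1.0 (each step gives `sqrt(2)`, which rounds back to 1), while the unrounded scale after 1024 steps would be about 32. The stochastic variant lets the scale grow, which is the behaviour worth checking experimentally.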