**Describe the bug**
This is purely about the documentation.
In the documentation about group normalization, it is stated:
> Relation to Layer Normalization: If the number of groups is set to 1, then this operation becomes identical to Layer Normalization.
However, that is not true.
Assume an input tensor x of shape [B,T,F] (batch, time, feature dim). The time axis could also be H/W instead, and the feature dim can also be the channels.
In layer normalization, the mean you calculate is:
```python
mean = reduce_mean(x, axis=-1, keepdims=True)  # shape [B, T, 1]
```
You normalize just over the feature axis.
In group normalization with G=1 (the group reshape is then trivial), the mean you calculate is:
```python
mean = reduce_mean(x, axis=[1, 2], keepdims=True)  # shape [B, 1, 1]
```
You normalize over all axes except the batch axis and the newly added group axis (which is irrelevant when G=1).
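To make the difference concrete, here is a minimal sketch of both reductions (assuming TensorFlow, since reduce_mean above is TF-style; plain np.mean with the same axis arguments behaves identically):

```python
import tensorflow as tf

x = tf.random.normal([2, 5, 7])  # [B, T, F]

# Layer normalization: one mean per (batch, time) position, reduced over F.
ln_mean = tf.reduce_mean(x, axis=-1, keepdims=True)      # shape [2, 5, 1]

# Group normalization with G=1: one mean per batch entry, reduced over T and F.
gn_mean = tf.reduce_mean(x, axis=[1, 2], keepdims=True)  # shape [2, 1, 1]

print(ln_mean.shape, gn_mean.shape)
# The layer-norm statistics vary along T; the group-norm ones cannot.
print(float(tf.math.reduce_std(ln_mean)))  # generally nonzero
```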
Or do I misunderstand something? I ask because the same incorrect statement is in the original group-normalization paper.
The figure from the paper (also shown here) is misleading as well:
In this figure, it looks as if layer normalization normalizes over H/W too. But that is not the case (at least not commonly, and not with the default options).
So the figure is wrong about layer normalization: it would normalize only over C, not over H/W.
But the figure is correct for group-normalization as you have implemented it (it normalizes over all axes except N/G).
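The mismatch can also be checked on the layers themselves. A minimal sketch, assuming tf.keras.layers.LayerNormalization with its default axis=-1 and the GroupNormalization layer that ships with Keras >= 2.11 (in older setups it lives in tensorflow_addons as tfa.layers.GroupNormalization):

```python
import tensorflow as tf

x = tf.random.normal([2, 5, 7])  # [B, T, F]

ln = tf.keras.layers.LayerNormalization()          # default: normalize over axis=-1
gn = tf.keras.layers.GroupNormalization(groups=1)  # normalizes over T and F

# If the documented claim held, the outputs would be identical.
diff = tf.reduce_max(tf.abs(ln(x) - gn(x)))
print(float(diff))  # nonzero whenever the per-timestep statistics differ
```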
I also formulated the question here.