This document explains how chainer-computational-cost estimates the theoretical computational cost of each type of layer.
Unless otherwise specified, $x$ stands for the input to the layer and $y$ is the output. $\| x \|$ is the number of elements in $x$ (equivalent to `x.size`); if $x$ is empty or does not exist, $\| x \| = 0$.
The basic strategy of how the "theoretical computational cost" is defined is written in README.md.
For PReLU, the max operation can be executed by a single instruction, so chainer-computational-cost considers it as one floating point operation per element.
In case $W$ is neither a scalar nor the same shape as $x$, it is broadcasted internally, but the cost of the broadcasting operation is ignored.
Item | Value |
---|---|
FLOPs | $\| x \|$ |
mread | $\| x \| + \| W \|$, where $W$ is the learned parameter |
mwrite | $\| x \|$ |
params | `w_shape`: shape of $W$ |
The same holds for ReLU: the max operation can be executed by a single instruction, and chainer-computational-cost considers it as one floating point operation per element.
Item | Value |
---|---|
FLOPs | $\| x \|$ |
mread | $\| x \|$ |
mwrite | $\| x \|$ |
params | N/A |
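As an illustration of how such an elementwise layer is costed, a cost calculator can be written as a small function that returns FLOPs, memory read, memory write and a params dict. The signature and the 4-tuple return convention below follow the custom cost calculator interface mentioned in README.md, but treat the exact details as an assumption; this is a sketch, not the library's built-in implementation.

```python
import numpy as np


def calc_relu_like(func, in_data, **kwargs):
    """Sketch of a cost calculator for a 1-FLOP-per-element activation."""
    x, = in_data
    flops = x.size   # one max instruction per element
    mread = x.size   # the whole input is read once
    mwrite = x.size  # the whole output is written once
    return (flops, mread, mwrite, {})


# Example: a (1, 3, 224, 224) input costs 150528 FLOPs.
x = np.zeros((1, 3, 224, 224), dtype=np.float32)
print(calc_relu_like(None, (x,)))
```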
Leaky ReLU applies an elementwise multiplication with the slope gradient followed by a max operation. Strictly speaking, the number of multiplications to be conducted depends on the input data, but for simplicity chainer-computational-cost treats Leaky ReLU as $2 \| x \|$ FLOPs.
Item | Value |
---|---|
FLOPs | $2 \| x \|$ |
mread | $\| x \|$ |
mwrite | $\| x \|$ |
params | N/A |
Sigmoid is an elementwise operation that consists of four floating point operations, which are minus, exp, +1 and inv, for each element.
Item | Value |
---|---|
FLOPs | $4 \| x \|$ |
mread | $\| x \|$ |
mwrite | $\| x \|$ |
params | N/A |
Here, let $c$ be the depth of the input $x$ over the specified axis (`c = x.shape[axis]`) and $s$ be the product of the remaining dimensions (the size of $x$ without the `axis` axis).
In Softmax, exp is first calculated for each element, which is $\| x \|$ FLOPs.
Then the values are summed over the `axis` axis, which requires $c - 1$ FLOPs for each spatial position ($s(c - 1)$ FLOPs).
Finally, each element is divided by the sum ($\| x \|$ FLOPs).
Item | Value |
---|---|
FLOPs | $2 \| x \| + s(c - 1)$ |
mread | $\| x \|$ |
mwrite | $\| x \|$ |
params | `axis`: axis of the softmax operation |
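The Softmax estimate above can be reproduced with a few lines of arithmetic (an illustrative helper, not the library's actual calculator):

```python
import numpy as np


def softmax_flops(x, axis=1):
    """FLOPs of Softmax as estimated above: exp, sum over axis, divide."""
    c = x.shape[axis]   # depth over the softmax axis
    s = x.size // c     # product of the remaining dimensions
    return x.size + s * (c - 1) + x.size


x = np.zeros((2, 10, 8, 8), dtype=np.float32)
print(softmax_flops(x, axis=1))  # 1280 + 128 * 9 + 1280 = 3712
```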
As index calculation is ignored in chainer-computational-cost, broadcasting is theoretically 0 FLOPs.
Item | Value |
---|---|
FLOPs | $0$ |
mread | $\| x \|$ |
mwrite | $\| y \|$ |
params | `shape`: output shape |
Concatenation is just a memory copy, hence no floating point operation is conducted. Depending on the concatenation axis, index calculation might be needed, but chainer-computational-cost ignores such operations.
Item | Value |
---|---|
FLOPs | $0$ |
mread | $\sum_i \| x_i \|$ |
mwrite | $\sum_i \| x_i \|$ |
params | `axis`: concatenation axis |
The Reshape operation basically neither changes nor reads the data; it just makes an array with the same data reference and different metadata.
If your environment cannot do an in-place reshape, consider overriding it with a custom cost calculator (refer to README.md).
Item | Value |
---|---|
FLOPs | $0$ |
mread | $0$ |
mwrite | $0$ |
params | `in_shape`: input shape, `out_shape`: output shape |
In the bilinear resize operation, each output pixel value is calculated from the 4 neighboring source pixels in the input image.
In order to know where the source pixels are, a few index calculations including floating point arithmetic are needed, but these are ignored since chainer-computational-cost ignores such index calculations.
To calculate an output value from 4 source pixels, first 3 FLOPs are spent for horizontal interpolation from 2 source pixels, then another 3 FLOPs for vertical, and finally 3 FLOPs for inter-axis interpolation, therefore in total 9 FLOPs for each output pixel.
As for memory access, especially in case of expanding the source image, duplicated memory reads will happen. For example, if the input image is 8x8 and the output size is 32x32, naively reading memory runs 4096 times, even though the actual size of the input is only 64. In order to avoid such a contradiction, chainer-computational-cost introduces a policy to treat such a case as if it loads the entire input data only once.
Conversely, in case not all of the image is needed (for example, the input is 32x32 and the output is 8x8, where the memory read is only 256), chainer-computational-cost simply counts 4 memory reads for each output pixel.
The smaller of the two is used for the memory read estimation. In other words, the memory read is formulated as $\min(\| x \|, 4 \| y \|)$.
Considering extreme cases like shrinking horizontally while expanding vertically, this logic would have to be much more complicated, but for simplicity chainer-computational-cost only uses the above formula.
Item | Value |
---|---|
FLOPs | $9 \| y \|$ |
mread | $\min(\| x \|, 4 \| y \|)$ |
mwrite | $\| y \|$ |
params | `size`: output size |
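The pieces above combine as follows; this is an illustrative helper (the `channels` argument is only for convenience and is not part of the library):

```python
def resize_images_cost(in_h, in_w, out_h, out_w, channels=1):
    """Estimate (FLOPs, mread, mwrite) of bilinear resize as described above."""
    in_size = channels * in_h * in_w
    out_size = channels * out_h * out_w
    flops = 9 * out_size                # 9 FLOPs per output pixel
    mread = min(in_size, 4 * out_size)  # whichever is smaller
    mwrite = out_size
    return flops, mread, mwrite


print(resize_images_cost(8, 8, 32, 32))   # expanding: the input is read only once (64)
print(resize_images_cost(32, 32, 8, 8))   # shrinking: 4 reads per output pixel (256)
```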
Transpose operation is just copying memory with no FLOPs.
Item | Value |
---|---|
FLOPs | $0$ |
mread | $\| x \|$ |
mwrite | $\| x \|$ |
params | `axes`: transpose axes |
GetItem extracts part of an array; this operation is zero FLOPs. Most tensor libraries have a view feature, which doesn't actually create a new array unless necessary, but this is not considered in chainer-computational-cost. Memory read runs only for the necessary elements, so both memory read and memory write are the same as the size of the output tensor.
Item | Value |
---|---|
FLOPs | $0$ |
mread | $\| y \|$ |
mwrite | $\| y \|$ |
params | `slices`: list of slices, a slice is an int or a tuple with 3 elements |
The convolution operator calculates each output value by convolving the input feature map with the corresponding filter, whose size is $k_h k_w c_{in}$. The computational cost is $2 k_h k_w c_{in} - 1$ FLOPs for each output pixel. Including the bias it becomes simply $2 k_h k_w c_{in}$, so in total $2 k_h k_w c_{in} c_{out} h_{out} w_{out}$ FLOPs. Here, the output size $h_{out}$ and $w_{out}$ can be calculated by `chainer.utils.get_conv_outsize`.
If there is no bias, it will be $(2 k_h k_w c_{in} - 1) c_{out} h_{out} w_{out}$ FLOPs.
In case of grouped convolution with $G$ groups, it can be considered as just concatenating the results of $G$ convolutions whose input has $c_{in}/G$ channels and whose output has $c_{out}/G$ channels. Each group costs $2 k_h k_w \frac{c_{in}}{G} \frac{c_{out}}{G} h_{out} w_{out}$ FLOPs/group, so in total $2 k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$ FLOPs.
If `fma_1flop` is set to `True`, it will be $k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$ FLOPs. Existence of bias does not matter in this case.
Dilated convolution does not change the theoretical computational cost explained above, although it usually affects the actual performance significantly.
Although a source pixel can be read multiple times in the most naive convolution implementation, chainer-computational-cost counts each only once; the memory read is counted as if the entire input data and parameters (weights and biases) were loaded from memory only once. Padding is ignored, too.
The FMA option can be switched by the `fma_1flop: bool` keyword argument specified to `ComputationalCostHook`.
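For reference, the hook is typically used as in the sketch below, which follows the README example; the exact arguments of `show_report` may differ depending on the version you use.

```python
import chainer
import chainer.links as L
import numpy as np

from chainer_computational_cost import ComputationalCostHook

net = L.VGG16Layers()
x = np.random.random((1, 3, 224, 224)).astype(np.float32)

with chainer.no_backprop_mode(), chainer.using_config('train', False):
    with ComputationalCostHook(fma_1flop=True) as cost:
        y = net(x)
        cost.show_report(unit='G', mode='md')  # per-layer report in GFLOPs
```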
Item | Value |
---|---|
FLOPs(FMA) | $k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$ |
FLOPs(no-FMA) | $2 k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$ |
FLOPs(no-FMA nobias) | $(2 k_h k_w \frac{c_{in}}{G} - 1) c_{out} h_{out} w_{out}$ |
mread | $\| x \| + k_h k_w \frac{c_{in} c_{out}}{G} + c_{out}$ |
mwrite | $c_{out} h_{out} w_{out}$ |
params | conv parameters `k`, `s`, `p`, `d`, `groups`, `nobias` |
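The formulas in the table can be sketched in a few lines using `chainer.utils.get_conv_outsize`; this helper is only illustrative and is not the library's internal calculator.

```python
from chainer.utils import get_conv_outsize


def conv2d_flops(c_in, c_out, h_in, w_in, k, s, p, d=1, groups=1,
                 nobias=False, fma_1flop=True):
    """Estimate Convolution2D FLOPs as described above (square kernel)."""
    h_out = get_conv_outsize(h_in, k, s, p, d=d)
    w_out = get_conv_outsize(w_in, k, s, p, d=d)
    per_pixel = k * k * c_in // groups
    if fma_1flop:
        return per_pixel * c_out * h_out * w_out     # one FLOP per FMA
    flops = 2 * per_pixel * c_out * h_out * w_out    # multiply + add (+ bias)
    if nobias:
        flops -= c_out * h_out * w_out               # no bias addition
    return flops


print(conv2d_flops(64, 128, 56, 56, k=3, s=1, p=1))  # 3x3 convolution, FMA mode
```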
Deconvolution, also known as transposed convolution, can be thought of as going backward through the convolution operation. Each input pixel is multiplied by the convolution filter kernel ($k_h k_w c_{out}$ values). The results are summed up on the output tensor, among adjacent results and the results of the other filters (there are $c_{in}$ filters), and then the bias is added if it exists.
In order to understand the behavior of this operation and why it is called "transposed" convolution, these materials are helpful.
- Up-sampling with Transposed Convolution - Towards Data Science
- Convolution arithmetic tutorial - Theano 1.0.0 documentation
The theoretical computational cost is $2 k_h k_w c_{in} c_{out} h_{in} w_{in}$ FLOPs.
In case `groups` is not 1, similarly to convolution, it becomes $2 k_h k_w \frac{c_{in} c_{out}}{G} h_{in} w_{in}$ FLOPs.
Item | Value |
---|---|
FLOPs(FMA) | $k_h k_w \frac{c_{in} c_{out}}{G} h_{in} w_{in}$ |
FLOPs(no-FMA) | $2 k_h k_w \frac{c_{in} c_{out}}{G} h_{in} w_{in}$ |
mread | $\| x \| + k_h k_w \frac{c_{in} c_{out}}{G} + c_{out}$ |
mwrite | $c_{out} h_{out} w_{out}$ |
params | conv parameters `k`, `s`, `p`, `d`, `groups`, `nobias` |
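Similarly, a minimal sketch of the deconvolution estimate, driven by the input spatial size (illustrative only):

```python
def deconv2d_flops(c_in, c_out, h_in, w_in, k, groups=1, fma_1flop=True):
    """Estimate Deconvolution2D FLOPs as described above (square kernel)."""
    flops = k * k * c_in * c_out * h_in * w_in // groups
    return flops if fma_1flop else 2 * flops


print(deconv2d_flops(128, 64, 28, 28, k=2))  # 2x2 transposed convolution, FMA mode
```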
For a fully-connected (Linear) layer, let $c_{in}$ be the input size and $c_{out}$ be the output size.
Each output value is calculated by $c_{in}$ product and $c_{in} - 1$ sum operations.
So, in case `fma_1flop==False`, it is $2 c_{in} - 1$ FLOPs/element, or $2 c_{in}$ if there is a bias.
In FMA mode it is $c_{in}$ FLOPs/element (regardless of the existence of a bias).
The FMA option can be switched by the `fma_1flop: bool` keyword argument specified to `ComputationalCostHook`.
Item | Value |
---|---|
FLOPs(FMA) | $c_{in} c_{out}$ |
FLOPs(no-FMA) | $2 c_{in} c_{out}$ |
FLOPs(no-FMA no-bias) | $(2 c_{in} - 1) c_{out}$ |
mread | $c_{in} + c_{in} c_{out} + c_{out}$, where $W$ ($c_{in} c_{out}$ elements) and $b$ ($c_{out}$ elements) are the learned parameters |
mwrite | $c_{out}$ |
params | `nobias` |
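A minimal sketch of the Linear estimate under both modes (hypothetical helper):

```python
def linear_flops(c_in, c_out, nobias=False, fma_1flop=True):
    """Estimate Linear (fully-connected) FLOPs as described above."""
    if fma_1flop:
        return c_in * c_out                            # one FMA per weight
    per_output = 2 * c_in - 1 if nobias else 2 * c_in  # multiplies + adds (+ bias)
    return per_output * c_out


print(linear_flops(4096, 1000))                    # FMA mode
print(linear_flops(4096, 1000, fma_1flop=False))   # no-FMA mode with bias
```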
Shift only conducts memory read, index calculation and memory write. There might be unnecessary memory reads around the corners, but for simplicity chainer-computational-cost treats it as just reading the entire input once.
Item | Value |
---|---|
FLOPs | $0$ |
mread | $\| x \|$ |
mwrite | $\| x \|$ |
params | shift parameters `k` and `d` |
Elementwise Add operation. In case the shapes of the inputs are different, broadcast will run and then the elementwise arithmetic is conducted. The cost of broadcasting is ignored. For simplicity, it is assumed that all the arrays are first broadcasted to the output size and then the elementwise sum is calculated. The output size is the largest size among the inputs.
Item | Value |
---|---|
FLOPs | $(N - 1) \| y \|$, where $N$ is the number of inputs |
mread | $N \| y \|$ |
mwrite | $\| y \|$ |
params | N/A |
AddConstant is an elementwise Add operation where the operand is a constant (not a chainer.Variable but a numpy.array or a cupy.array).
In case the shapes of the inputs are different, broadcast will run and then the elementwise arithmetic is conducted. The cost of broadcasting is ignored.
The output size is the same as the larger of either the input ($x$) or the operand ($c$).
Item | Value |
---|---|
FLOPs | $\| y \|$ |
mread | $\| x \| + \| c \|$ |
mwrite | $\| y \|$ |
params | N/A |
The other elementwise arithmetic operations (Sub, Mul and Div) have the same theoretical cost as Add; see the documentation for Add.
Likewise, their constant-operand variants have the same cost as AddConstant; see the documentation for AddConstant.
For Max and Min, in case `axis` is `None`, it simply calculates the maximum (or minimum) value of the whole tensor, which costs $\| x \| - 1$ FLOPs.
Here, let's consider a 4-dimensional array whose shape is $(n, c, h, w)$. When `axis` is set to `(1, 2)`:
- First, max over axis 1 is calculated, which is $n(c - 1)hw$ FLOPs, and this makes a tensor shaped $(n, h, w)$.
- Next, max over the original axis 2 is conducted in $n(h - 1)w$ FLOPs, and an $(n, w)$ tensor remains.

The total FLOPs is just the sum of the above, and the output size is $nw$.
Therefore, the FLOPs and the output size can be calculated by the following algorithm:

```python
def max_cost(x, axes):
    # x: input ndarray, axes: tuple of axes to take max (or min) over
    s = x.size  # output size
    f = 0       # FLOPs
    for i in axes:
        d = x.shape[i]
        s //= d
        f += (d - 1) * s
    return f, s
```
Item | Value |
---|---|
FLOPs | See the above explanation |
mread | $\| x \|$ |
mwrite | See the above explanation |
params | `axis` |
Min costs exactly the same as Max; see the documentation for Max.
The theoretical cost of Argmax is exactly the same as Min/Max, except that Argmax can receive only one axis. See the documentation for Max.
The theoretical cost of Argmin is exactly the same as Min/Max, except that Argmin can receive only one axis. See the documentation for Max.
Sum of an array over the specified axis (or axes) also costs the same as the max operation, since it just replaces $\max$ with $+$. See the documentation for Max.
Test-mode batch normalization consists of a normalization part (using $\mu$ and $\sigma$) and a bias part ($\gamma$ and $\beta$), both composed of an elementwise scale and shift. However, these can actually be fused into a single scale-and-shift operation. Therefore, regardless of the existence of the bias part ($\gamma$ and $\beta$), the computational cost is always $2 \| x \|$ FLOPs.
Since a scale-and-shift operation can be done by FMA, it becomes $\| x \|$ FLOPs if `fma_1flop` is set to `True`.
For the same reason as explained above, reading the learned scale and shift parameters is required only once (not twice) regardless of the existence of the bias part. Both are 1-dimensional arrays with $c$ elements, where $c$ is the number of channels.
Item | Value |
---|---|
FLOPs(FMA) | $\| x \|$ |
FLOPs(no-FMA) | $2 \| x \|$ |
mread | $\| x \| + 2c$ |
mwrite | $\| x \|$ |
params | `eps`: epsilon for BN |
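The fused scale-and-shift view gives the following simple estimate (an illustrative sketch; `c` stands for the number of channels):

```python
def fixed_bn_cost(x_size, c, fma_1flop=True):
    """Estimate (FLOPs, mread, mwrite) of test-mode BatchNormalization."""
    flops = x_size if fma_1flop else 2 * x_size  # fused scale-and-shift
    mread = x_size + 2 * c                       # input + fused scale and shift vectors
    mwrite = x_size
    return flops, mread, mwrite


print(fixed_bn_cost(1 * 64 * 56 * 56, c=64))
```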
For NormalizeL2, let us assume that `axis` is the channel axis; then each spatial position has a $c$-dimensional vector.
Calculating the L2-norm of this vector with no FMA first applies an elementwise square in $c$ FLOPs and a summation in $c - 1$ FLOPs, and finally a sqrt in $1$ FLOP, so in total $2c$ FLOPs.
With FMA, it is $c$ FLOPs for the square-and-sum and $1$ FLOP for the sqrt, in total $c + 1$ FLOPs.
Then $\epsilon$ is added to the L2-norm and an elementwise division is applied, in total $c + 1$ FLOPs.
Hence, the total cost for L2-normalizing each vector is $3c + 1$ FLOPs with no FMA, or $2c + 2$ FLOPs with FMA.
Chainer's NormalizeL2 implementation supports an `axis` with up to 2 elements, but this is undocumented, so chainer-computational-cost assumes that `axis` is a single dimension.
In the table below, a 3-dimensional array with shape $(c, h, w)$ is assumed and the axis is the channel dimension, but any other shape/axis is handled in the same way.
Item | Value |
---|---|
FLOPs(FMA) | $(2c + 2)hw$, when the shape of $x$ is $(c, h, w)$ and `axis` is 0 |
FLOPs(no-FMA) | $(3c + 1)hw$, when the shape of $x$ is $(c, h, w)$ and `axis` is 0 |
mread | $\| x \|$ |
mwrite | $\| x \|$ |
params | `axis` |
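For a $(c, h, w)$ input normalized over axis 0, the per-vector counts above expand to the following (illustrative sketch):

```python
def normalize_l2_flops(c, h, w, fma_1flop=True):
    """Estimate NormalizeL2 FLOPs for an input of shape (c, h, w), axis=0."""
    per_vector = 2 * c + 2 if fma_1flop else 3 * c + 1
    return per_vector * h * w


print(normalize_l2_flops(256, 38, 38))  # 256-channel feature map
```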
For local response normalization (LRN), let $(c, h, w)$ be the shape of $x$, so $\| x \| = chw$.
First, all the values are squared ($\| x \|$ FLOPs).
Then they are integrated along the channel axis ($(c - 1)hw$ FLOPs).
The sum of the local response for each value can then be calculated by a subtraction (in total $\| x \|$ FLOPs).
Then an elementwise multiplication by $\alpha$, an addition of $k$ and a power by $\beta$ are conducted ($3 \| x \|$ FLOPs, or $2 \| x \|$ if `fma_1flop` is set to `True`, as the multiplication by $\alpha$ and the addition of $k$ can be done by FMA).
In total, the FLOPs is the sum of these: $5 \| x \| + (c - 1)hw$ with no-FMA and $4 \| x \| + (c - 1)hw$ with FMA.
Item | Value |
---|---|
FLOPs(FMA) | $4 \| x \| + (c - 1)hw$ |
FLOPs(no-FMA) | $5 \| x \| + (c - 1)hw$ |
mread | $\| x \|$ |
mwrite | $\| x \|$ |
params | `n`, `k`, `alpha` and `beta` |
For average pooling, each output pixel is calculated by averaging $k_h k_w$ elements from the input ($k_h k_w - 1$ additions and one division, i.e. $k_h k_w$ FLOPs). The output size is calculated by `chainer.utils.get_conv_outsize`.
Item | Value |
---|---|
FLOPs | $\| y \| k_h k_w$ |
mread | $\| x \|$ |
mwrite | $\| y \|$ |
params | AvgPooling parameters `k`, `s` and `p` |
For max pooling, each output pixel is calculated by taking the max of $k_h k_w$ elements from the input ($k_h k_w - 1$ FLOPs). The output size is calculated by `chainer.utils.get_conv_outsize`.
Item | Value |
---|---|
FLOPs | $\| y \| (k_h k_w - 1)$ |
mread | $\| x \|$ |
mwrite | $\| y \|$ |
params | MaxPooling parameters `k`, `s` and `p` |
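Both pooling estimates follow directly from the output size given by `chainer.utils.get_conv_outsize` (illustrative helper):

```python
from chainer.utils import get_conv_outsize


def pooling2d_flops(c, h_in, w_in, k, s, p, max_pool=False):
    """Estimate AveragePooling2D / MaxPooling2D FLOPs as described above."""
    h_out = get_conv_outsize(h_in, k, s, p)
    w_out = get_conv_outsize(w_in, k, s, p)
    out_size = c * h_out * w_out
    per_pixel = k * k - 1 if max_pool else k * k  # max: comparisons; avg: adds + divide
    return out_size * per_pixel


print(pooling2d_flops(64, 112, 112, k=2, s=2, p=0))                  # average pooling
print(pooling2d_flops(64, 112, 112, k=2, s=2, p=0, max_pool=True))   # max pooling
```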
Upsampling2D only reads the data from memory and writes it to certain positions in the output using the indices; the other output pixels are filled with 0.
The indices array always has the same shape as the input. Although its data type is not float but int, since their data sizes are usually the same (`float32` and `int32`), chainer-computational-cost ignores this difference and counts the indices the same as the input.
Item | Value |
---|---|
FLOPs | $0$ |
mread | $2 \| x \|$ (input and indices) |
mwrite | $\| y \|$ |
params | Upsampling parameters `k`, `s`, `p`, `outsize` and `cover_all` |
Unpooling2D only reads the data from memory and writes it to certain positions in the output. Unlike Upsampling2D, it does not use indices, and all output pixels are filled by the corresponding pixels in the input tensor.
Item | Value |
---|---|
FLOPs | $0$ |
mread | $\| x \|$ |
mwrite | $\| y \|$ |
params | Unpooling parameters `k`, `s`, `p`, `outsize` and `cover_all` |