# Details of theoretical computational costs

This document explains how chainer-computational-cost estimates the theoretical computational cost of each type of layer. Unless otherwise specified, $x$ stands for the input to the layer and $y$ for its output. $|x|$ is the number of elements in $x$ (equivalent to `x.size`); if $x$ is empty or does not exist, $|x| = 0$.

The basic strategy of how the "theoretical computational cost" is defined is written in README.md.

## Table of contents

* [Activation](#activation)
* [Array](#array)
* [Connection](#connection)
* [Math](#math)
* [Normalization](#normalization)
* [Pooling](#pooling)

## Activation

PReLU's max operation can be executed by a single instruction; chainer-computational-cost considers it one floating point operation per element.

In case $W$ is neither a scalar nor the same shape as $x$, it is broadcasted internally, but the cost of the broadcast operation is ignored.

|Item|Value|
|:---|:---|
|FLOPs|$\|x\|$|
|mread|$\|x\| + \|W\|$, where $W$ is the learned parameter|
|mwrite|$\|x\|$|
|params|`w_shape`: shape of $W$|

ReLU's max operation can be executed by a single instruction; chainer-computational-cost considers it one floating point operation per element.

|Item|Value|
|:---|:---|
|FLOPs|$\|x\|$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|N/A|

Leaky ReLU applies an elementwise multiplication by the slope gradient followed by a max operation. Strictly speaking, the number of multiplications to be conducted depends on the input data, but for simplicity chainer-computational-cost treats Leaky ReLU as $2|x|$ FLOPs.

|Item|Value|
|:---|:---|
|FLOPs|$2\|x\|$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|N/A|

Sigmoid is an elementwise operation that consists of four floating point operations (negation, exp, +1 and inverse) for each element.

|Item|Value|
|:---|:---|
|FLOPs|$4\|x\|$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|N/A|

Here, let $c$ be the depth of the input over the specified axis (`c = x.shape[axis]`) and $s$ be the product of the spatial size (the size of $x$ without the axis `axis`), so that $|x| = cs$.

In Softmax, exp is first calculated for each element, which is $|x|$ FLOPs. Then the results are summed over the axis `axis`, which requires $c - 1$ FLOPs for each spatial position ($(c-1)s$ FLOPs in total). Finally, each element is divided by the sum ($|x|$ FLOPs).

|Item|Value|
|:---|:---|
|FLOPs|$2\|x\| + (c-1)s$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|`axis`: axis of the softmax operation|
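
As a sanity check, here is a small sketch (not part of chainer-computational-cost; `softmax_cost` is a hypothetical helper) that reproduces the formulas above for a given input shape and axis:

```python
import numpy as np

def softmax_cost(x_shape, axis=1):
    size = int(np.prod(x_shape))        # |x|
    c = x_shape[axis]                   # depth along the softmax axis
    s = size // c                       # product of the remaining dimensions
    flops = size + (c - 1) * s + size   # exp, per-position sum, division
    return {'flops': flops, 'mread': size, 'mwrite': size}

print(softmax_cost((1, 1000), axis=1))  # -> {'flops': 2999, 'mread': 1000, 'mwrite': 1000}
```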

## Array

As index calculation is ignored in chainer-computational-cost, broadcasting is theoretically 0 FLOPs.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|x\|$|
|mwrite|$\|y\|$|
|params|`shape`: output shape|

Concatenation is just a memory copy, hence no floating point operations are conducted. Depending on the concatenation axis, index calculation might be needed, but chainer-computational-cost ignores such operations.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\sum_i \|x_i\|$|
|mwrite|$\sum_i \|x_i\|$|
|params|`axis`: concatenation axis|

Reshape operation basically neither changes nor reads the data. It just makes an array with the same data reference and different metadata.

If your environment cannot do in-place reshape, consider overwriting it by a custom cost calculator (refer to README.md).

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$0$|
|mwrite|$0$|
|params|`in_shape`: input shape, `out_shape`: output shape|

In the bilinear resize operation, each output pixel value is calculated from 4 neighboring source pixels in the input image.

In order to know where the source pixels are, a few index calculations including floating point arithmetic are needed, but these are ignored since chainer-computational-cost ignores such index calculations.

To calculate an output value from 4 source pixels, 3 FLOPs are first spent for horizontal interpolation from 2 source pixels, then another 3 FLOPs for vertical, and finally 3 FLOPs for inter-axis interpolation, therefore 9 FLOPs in total for each output pixel.

As for memory access, especially in the case of expanding the source image, duplicated memory reads will happen. For example, if the input image is 8x8 and the output size is 32x32, naive memory reads occur 4096 times, even though the actual size of the input is only 64. In order to avoid such a contradiction, chainer-computational-cost introduces a policy of treating such a case as if it loads the entire input data only once.

Conversely, in case not all of the image is needed (for example the input is 32x32 and the output is 8x8, where only 256 memory reads occur), chainer-computational-cost simply counts 4 memory reads for each output pixel.

Whichever is smaller is used for the memory read estimation. In other words, memory read is formulated as $\min(|x|, 4|y|)$.

Considering extreme cases like shrinking horizontally while expanding vertically, this logic would have to be much more complicated, but for simplicity chainer-computational-cost only uses the above formula.

|Item|Value|
|:---|:---|
|FLOPs|$9\|y\|$|
|mread|$\min(\|x\|, 4\|y\|)$|
|mwrite|$\|y\|$|
|params|`size`: output size|
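
The following sketch (a hypothetical `resize_images_cost` helper, not the library's implementation) shows how the 9-FLOPs-per-pixel rule and the $\min(|x|, 4|y|)$ read policy combine, assuming a CHW input:

```python
def resize_images_cost(in_shape, out_size):
    """in_shape: (c, h_in, w_in); out_size: (h_out, w_out)."""
    c, h_in, w_in = in_shape
    h_out, w_out = out_size
    out_elems = c * h_out * w_out
    flops = 9 * out_elems                         # 3 interpolations x 3 FLOPs per output pixel
    mread = min(c * h_in * w_in, 4 * out_elems)   # read-once policy vs. 4 reads per output pixel
    return {'flops': flops, 'mread': mread, 'mwrite': out_elems}

print(resize_images_cost((3, 8, 8), (32, 32)))    # expanding: reads the input only once
print(resize_images_cost((3, 32, 32), (8, 8)))    # shrinking: 4 reads per output pixel
```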

Transpose operation is just a memory copy, with no FLOPs.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|`axes`: transpose axes|

GetItem extracts part of an array, which is a zero-FLOPs operation. Most tensor libraries have a view feature, which doesn't actually create a new array unless necessary, but this is not considered in chainer-computational-cost. Memory read runs only for the necessary elements, so both memory read and write are the same as the size of the output tensor.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|y\|$|
|mwrite|$\|y\|$|
|params|`slices`: list of slices; a slice is an int or a tuple with 3 elements|

## Connection

The convolution operator essentially calculates each output value by convolving the input feature map with the corresponding filter, whose size is $k_h k_w c_{in}$. The computational cost is $2 k_h k_w c_{in} - 1$ FLOPs for each output pixel. Including the bias it becomes simply $2 k_h k_w c_{in}$, so in total $2 k_h k_w c_{in} c_{out} h_{out} w_{out}$ FLOPs. Here, the output size $h_{out}$ and $w_{out}$ can be calculated by `chainer.utils.get_conv_outsize`.

If there is no bias, it will be $(2 k_h k_w c_{in} - 1) c_{out} h_{out} w_{out}$ FLOPs.

In case of grouped convolution with $G$ groups, it can be considered as just concatenating the results of convolutions whose input has $c_{in}/G$ channels and whose output has $c_{out}/G$ channels. Each group costs $2 k_h k_w \frac{c_{in}}{G} \frac{c_{out}}{G} h_{out} w_{out}$ FLOPs/group, so in total it is $2 k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$ FLOPs.

If `fma_1flop` is set to `True`, it will be $k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$ FLOPs. The existence of bias does not matter in this case.

Dilated convolution does not change the theoretical computational cost explained above, although it usually affects the actual performance significantly.

Although a source pixel can be read multiple times in the most naive convolution implementation, chainer-computational-cost counts it only once; the memory read is counted as if the entire input data and the parameters (weights, biases) were loaded from memory only once. Padding is ignored, too.

The FMA option can be switched by the `fma_1flop: bool` keyword argument specified to `ComputationalCostHook`.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$|
|FLOPs(no-FMA)|$2 k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$|
|FLOPs(no-FMA nobias)|$(2 k_h k_w \frac{c_{in}}{G} - 1) c_{out} h_{out} w_{out}$|
|mread|$\|x\| + k_h k_w \frac{c_{in} c_{out}}{G} + c_{out}$|
|mwrite|$\|y\|$|
|params|conv parameters `k`, `s`, `p`, `d`, `groups`, `nobias`|
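
For reference, here is a rough sketch of these formulas in Python. It assumes Chainer is installed and that `chainer.utils.get_conv_outsize` accepts the dilation argument `d` (an assumption that holds for recent Chainer versions); `conv2d_cost` itself is a made-up helper, not the library's cost calculator:

```python
from chainer.utils import get_conv_outsize

def conv2d_cost(x_shape, c_out, k, s, p, d=1, groups=1, nobias=False, fma_1flop=True):
    c_in, h_in, w_in = x_shape
    h_out = get_conv_outsize(h_in, k, s, p, d=d)
    w_out = get_conv_outsize(w_in, k, s, p, d=d)
    n_w = k * k * (c_in // groups) * c_out        # number of filter weights
    if fma_1flop:
        flops = n_w * h_out * w_out               # one FLOP per multiply-accumulate
    else:
        flops = 2 * n_w * h_out * w_out           # products and sums counted separately (incl. bias)
        if nobias:
            flops -= c_out * h_out * w_out        # remove the per-pixel bias addition
    mread = c_in * h_in * w_in + n_w + (0 if nobias else c_out)
    mwrite = c_out * h_out * w_out
    return flops, mread, mwrite

# e.g. a hypothetical 3x3 convolution applied to a 64x56x56 feature map
print(conv2d_cost((64, 56, 56), 128, k=3, s=1, p=1, fma_1flop=False))
```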

Deconvolution, also known as transposed convolution, can be thought of as going backward through the convolution operation. Each input pixel is multiplied by the convolution filter kernel ($k_h k_w$ values for each output channel). The results are summed up on the output tensor, among adjacent results and the results of the other filters (there are $c_{in}$ filters), and then the bias is added if it exists.

In order to understand the behavior of this operation and why it is called "transposed" convolution, these materials are helpful.

The theoretical computational cost is $2 k_h k_w c_{in} c_{out} h_{in} w_{in}$ FLOPs.

In case `groups` is not 1, similarly to convolution, it becomes $2 k_h k_w \frac{c_{in} c_{out}}{G} h_{in} w_{in}$ FLOPs.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$k_h k_w \frac{c_{in} c_{out}}{G} h_{in} w_{in}$|
|FLOPs(no-FMA)|$2 k_h k_w \frac{c_{in} c_{out}}{G} h_{in} w_{in}$|
|mread|$\|x\| + k_h k_w \frac{c_{in} c_{out}}{G} + c_{out}$|
|mwrite|$\|y\|$|
|params|conv parameters `k`, `s`, `p`, `d`, `groups`, `nobias`|
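
Similarly, a hedged sketch for deconvolution; `chainer.utils.get_deconv_outsize` is used only to obtain the output size for mwrite, and `deconv2d_cost` is a hypothetical helper, not the library's implementation:

```python
from chainer.utils import get_deconv_outsize

def deconv2d_cost(x_shape, c_out, k, s, p, groups=1, nobias=False, fma_1flop=True):
    c_in, h_in, w_in = x_shape
    h_out = get_deconv_outsize(h_in, k, s, p)
    w_out = get_deconv_outsize(w_in, k, s, p)
    n_w = k * k * c_in * (c_out // groups)        # number of filter weights
    flops = n_w * h_in * w_in                     # one multiply(-add) per weight and input position
    if not fma_1flop:
        flops *= 2                                # count multiplies and adds separately
    mread = c_in * h_in * w_in + n_w + (0 if nobias else c_out)
    mwrite = c_out * h_out * w_out
    return flops, mread, mwrite

# e.g. a hypothetical 4x4, stride-2 deconvolution that doubles the spatial size
print(deconv2d_cost((64, 56, 56), 32, k=4, s=2, p=1))
```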

In a Linear (fully-connected) layer, let $n_{in}$ be the input size and $n_{out}$ the output size. Each output value is calculated by $n_{in}$ products and $n_{in} - 1$ sums. So, in case `fma_1flop==False`, it costs $2 n_{in} - 1$ FLOPs/element, or $2 n_{in}$ if there is a bias. In FMA mode it is $n_{in}$ FLOPs/element (regardless of the existence of bias).

The FMA option can be switched by the `fma_1flop: bool` keyword argument specified to `ComputationalCostHook`.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$n_{in} n_{out}$|
|FLOPs(no-FMA)|$2 n_{in} n_{out}$|
|FLOPs(no-FMA no-bias)|$(2 n_{in} - 1) n_{out}$|
|mread|$n_{in} + n_{in} n_{out} + n_{out}$, where $W$ and $b$ are learned parameters|
|mwrite|$n_{out}$|
|params|`nobias`|
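
A minimal sketch of the Linear formulas (the `linear_cost` helper below is illustrative only, not part of the library):

```python
def linear_cost(n_in, n_out, nobias=False, fma_1flop=True):
    """Cost of a fully-connected layer mapping n_in inputs to n_out outputs."""
    if fma_1flop:
        flops = n_in * n_out                                  # one FLOP per multiply-add
    else:
        flops = (2 * n_in - (1 if nobias else 0)) * n_out     # products + sums (+ bias add)
    mread = n_in + n_in * n_out + (0 if nobias else n_out)    # x, W and b
    return flops, mread, n_out

print(linear_cost(4096, 1000, fma_1flop=False))  # e.g. a VGG-style classifier layer
```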

Shift only conducts memory read, index calculation and memory write. There might be unnecessary memory reads around the corners, but for simplicity chainer-computational-cost treats it as just reading the entire data.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|shift parameters `k` and `d`|

## Math

Elementwise Add operation. In case the shapes of the inputs are different, broadcast runs first and then the elementwise arithmetic is conducted. The cost of broadcasting is ignored. For simplicity, it is assumed that all of the $N$ input arrays are first broadcasted to the output size and then the elementwise sum is calculated. The output size is the largest size among the inputs.

|Item|Value|
|:---|:---|
|FLOPs|$(N - 1)\|y\|$|
|mread|$N\|y\|$|
|mwrite|$\|y\|$|
|params|N/A|
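
Under the simplification stated above (every operand treated as if it were already broadcast to the output size), the cost of an $N$-ary elementwise operation can be sketched as follows; `add_cost` is a hypothetical helper, and `np.broadcast_shapes` requires NumPy 1.20 or later:

```python
import numpy as np

def add_cost(shapes):
    """Cost of an N-ary elementwise Add over inputs with the given shapes."""
    out_shape = np.broadcast_shapes(*shapes)
    out = int(np.prod(out_shape))
    flops = (len(shapes) - 1) * out    # one addition per element and extra operand
    mread = len(shapes) * out          # as if every operand were broadcast first
    return flops, mread, out

print(add_cost([(1, 64, 56, 56), (1, 64, 1, 1)]))  # e.g. adding a per-channel bias map
```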

AddConstant is an elementwise Add operation where one operand is a constant (not a `chainer.Variable` but a `numpy.array` or a `cupy.array`).

In case the shapes of the inputs are different, broadcast runs first and then the elementwise arithmetic is conducted. The cost of broadcasting is ignored. The output size is the same as the larger of either the input ($x$) or the operand ($c$).

|Item|Value|
|:---|:---|
|FLOPs|$\|y\|$|
|mread|$\|x\| + \|c\|$|
|mwrite|$\|y\|$|
|params|N/A|

The other elementwise arithmetic operations (Sub, Mul and Div) have the same cost as Add; see the documentation for Add. Likewise, their constant-operand variants have the same cost as AddConstant; see the documentation for AddConstant.

In case `axis` is None, Max just calculates the maximum value of the entire tensor, which costs simply $|x| - 1$ FLOPs.

Here, let's consider a 4-dimensional array whose shape is $(s_0, s_1, s_2, s_3)$. When `axis` is set to `(1, 2)`:

* First it calculates the max over axis 1, which is $(s_1 - 1) s_0 s_2 s_3$ FLOPs, and this makes a tensor shaped $(s_0, s_2, s_3)$.
* Next, the max over the original axis 2 is conducted in $(s_2 - 1) s_0 s_3$ FLOPs, and a $(s_0, s_3)$ tensor remains.

The total FLOPs is just the sum of the above, and the output size is $s_0 s_3$.

Therefore, the FLOPs can be calculated by the following algorithm.

```python
def max_flops(x_shape, axes):
    """Input: x_shape, axes; output: f (FLOPs) and s (output size)."""
    s = 1
    for d in x_shape:
        s *= d          # s starts as the total number of elements |x|
    f = 0
    for i in axes:
        d = x_shape[i]
        s //= d         # elements remaining after reducing this axis
        f += (d - 1) * s
    return f, s
```

|Item|Value|
|:---|:---|
|FLOPs|See the above explanation|
|mread|$\|x\|$|
|mwrite|See the above explanation|
|params|`axis`|
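
For example, applying the `max_flops` sketch above to a hypothetical $(2, 3, 4, 5)$ array reduced over axes 1 and 2:

```python
f, s = max_flops((2, 3, 4, 5), axes=(1, 2))
# axis 1: (3-1)*2*4*5 = 80, then original axis 2: (4-1)*2*5 = 30
assert (f, s) == (110, 10)
```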

The theoretical cost of Min is exactly the same as Max. See the documentation for Max.

The theoretical cost of ArgMax is exactly the same as Min/Max, except that ArgMax can receive only one axis. See the documentation for Max.

The theoretical cost of ArgMin is exactly the same as Min/Max, except that ArgMin can receive only one axis. See the documentation for Max.

Sum over the specified axis (or axes) also costs the same as the Max operation, since it just replaces $\max$ by $+$. See the documentation for Max.

## Normalization

Test-mode batch normalization (FixedBatchNormalization) consists of a normalization part (using $\mu$ and $\sigma$) and a scale-and-bias part ($\gamma$ and $\beta$), both composed of an elementwise scale and shift. However, these can actually be fused into a single scale-and-shift operation. Therefore, regardless of the existence of $\gamma$ and $\beta$, the computational cost is always $2|x|$ FLOPs.

Since a scale-and-shift operation can be done by FMA, it becomes $|x|$ FLOPs if `fma_1flop` is set to `True`.

For the same reason as explained above, reading the learned scale and shift parameters is required only once (not twice) regardless of the existence of bias. Both are 1-dimensional arrays with $c$ elements, where $c$ is the number of channels.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$\|x\|$|
|FLOPs(no-FMA)|$2\|x\|$|
|mread|$\|x\| + 2c$|
|mwrite|$\|x\|$|
|params|`eps`: epsilon for BN|
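
A small sketch of the FixedBatchNormalization cost, assuming an NCHW input whose channel axis is 1 (`fixed_bn_cost` is a made-up helper):

```python
import numpy as np

def fixed_bn_cost(x_shape, fma_1flop=True):
    size = int(np.prod(x_shape))
    c = x_shape[1]                             # channel axis of an NCHW input (assumption)
    flops = size if fma_1flop else 2 * size    # fused scale-and-shift per element
    mread = size + 2 * c                       # input + fused scale and shift vectors
    return flops, mread, size

print(fixed_bn_cost((1, 64, 56, 56)))
```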

Let us assume that `axis` is the channel axis; then each spatial position has a $c$-dimensional vector. Calculating the L2-norm of this vector in no-FMA mode first applies an elementwise square ($c$ FLOPs), then a summation ($c - 1$ FLOPs) and finally a square root ($1$ FLOP), so $2c$ FLOPs in total. In FMA mode it is $c$ FLOPs for the square and sum, then $1$ FLOP for the square root, so $c + 1$ FLOPs in total.

Then $\epsilon$ is added to the L2-norm and an elementwise division is applied, which is another $1 + c$ FLOPs for each spatial position.

Hence, the total cost of L2-normalizing a $(c, h, w)$ array over the channel axis is $hw(3c + 1)$ FLOPs in no-FMA mode, or $2hw(c + 1)$ FLOPs with FMA.

Chainer's NormalizeL2 implementation supports an `axis` with up to 2 elements, but this is undocumented, so chainer-computational-cost assumes that `axis` is a single dimension.

In the table below, a 3-dimensional array with shape $(c, h, w)$ is assumed and `axis` is the channel dimension, but any other shape/axis is treated in the same way.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$2hw(c + 1)$, when the shape of $x$ is $(c, h, w)$ and `axis` is 0|
|FLOPs(no-FMA)|$hw(3c + 1)$, when the shape of $x$ is $(c, h, w)$ and `axis` is 0|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|`axis`|
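
Assuming the $(c, h, w)$/axis-0 accounting above, a sketch of the NormalizeL2 cost (`normalize_l2_cost` is a hypothetical helper):

```python
def normalize_l2_cost(c, h, w, fma_1flop=True):
    """L2 normalization over the channel axis of a (c, h, w) input."""
    if fma_1flop:
        norm = c + 1              # fused square-and-accumulate, plus sqrt
    else:
        norm = 2 * c              # square (c) + sum (c-1) + sqrt (1)
    per_position = norm + 1 + c   # + eps addition + elementwise division
    flops = per_position * h * w
    size = c * h * w
    return flops, size, size

print(normalize_l2_cost(64, 56, 56, fma_1flop=False))
```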

Let $c$, $h$ and $w$ be the shape of $x$, so $|x| = chw$. First, all the values are squared ($|x|$ FLOPs). Then they are integrated along the channel axis ($|x|$ FLOPs). The sum of the local response for each value can then be calculated by a subtraction ($|x|$ FLOPs in total). Then an elementwise multiplication by $\alpha$, an addition of $k$ and a power by $\beta$ are conducted ($3|x|$ FLOPs, or $2|x|$ if `fma_1flop` is set to `True`, as the multiplication and addition can be done by FMA).

In total, the FLOPs is the sum of the above: $6|x|$ with no-FMA and $5|x|$ with FMA.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$5\|x\|$|
|FLOPs(no-FMA)|$6\|x\|$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|`n`, `k`, `alpha` and `beta`|

## Pooling

Each output pixel is calculated by averaging $k_h k_w$ elements from the input ($k_h k_w$ FLOPs). The output size is calculated by `chainer.utils.get_conv_outsize`.

|Item|Value|
|:---|:---|
|FLOPs|$k_h k_w \|y\|$|
|mread|$\|x\|$|
|mwrite|$\|y\|$|
|params|AvgPooling parameters `k`, `s` and `p`|
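
A sketch of the AveragePooling2D cost for an NCHW input, using `chainer.utils.get_conv_outsize` for the output size (`avg_pooling_cost` is a made-up helper; for MaxPooling2D the only change would be $k_h k_w - 1$ FLOPs per output pixel):

```python
from chainer.utils import get_conv_outsize

def avg_pooling_cost(x_shape, k, s, p):
    """Cost of average pooling over an (n, c, h, w) input."""
    n, c, h_in, w_in = x_shape
    h_out = get_conv_outsize(h_in, k, s, p)
    w_out = get_conv_outsize(w_in, k, s, p)
    out = n * c * h_out * w_out
    flops = out * k * k                 # k*k-element average per output pixel
    return flops, n * c * h_in * w_in, out

print(avg_pooling_cost((1, 64, 56, 56), k=2, s=2, p=0))
```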

Each output pixel is calculated by taking the max of $k_h k_w$ elements from the input ($k_h k_w - 1$ FLOPs). The output size is calculated by `chainer.utils.get_conv_outsize`.

|Item|Value|
|:---|:---|
|FLOPs|$(k_h k_w - 1)\|y\|$|
|mread|$\|x\|$|
|mwrite|$\|y\|$|
|params|MaxPooling parameters `k`, `s` and `p`|

Upsampling2D only reads the data from memory and writes it to certain positions in the output using the indices; the other output pixels are filled with 0. The indices array always has the same shape as the input. Although its data type is not float but int, their data sizes are usually the same (float32 and int32), so chainer-computational-cost ignores this difference and counts the indices the same as the input.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$2\|x\|$|
|mwrite|$\|y\|$|
|params|Upsampling parameters `k`, `s`, `p`, `outsize` and `cover_all`|

Unpooling2D only reads the data from memory and writes it to certain positions in the output. Unlike Upsampling2D, it does not use indices, and all output pixels are filled with the corresponding pixels in the input tensor.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|x\|$|
|mwrite|$\|y\|$|
|params|Unpooling parameters `k`, `s`, `p`, `outsize` and `cover_all`|