# Details of theoretical computational costs

This document explains how chainer-computational-cost estimates the theoretical computational cost of each type of layer. Unless otherwise specified, $x$ stands for the input to the layer and $y$ for its output. $|x|$ is the number of elements in $x$ (equivalent to `x.size`); if $x$ is empty or does not exist, $|x| = 0$.

The basic strategy of how the "theoretical computational cost" is defined is written in README.md.

## Table of contents

* [Activation](#activation)
* [Array](#array)
* [Connection](#connection)
* [Math](#math)
* [Normalization](#normalization)
* [Pooling](#pooling)

## Activation

PReLU's max operation can be executed by a single instruction; chainer-computational-cost considers it one floating point operation per element.

In case $W$ is neither a scalar nor the same shape as $x$, it is broadcasted internally, but the cost of the broadcast operation is ignored.

|Item|Value|
|:---|:---|
|FLOPs|$\|x\|$|
|mread|$\|x\| + \|W\|$, where $W$ is the learned parameter|
|mwrite|$\|x\|$|
|params|`w_shape`: shape of $W$|

ReLU's max operation can be executed by a single instruction; chainer-computational-cost considers it one floating point operation per element.

|Item|Value|
|:---|:---|
|FLOPs|$\|x\|$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|N/A|

Leaky ReLU applies an elementwise multiplication by the slope gradient followed by a max operation. Strictly speaking, the number of multiplications to be conducted depends on the input data, but for simplicity chainer-computational-cost treats Leaky ReLU as $2|x|$ FLOPs.

|Item|Value|
|:---|:---|
|FLOPs|$2\|x\|$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|N/A|

Sigmoid is an elementwise operation that consists of four floating point operations (negation, exp, +1 and inverse) for each element.

|Item|Value|
|:---|:---|
|FLOPs|$4\|x\|$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|N/A|

Here, let $c$ be the depth of the input over the specified axis (`c = x.shape[axis]`) and $s$ be the product of the spatial size (the size of $x$ without the axis `axis`), so that $|x| = cs$.

In Softmax, exp is first calculated for each element, which is $|x|$ FLOPs. Then the results are summed over the axis `axis`, which requires $c - 1$ FLOPs for each spatial position ($(c-1)s$ FLOPs in total). Finally, each element is divided by the sum ($|x|$ FLOPs).

|Item|Value|
|:---|:---|
|FLOPs|$2\|x\| + (c-1)s$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|`axis`: axis of the softmax operation|
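
As a sanity check, here is a small sketch (not part of chainer-computational-cost; `softmax_cost` is a hypothetical helper) that reproduces the formulas above for a given input shape and axis:

```python
import numpy as np

def softmax_cost(x_shape, axis=1):
    size = int(np.prod(x_shape))        # |x|
    c = x_shape[axis]                   # depth along the softmax axis
    s = size // c                       # product of the remaining dimensions
    flops = size + (c - 1) * s + size   # exp, per-position sum, division
    return {'flops': flops, 'mread': size, 'mwrite': size}

print(softmax_cost((1, 1000), axis=1))  # -> {'flops': 2999, 'mread': 1000, 'mwrite': 1000}
```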

## Array

As index calculation is ignored in chainer-computational-cost, broadcasting is theoretically 0 FLOPs.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|x\|$|
|mwrite|$\|y\|$|
|params|`shape`: output shape|

Concatenation is just a memory copy, hence no floating point operations are conducted. Depending on the concatenation axis, index calculation might be needed, but chainer-computational-cost ignores such operations.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\sum_i \|x_i\|$|
|mwrite|$\sum_i \|x_i\|$|
|params|`axis`: concatenation axis|

Reshape operation basically neither changes nor reads the data. It just makes an array with the same data reference and different metadata.

If your environment cannot do in-place reshape, consider overwriting it by a custom cost calculator (refer to README.md).

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$0$|
|mwrite|$0$|
|params|`in_shape`: input shape, `out_shape`: output shape|

In the bilinear resize operation, each output pixel value is calculated from 4 neighboring source pixels in the input image.

In order to know where the source pixels are, a few index calculations including floating point arithmetic are needed, but these are ignored since chainer-computational-cost ignores such index calculations.

To calculate an output value from 4 source pixels, 3 FLOPs are first spent for horizontal interpolation from 2 source pixels, then another 3 FLOPs for vertical, and finally 3 FLOPs for inter-axis interpolation, therefore 9 FLOPs in total for each output pixel.

As for memory access, especially in the case of expanding the source image, duplicated memory reads will happen. For example, if the input image is 8x8 and the output size is 32x32, naive memory reads occur 4096 times, even though the actual size of the input is only 64. In order to avoid such a contradiction, chainer-computational-cost introduces a policy of treating such a case as if it loads the entire input data only once.

Conversely, in case not all of the image is needed (for example the input is 32x32 and the output is 8x8, where only 256 memory reads occur), chainer-computational-cost simply counts 4 memory reads for each output pixel.

Whichever is smaller is used for the memory read estimation. In other words, memory read is formulated as $\min(|x|, 4|y|)$.

Considering extreme cases like shrinking horizontally while expanding vertically, this logic would have to be much more complicated, but for simplicity chainer-computational-cost only uses the above formula.

|Item|Value|
|:---|:---|
|FLOPs|$9\|y\|$|
|mread|$\min(\|x\|, 4\|y\|)$|
|mwrite|$\|y\|$|
|params|`size`: output size|
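
The following sketch (a hypothetical `resize_images_cost` helper, not the library's implementation) shows how the 9-FLOPs-per-pixel rule and the $\min(|x|, 4|y|)$ read policy combine, assuming a CHW input:

```python
def resize_images_cost(in_shape, out_size):
    """in_shape: (c, h_in, w_in); out_size: (h_out, w_out)."""
    c, h_in, w_in = in_shape
    h_out, w_out = out_size
    out_elems = c * h_out * w_out
    flops = 9 * out_elems                         # 3 interpolations x 3 FLOPs per output pixel
    mread = min(c * h_in * w_in, 4 * out_elems)   # read-once policy vs. 4 reads per output pixel
    return {'flops': flops, 'mread': mread, 'mwrite': out_elems}

print(resize_images_cost((3, 8, 8), (32, 32)))    # expanding: reads the input only once
print(resize_images_cost((3, 32, 32), (8, 8)))    # shrinking: 4 reads per output pixel
```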

Transpose operation is just a memory copy, with no FLOPs.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|`axes`: transpose axes|

GetItem extracts part of an array, which is a zero-FLOPs operation. Most tensor libraries have a view feature, which doesn't actually create a new array unless necessary, but this is not considered in chainer-computational-cost. Memory read runs only for the necessary elements, so both memory read and write are the same as the size of the output tensor.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|y\|$|
|mwrite|$\|y\|$|
|params|`slices`: list of slices; a slice is an int or a tuple with 3 elements|

## Connection

The convolution operator essentially calculates each output value by convolving the input feature map with the corresponding filter, whose size is $k_h k_w c_{in}$. The computational cost is $2 k_h k_w c_{in} - 1$ FLOPs for each output pixel. Including the bias it becomes simply $2 k_h k_w c_{in}$, so in total $2 k_h k_w c_{in} c_{out} h_{out} w_{out}$ FLOPs. Here, the output size $h_{out}$ and $w_{out}$ can be calculated by `chainer.utils.get_conv_outsize`.

If there is no bias, it will be $(2 k_h k_w c_{in} - 1) c_{out} h_{out} w_{out}$ FLOPs.

In case of grouped convolution with $G$ groups, it can be considered as just concatenating the results of convolutions whose input has $c_{in}/G$ channels and whose output has $c_{out}/G$ channels. Each group costs $2 k_h k_w \frac{c_{in}}{G} \frac{c_{out}}{G} h_{out} w_{out}$ FLOPs/group, so in total it is $2 k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$ FLOPs.

If `fma_1flop` is set to `True`, it will be $k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$ FLOPs. The existence of bias does not matter in this case.

Dilated convolution does not change the theoretical computational cost explained above, although it usually affects the actual performance significantly.

Although a source pixel can be read multiple times in the most naive convolution implementation, chainer-computational-cost counts it only once; the memory read is counted as if the entire input data and the parameters (weights, biases) were loaded from memory only once. Padding is ignored, too.

The FMA option can be switched by the `fma_1flop: bool` keyword argument specified to `ComputationalCostHook`.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$|
|FLOPs(no-FMA)|$2 k_h k_w \frac{c_{in} c_{out}}{G} h_{out} w_{out}$|
|FLOPs(no-FMA nobias)|$(2 k_h k_w \frac{c_{in}}{G} - 1) c_{out} h_{out} w_{out}$|
|mread|$\|x\| + k_h k_w \frac{c_{in} c_{out}}{G} + c_{out}$|
|mwrite|$\|y\|$|
|params|conv parameters `k`, `s`, `p`, `d`, `groups`, `nobias`|
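
For reference, here is a rough sketch of these formulas in Python. It assumes Chainer is installed and that `chainer.utils.get_conv_outsize` accepts the dilation argument `d` (an assumption that holds for recent Chainer versions); `conv2d_cost` itself is a made-up helper, not the library's cost calculator:

```python
from chainer.utils import get_conv_outsize

def conv2d_cost(x_shape, c_out, k, s, p, d=1, groups=1, nobias=False, fma_1flop=True):
    c_in, h_in, w_in = x_shape
    h_out = get_conv_outsize(h_in, k, s, p, d=d)
    w_out = get_conv_outsize(w_in, k, s, p, d=d)
    n_w = k * k * (c_in // groups) * c_out        # number of filter weights
    if fma_1flop:
        flops = n_w * h_out * w_out               # one FLOP per multiply-accumulate
    else:
        flops = 2 * n_w * h_out * w_out           # products and sums counted separately (incl. bias)
        if nobias:
            flops -= c_out * h_out * w_out        # remove the per-pixel bias addition
    mread = c_in * h_in * w_in + n_w + (0 if nobias else c_out)
    mwrite = c_out * h_out * w_out
    return flops, mread, mwrite

# e.g. a hypothetical 3x3 convolution applied to a 64x56x56 feature map
print(conv2d_cost((64, 56, 56), 128, k=3, s=1, p=1, fma_1flop=False))
```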

Deconvolution, also known as transposed convolution, can be thought of as going backward through the convolution operation. Each input pixel is multiplied by the convolution filter kernel ($k_h k_w$ values for each output channel). The results are summed up on the output tensor, among adjacent results and the results of the other filters (there are $c_{in}$ filters), and then the bias is added if it exists.

In order to understand the behavior of this operation and why it is called "transposed" convolution, these materials are helpful.

The theoretical computational cost is $2 k_h k_w c_{in} c_{out} h_{in} w_{in}$ FLOPs.

In case `groups` is not 1, similarly to convolution, it becomes $2 k_h k_w \frac{c_{in} c_{out}}{G} h_{in} w_{in}$ FLOPs.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$k_h k_w \frac{c_{in} c_{out}}{G} h_{in} w_{in}$|
|FLOPs(no-FMA)|$2 k_h k_w \frac{c_{in} c_{out}}{G} h_{in} w_{in}$|
|mread|$\|x\| + k_h k_w \frac{c_{in} c_{out}}{G} + c_{out}$|
|mwrite|$\|y\|$|
|params|conv parameters `k`, `s`, `p`, `d`, `groups`, `nobias`|
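
Similarly, a hedged sketch for deconvolution; `chainer.utils.get_deconv_outsize` is used only to obtain the output size for mwrite, and `deconv2d_cost` is a hypothetical helper, not the library's implementation:

```python
from chainer.utils import get_deconv_outsize

def deconv2d_cost(x_shape, c_out, k, s, p, groups=1, nobias=False, fma_1flop=True):
    c_in, h_in, w_in = x_shape
    h_out = get_deconv_outsize(h_in, k, s, p)
    w_out = get_deconv_outsize(w_in, k, s, p)
    n_w = k * k * c_in * (c_out // groups)        # number of filter weights
    flops = n_w * h_in * w_in                     # one multiply(-add) per weight and input position
    if not fma_1flop:
        flops *= 2                                # count multiplies and adds separately
    mread = c_in * h_in * w_in + n_w + (0 if nobias else c_out)
    mwrite = c_out * h_out * w_out
    return flops, mread, mwrite

# e.g. a hypothetical 4x4, stride-2 deconvolution that doubles the spatial size
print(deconv2d_cost((64, 56, 56), 32, k=4, s=2, p=1))
```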

In a Linear (fully-connected) layer, let $n_{in}$ be the input size and $n_{out}$ the output size. Each output value is calculated by $n_{in}$ products and $n_{in} - 1$ sums. So, in case `fma_1flop==False`, it costs $2 n_{in} - 1$ FLOPs/element, or $2 n_{in}$ if there is a bias. In FMA mode it is $n_{in}$ FLOPs/element (regardless of the existence of bias).

The FMA option can be switched by the `fma_1flop: bool` keyword argument specified to `ComputationalCostHook`.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$n_{in} n_{out}$|
|FLOPs(no-FMA)|$2 n_{in} n_{out}$|
|FLOPs(no-FMA no-bias)|$(2 n_{in} - 1) n_{out}$|
|mread|$n_{in} + n_{in} n_{out} + n_{out}$, where $W$ and $b$ are learned parameters|
|mwrite|$n_{out}$|
|params|`nobias`|
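
A minimal sketch of the Linear formulas (the `linear_cost` helper below is illustrative only, not part of the library):

```python
def linear_cost(n_in, n_out, nobias=False, fma_1flop=True):
    """Cost of a fully-connected layer mapping n_in inputs to n_out outputs."""
    if fma_1flop:
        flops = n_in * n_out                                  # one FLOP per multiply-add
    else:
        flops = (2 * n_in - (1 if nobias else 0)) * n_out     # products + sums (+ bias add)
    mread = n_in + n_in * n_out + (0 if nobias else n_out)    # x, W and b
    return flops, mread, n_out

print(linear_cost(4096, 1000, fma_1flop=False))  # e.g. a VGG-style classifier layer
```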

Shift only conducts memory read, index calculation and memory write. There might be unnecessary memory reads around the corners, but for simplicity chainer-computational-cost treats it as just reading the entire data.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|shift parameters `k` and `d`|

## Math

Elementwise Add operation. In case the shapes of the inputs are different, broadcast runs first and then the elementwise arithmetic is conducted. The cost of broadcasting is ignored. For simplicity, it is assumed that all of the $N$ input arrays are first broadcasted to the output size and then the elementwise sum is calculated. The output size is the largest size among the inputs.

|Item|Value|
|:---|:---|
|FLOPs|$(N - 1)\|y\|$|
|mread|$N\|y\|$|
|mwrite|$\|y\|$|
|params|N/A|
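
Under the simplification stated above (every operand treated as if it were already broadcast to the output size), the cost of an $N$-ary elementwise operation can be sketched as follows; `add_cost` is a hypothetical helper, and `np.broadcast_shapes` requires NumPy 1.20 or later:

```python
import numpy as np

def add_cost(shapes):
    """Cost of an N-ary elementwise Add over inputs with the given shapes."""
    out_shape = np.broadcast_shapes(*shapes)
    out = int(np.prod(out_shape))
    flops = (len(shapes) - 1) * out    # one addition per element and extra operand
    mread = len(shapes) * out          # as if every operand were broadcast first
    return flops, mread, out

print(add_cost([(1, 64, 56, 56), (1, 64, 1, 1)]))  # e.g. adding a per-channel bias map
```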

AddConstant is an elementwise Add operation where one operand is a constant (not a `chainer.Variable` but a `numpy.array` or a `cupy.array`).

In case the shapes of the inputs are different, broadcast runs first and then the elementwise arithmetic is conducted. The cost of broadcasting is ignored. The output size is the same as the larger of either the input ($x$) or the operand ($c$).

|Item|Value|
|:---|:---|
|FLOPs|$\|y\|$|
|mread|$\|x\| + \|c\|$|
|mwrite|$\|y\|$|
|params|N/A|

The other elementwise arithmetic operations (Sub, Mul and Div) have the same cost as Add; see the documentation for Add. Likewise, their constant-operand variants have the same cost as AddConstant; see the documentation for AddConstant.

In case `axis` is None, Max just calculates the maximum value of the entire tensor, which costs simply $|x| - 1$ FLOPs.

Here, let's consider a 4-dimensional array whose shape is $(s_0, s_1, s_2, s_3)$. When `axis` is set to `(1, 2)`:

* First it calculates the max over axis 1, which is $(s_1 - 1) s_0 s_2 s_3$ FLOPs, and this makes a tensor shaped $(s_0, s_2, s_3)$.
* Next, the max over the original axis 2 is conducted in $(s_2 - 1) s_0 s_3$ FLOPs, and a $(s_0, s_3)$ tensor remains.

The total FLOPs is just the sum of the above, and the output size is $s_0 s_3$.

Therefore, the FLOPs can be calculated by the following algorithm.

```python
def max_flops(x_shape, axes):
    """Input: x_shape, axes; output: f (FLOPs) and s (output size)."""
    s = 1
    for d in x_shape:
        s *= d          # s starts as the total number of elements |x|
    f = 0
    for i in axes:
        d = x_shape[i]
        s //= d         # elements remaining after reducing this axis
        f += (d - 1) * s
    return f, s
```

|Item|Value|
|:---|:---|
|FLOPs|See the above explanation|
|mread|$\|x\|$|
|mwrite|See the above explanation|
|params|`axis`|
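
For example, applying the `max_flops` sketch above to a hypothetical $(2, 3, 4, 5)$ array reduced over axes 1 and 2:

```python
f, s = max_flops((2, 3, 4, 5), axes=(1, 2))
# axis 1: (3-1)*2*4*5 = 80, then original axis 2: (4-1)*2*5 = 30
assert (f, s) == (110, 10)
```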

The theoretical cost of Min is exactly the same as Max. See the documentation for Max.

The theoretical cost of ArgMax is exactly the same as Min/Max, except that ArgMax can receive only one axis. See the documentation for Max.

The theoretical cost of ArgMin is exactly the same as Min/Max, except that ArgMin can receive only one axis. See the documentation for Max.

Sum over the specified axis (or axes) also costs the same as the Max operation, since it just replaces $\max$ by $+$. See the documentation for Max.

## Normalization

Test-mode batch normalization (FixedBatchNormalization) consists of a normalization part (using $\mu$ and $\sigma$) and a scale-and-bias part ($\gamma$ and $\beta$), both composed of an elementwise scale and shift. However, these can actually be fused into a single scale-and-shift operation. Therefore, regardless of the existence of $\gamma$ and $\beta$, the computational cost is always $2|x|$ FLOPs.

Since a scale-and-shift operation can be done by FMA, it becomes $|x|$ FLOPs if `fma_1flop` is set to `True`.

For the same reason as explained above, reading the learned scale and shift parameters is required only once (not twice) regardless of the existence of bias. Both are 1-dimensional arrays with $c$ elements, where $c$ is the number of channels.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$\|x\|$|
|FLOPs(no-FMA)|$2\|x\|$|
|mread|$\|x\| + 2c$|
|mwrite|$\|x\|$|
|params|`eps`: epsilon for BN|
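
A small sketch of the FixedBatchNormalization cost, assuming an NCHW input whose channel axis is 1 (`fixed_bn_cost` is a made-up helper):

```python
import numpy as np

def fixed_bn_cost(x_shape, fma_1flop=True):
    size = int(np.prod(x_shape))
    c = x_shape[1]                             # channel axis of an NCHW input (assumption)
    flops = size if fma_1flop else 2 * size    # fused scale-and-shift per element
    mread = size + 2 * c                       # input + fused scale and shift vectors
    return flops, mread, size

print(fixed_bn_cost((1, 64, 56, 56)))
```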

Let us assume that `axis` is the channel axis; then each spatial position has a $c$-dimensional vector. Calculating the L2-norm of this vector in no-FMA mode first applies an elementwise square ($c$ FLOPs), then a summation ($c - 1$ FLOPs) and finally a square root ($1$ FLOP), so $2c$ FLOPs in total. In FMA mode it is $c$ FLOPs for the square and sum, then $1$ FLOP for the square root, so $c + 1$ FLOPs in total.

Then $\epsilon$ is added to the L2-norm and an elementwise division is applied, which is another $1 + c$ FLOPs for each spatial position.

Hence, the total cost of L2-normalizing a $(c, h, w)$ array over the channel axis is $hw(3c + 1)$ FLOPs in no-FMA mode, or $2hw(c + 1)$ FLOPs with FMA.

Chainer's NormalizeL2 implementation supports an `axis` with up to 2 elements, but this is undocumented, so chainer-computational-cost assumes that `axis` is a single dimension.

In the table below, a 3-dimensional array with shape $(c, h, w)$ is assumed and `axis` is the channel dimension, but any other shape/axis is treated in the same way.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$2hw(c + 1)$, when the shape of $x$ is $(c, h, w)$ and `axis` is 0|
|FLOPs(no-FMA)|$hw(3c + 1)$, when the shape of $x$ is $(c, h, w)$ and `axis` is 0|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|`axis`|
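
Assuming the $(c, h, w)$/axis-0 accounting above, a sketch of the NormalizeL2 cost (`normalize_l2_cost` is a hypothetical helper):

```python
def normalize_l2_cost(c, h, w, fma_1flop=True):
    """L2 normalization over the channel axis of a (c, h, w) input."""
    if fma_1flop:
        norm = c + 1              # fused square-and-accumulate, plus sqrt
    else:
        norm = 2 * c              # square (c) + sum (c-1) + sqrt (1)
    per_position = norm + 1 + c   # + eps addition + elementwise division
    flops = per_position * h * w
    size = c * h * w
    return flops, size, size

print(normalize_l2_cost(64, 56, 56, fma_1flop=False))
```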

Let $c$, $h$ and $w$ be the shape of $x$, so $|x| = chw$. First, all the values are squared ($|x|$ FLOPs). Then they are integrated along the channel axis ($|x|$ FLOPs). The sum of the local response for each value can then be calculated by a subtraction ($|x|$ FLOPs in total). Then an elementwise multiplication by $\alpha$, an addition of $k$ and a power by $\beta$ are conducted ($3|x|$ FLOPs, or $2|x|$ if `fma_1flop` is set to `True`, as the multiplication and addition can be done by FMA).

In total, the FLOPs is the sum of the above: $6|x|$ with no-FMA and $5|x|$ with FMA.

|Item|Value|
|:---|:---|
|FLOPs(FMA)|$5\|x\|$|
|FLOPs(no-FMA)|$6\|x\|$|
|mread|$\|x\|$|
|mwrite|$\|x\|$|
|params|`n`, `k`, `alpha` and `beta`|

## Pooling

Each output pixel is calculated by averaging $k_h k_w$ elements from the input ($k_h k_w$ FLOPs). The output size is calculated by `chainer.utils.get_conv_outsize`.

|Item|Value|
|:---|:---|
|FLOPs|$k_h k_w \|y\|$|
|mread|$\|x\|$|
|mwrite|$\|y\|$|
|params|AvgPooling parameters `k`, `s` and `p`|
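
A sketch of the AveragePooling2D cost for an NCHW input, using `chainer.utils.get_conv_outsize` for the output size (`avg_pooling_cost` is a made-up helper; for MaxPooling2D the only change would be $k_h k_w - 1$ FLOPs per output pixel):

```python
from chainer.utils import get_conv_outsize

def avg_pooling_cost(x_shape, k, s, p):
    """Cost of average pooling over an (n, c, h, w) input."""
    n, c, h_in, w_in = x_shape
    h_out = get_conv_outsize(h_in, k, s, p)
    w_out = get_conv_outsize(w_in, k, s, p)
    out = n * c * h_out * w_out
    flops = out * k * k                 # k*k-element average per output pixel
    return flops, n * c * h_in * w_in, out

print(avg_pooling_cost((1, 64, 56, 56), k=2, s=2, p=0))
```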

Each output pixel is calculated by taking the max of $k_h k_w$ elements from the input ($k_h k_w - 1$ FLOPs). The output size is calculated by `chainer.utils.get_conv_outsize`.

|Item|Value|
|:---|:---|
|FLOPs|$(k_h k_w - 1)\|y\|$|
|mread|$\|x\|$|
|mwrite|$\|y\|$|
|params|MaxPooling parameters `k`, `s` and `p`|

Upsampling2D only reads the data from memory and writes it to certain positions in the output using the indices; the other output pixels are filled with 0. The indices array always has the same shape as the input. Although its data type is not float but int, their data sizes are usually the same (float32 and int32), so chainer-computational-cost ignores this difference and counts the indices the same as the input.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$2\|x\|$|
|mwrite|$\|y\|$|
|params|Upsampling parameters `k`, `s`, `p`, `outsize` and `cover_all`|

Unpooling2D only reads the data from memory and writes it to certain positions in the output. Unlike Upsampling2D, it does not use indices, and all output pixels are filled with the corresponding pixels in the input tensor.

|Item|Value|
|:---|:---|
|FLOPs|$0$|
|mread|$\|x\|$|
|mwrite|$\|y\|$|
|params|Unpooling parameters `k`, `s`, `p`, `outsize` and `cover_all`|