Matrix Multiplication Optimization

Comparison between the naive algorithm, manual optimisations and a BLAS implementation.

C = A × B × At + Bt × Bt

where At and Bt are the transposes of A and B, and A is an upper triangular matrix.

Implementation

Neopt

This is the naive implementation of the algorithm. I transpose A and B with the simple mathematical formula At[j][i] = A[i][j]. Because A is an upper triangular matrix, when transposing it I can skip straight to the i-th column, where i is the row index, since everything below the diagonal is zero.
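A minimal sketch of this step, assuming N×N row-major double matrices (the function and variable names are illustrative, not the repository's):

```c
// Transpose the upper triangular A. Only entries on or above the
// diagonal are nonzero, so row i starts at column i.
// At is assumed zero-initialised (e.g. via calloc).
void transpose_upper(int N, const double *A, double *At) {
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++)
            At[j * N + i] = A[i * N + j];
}
```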

First I calculate the product of A and B and store it in Result. Again, since A is upper triangular and sits on the left, the summation for row i can start at k = i, because A[i][k] = 0 for k < i.
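A sketch of this product under the same assumptions:

```c
// Result = A * B, where A is upper triangular.
void mul_upper_left(int N, const double *A, const double *B,
                    double *Result) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = i; k < N; k++)   // skip A's zero entries
                sum += A[i * N + k] * B[k * N + j];
            Result[i * N + j] = sum;
        }
}
```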

Afterwards I multiply Result by At. Because A is upper triangular, I don't need another matrix to store this product; I can overwrite Result in place. A sits on the right in this operation, but it is transposed, so the exact same column-skipping optimisation applies.
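A sketch of the in-place step (assumed names again). Overwriting Result is safe here because element (i, j) only reads Result[i][k] for k ≥ j, so iterating j upwards never reads a value that was already overwritten:

```c
// Result = Result * At, in place. At is lower triangular
// (At[k][j] == 0 for k < j), so the sum starts at k = j.
void mul_at_right_inplace(int N, const double *At, double *Result) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = j; k < N; k++)
                sum += Result[i * N + k] * At[k * N + j];
            Result[i * N + j] = sum;  // later columns only read k > j
        }
}
```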

Finally, I calculate Bt × Bt and add it to Result.
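And a sketch of the final accumulation, using the explicitly transposed Bt built earlier:

```c
// Result += Bt * Bt (full dense product, no structure to exploit).
void add_bt_bt(int N, const double *Bt, double *Result) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += Bt[i * N + k] * Bt[k * N + j];
            Result[i * N + j] += sum;
        }
}
```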

Opt

I use block matrix multiplication for all three products. In addition, I work with raw pointers instead of indexing the matrices element by element, and I accumulate each partial sum in a local variable instead of repeatedly writing to the result matrix.
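A sketch of the three ideas together, with an assumed tile size BLOCK that divides the matrix size (the repository's actual block size and names may differ):

```c
#define BLOCK 40   // assumed tile size; must divide N here

// Blocked C += A * B with pointer arithmetic and a local accumulator.
void mul_blocked(int N, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++) {
                        const double *pa = &A[i * N + kk]; // walk A's row
                        const double *pb = &B[kk * N + j]; // walk B's column
                        double sum = 0.0;                  // local accumulator
                        for (int k = 0; k < BLOCK; k++, pa++, pb += N)
                            sum += *pa * *pb;
                        C[i * N + j] += sum;
                    }
}
```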

The transpositions are no longer necessary: I can simply access the elements of the original matrices in a different order.

Since I am working with blocks, the upper triangular optimisation changes slightly: instead of always skipping to the i-th column, I skip to it only when it falls inside the current block, i.e. the inner index starts at max(block start, i).
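A sketch of how the skip interacts with blocking (names assumed, BLOCK as above); when an entire block lies below the diagonal, the inner loop simply runs zero times:

```c
// Blocked Result += A * B with A upper triangular: within the block
// starting at kk, the summation for row i begins at max(kk, i).
void mul_blocked_upper(int N, const double *A, const double *B,
                       double *Result) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++) {
                    int k_min = (i > kk) ? i : kk;
                    for (int j = jj; j < jj + BLOCK; j++) {
                        double sum = 0.0;
                        for (int k = k_min; k < kk + BLOCK; k++)
                            sum += A[i * N + k] * B[k * N + j];
                        Result[i * N + j] += sum;
                    }
                }
}
```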

For the second multiplication (the one with At), I need a temporary matrix to store the result. The blocked algorithm builds each output element from partial sums accumulated across several block iterations, so an element of Result would be overwritten while later blocks still need its original value; the in-place trick from the naive version no longer works.

I also use loop unrolling. For Bt × Bt I unroll the inner loop by 4, which is straightforward. Since k_max - kk is always equal to block_size, that loop could even be removed entirely and the computation manually repeated 40 times to further improve performance.
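A sketch of the unroll-by-4 idea on the Bt × Bt product, shown without the blocking for brevity and reading B directly instead of a transposed copy (assumes 4 divides the trip count, as it does when the trip count equals block_size):

```c
// Result += Bt * Bt with the inner loop unrolled by 4.
// Bt[i][k] == B[k][i] and Bt[k][j] == B[j][k], so no copy is needed.
void add_bt_bt_unrolled(int N, const double *B, double *Result) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k += 4)
                sum += B[k * N + i]       * B[j * N + k]
                     + B[(k + 1) * N + i] * B[j * N + k + 1]
                     + B[(k + 2) * N + i] * B[j * N + k + 2]
                     + B[(k + 3) * N + i] * B[j * N + k + 3];
            Result[i * N + j] += sum;
        }
}
```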

For the multiplications that involve A, I also applied loop unrolling, but the upper triangular optimisation caused me some problems, because the inner loop no longer has a fixed trip count. I therefore treat three cases separately, depending on k_min: not divisible by 2 (no unrolling), divisible by 4 (unroll by 4), and divisible by 2 but not by 4 (unroll by 2). A more general solution surely exists, but I couldn't find it. A sketch of the dispatch is below.
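In this sketch, a and b are hypothetical pointers to the row and column slices being reduced; since k_max is block-aligned, the divisibility of k_min determines the divisibility of the trip count:

```c
// Dispatch on k_min: pick the widest unroll whose stride divides
// the trip count k_max - k_min (a and b are illustrative pointers).
double sum = 0.0;
if (k_min % 4 == 0) {
    for (int k = k_min; k < k_max; k += 4)        // unroll by 4
        sum += a[k] * b[k] + a[k + 1] * b[k + 1]
             + a[k + 2] * b[k + 2] + a[k + 3] * b[k + 3];
} else if (k_min % 2 == 0) {
    for (int k = k_min; k < k_max; k += 2)        // unroll by 2
        sum += a[k] * b[k] + a[k + 1] * b[k + 1];
} else {
    for (int k = k_min; k < k_max; k++)           // no unrolling
        sum += a[k] * b[k];
}
```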

Blas

The BLAS implementation is straightforward. I use the following BLAS operations:

  1. B <- alpha * A × B
  2. B <- alpha * B × At
  3. C <- alpha * Bt × Bt + beta * C

Usage:

  1. I first copy B into Result (itself done with a BLAS routine) so that Result can act as the in/out operand of the next calls. I use dtrmm for the computation because A is upper triangular; I mark A as being on the left side, set alpha to 1, and flag A as non-unit triangular.

  2. Same as 1, but A is marked as being on the right side and in its transposed form.

  3. I use dgemm, the general matrix multiplication routine. Both B operands are marked as transposed, ordinary (general) matrices, and alpha and beta are both set to 1. A sketch of all three calls follows below.
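A minimal sketch of the three calls through the CBLAS interface, assuming N×N row-major doubles (the repository may use the Fortran interface or different names instead):

```c
#include <cblas.h>

// Result <- A * B * At + Bt * Bt, with A upper triangular.
void solve_blas(int N, const double *A, const double *B, double *Result) {
    // Copy B into Result so it can act as dtrmm's in/out operand.
    cblas_dcopy(N * N, B, 1, Result, 1);

    // 1. Result <- 1.0 * A * Result   (A on the left, upper, non-unit)
    cblas_dtrmm(CblasRowMajor, CblasLeft, CblasUpper, CblasNoTrans,
                CblasNonUnit, N, N, 1.0, A, N, Result, N);

    // 2. Result <- 1.0 * Result * At  (A on the right, transposed)
    cblas_dtrmm(CblasRowMajor, CblasRight, CblasUpper, CblasTrans,
                CblasNonUnit, N, N, 1.0, A, N, Result, N);

    // 3. Result <- 1.0 * Bt * Bt + 1.0 * Result
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasTrans, N, N, N,
                1.0, B, N, B, N, 1.0, Result, N);
}
```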

Results

Time

As the graphs show, the naive implementation is the slowest. A manual check of the graph data shows that M_size[i+1] / M_size[i] ≈ ∛(time[i+1] / time[i]); for example, doubling the matrix size increases the runtime roughly eightfold. That means the time complexity is O(n³).

The optimised version is faster by a significant amount, but the BLAS implementation is by far the fastest. These results are to be expected. Again, a manual check of the data shows that the complexity is still O(n³).

Memory is the main bottleneck, so my optimisations are very effective, but there is only so much that can be done without changing the algorithm. A manual check suggests the complexity is still O(n³): the optimisations improve the constant factor, not the asymptotic behaviour.

BLAS is the fastest because it is a highly optimised library that uses vectorised (SIMD) operations. It probably applies every optimisation the compiler would at -O3, plus many additional manual ones.

Cachegrind

As expected, neopt has a lot of cache misses.

The optimised version has far fewer. This is because its main optimisation, block matrix multiplication, is specifically designed to reduce cache misses.

I noticed that fusing the Bt × Bt addition and the At multiplication into the same loop increased the running time, probably because of the extra cache misses it causes.

Surprisingly, the BLAS implementation has slightly fewer data accesses and slightly fewer first-level cache misses, but slightly more last-level cache misses. The conclusion is that the first two matter more and are what drives performance here.

I can also see that BLAS has far fewer branch mispredictions, and far fewer branches overall, which dramatically improves performance. The number of instruction cache references is also much lower.
