# Summary
`faer` is a portable high performance dense linear algebra library written in Rust.
The library offers a convenient high level API for performing matrix
decompositions and solving linear systems. This API is built on top of
a lower level API that gives the user more control over the memory allocation
and multithreading settings.

Supported platforms include the ones supported by Rust.
Explicit SIMD instructions are currently used for x86-64 and AArch64 (NEON),
with plans for SVE/SME and RVV optimizations once intrinsics for those are stabilized in Rust,
or possibly earlier if we allow the use of a JIT backend[^1].

The library provides a `Mat` type, allowing for quick and simple construction
and manipulation of matrices, as well as lightweight view types `MatRef` and
`MatMut` for building memory views over existing data.
These views are currently used to represent different kinds of matrices,
such as generic rectangular matrices, as well as square matrices that are
symmetric, Hermitian, or triangular (where only half of the matrix is stored).
In the future, we plan to make use of the robust Rust type system to better
express the properties of those matrices and prevent accidental misuse of the library's API.
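
As a brief illustration of the high level API, the sketch below constructs a matrix, borrows a lightweight view of it, and solves a linear system through an LU decomposition. It assumes the `mat!` macro, `Mat::from_fn`, and the `partial_piv_lu`/`solve` methods as exposed by recent `faer` releases; exact paths and names may differ between versions.

```rust
use faer::{mat, prelude::*, Mat};

fn main() {
    // An owning 3×3 matrix, constructed in place.
    let a = mat![
        [10.0, 2.0, 3.0],
        [2.0, 12.0, 1.0],
        [3.0, 1.0, 9.0_f64],
    ];
    // A 3×1 right-hand side, filled from a closure.
    let b = Mat::from_fn(3, 1, |i, _| i as f64);

    // `as_ref` borrows a lightweight `MatRef` view; no data is copied.
    let a_view = a.as_ref();

    // Solve A x = b through the high level decomposition API.
    let x = a_view.partial_piv_lu().solve(&b);
    println!("{x:?}");
}
```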

Multiple scalar types are supported, and the library code is generic over the
data type. Native floating point types `f32`, `f64`[^2], `c32`, and `c64` are
supported out of the box, as well as any user-defined types that satisfy the
requested interface, such as extended precision real numbers (double-double or
multi-precision floats), complex numbers using the aforementioned types as the
base element, and dual/hyper-dual numbers[^3].
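
Since the algorithms are generic over the scalar type, downstream code can be written once for all supported scalars. The following is a minimal sketch of such generic code, assuming the `ComplexField` trait and the `faer_`-prefixed method names used by recent `faer` releases; exact names may vary between versions.

```rust
use faer::{ComplexField, MatRef};

/// Sum of the diagonal entries, written once for every scalar type
/// that implements faer's `ComplexField` interface.
fn trace<E: ComplexField>(m: MatRef<'_, E>) -> E {
    let n = Ord::min(m.nrows(), m.ncols());
    let mut acc = E::faer_zero();
    for i in 0..n {
        acc = acc.faer_add(m.read(i, i));
    }
    acc
}
```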

[^1]: Inline assembly is not entirely appropriate for our use case, since it is hard to make it generic enough for all the operations and types that we wish to support.

[^2]: IEEE 754-2008, with no implicit `fusedMultiplyAdd` contractions and with slight differences around NaN handling. See the [float semantics](https://github.com/rust-lang/rfcs/pull/3514) RFC for more information.

[^3]: These are supported at least for the simpler matrix decompositions (Cholesky, LU, QR). It is not clear yet how to handle iterative algorithms such as the SVD and eigendecomposition.
# Statement of need
Rust was chosen as a language for the library since it allows full control
over the memory layout of data and exposes low level CPU intrinsics for
SIMD[^4] computations. Additionally, its memory safety features make it a
perfect candidate for writing efficient and parallel code, since the compiler
statically checks for errors that are common in other low level languages,
such as data races and fatal use-after-free errors.

Rust is also compatible with the C ABI, allowing for simple interoperability
with C, and with most other languages by extension. Once a design has been
properly fleshed out, we plan to expose a C API, along with bindings to other
languages (currently planned are C, C++, Python, and Julia bindings).

Aside from `faer`, the Rust ecosystem lacks high performance matrix factorization
libraries that aren't C library wrappers, which presents a distribution
challenge and can impede generic programming.

[^4]: Single instruction, multiple data: operations that CPUs can use to parallelize data processing at the instruction level.
# Features
`faer` exposes a central `Entity` trait that allows users to describe how their
data should be laid out in memory. For example, native floating point types are
laid out contiguously in memory to make use of SIMD instructions that prefer this layout,
while complex types have the option of either being laid out contiguously or in a split format.
The latter is also called a zomplex data type in CHOLMOD (@cholmod).
An example of a type that benefits immensely from this is the double-double type, which is
composed of two `f64` components, stored in separate containers. This separate
storage scheme allows us to load each chunk individually into a SIMD register,
opening new avenues for generic vectorization.
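
To make the split layout concrete, the sketch below stores a vector of double-double numbers as two separate `f64` buffers. It illustrates the storage scheme only; the `DoubleDoubleVec` type is hypothetical and is not `faer`'s actual `Entity` machinery.

```rust
/// Hypothetical type for illustration; not part of faer's API.
/// A vector of double-double numbers in split (structure-of-arrays) form:
/// the high and low components live in separate contiguous buffers,
/// so each one can be loaded into SIMD registers on its own.
struct DoubleDoubleVec {
    hi: Vec<f64>, // leading components
    lo: Vec<f64>, // trailing error components
}

impl DoubleDoubleVec {
    /// Negates every element in place. Negating a double-double negates
    /// both of its components, so each buffer is processed by a simple,
    /// easily vectorized loop over contiguous memory.
    fn neg_in_place(&mut self) {
        for x in &mut self.hi {
            *x = -*x;
        }
        for x in &mut self.lo {
            *x = -*x;
        }
    }
}
```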

The library generically implements algorithms for matrix multiplication, based
on the approach of @BLIS1. For native types, `faer` uses explicit SIMD
instructions, dispatching at runtime to one of several precompiled variants
depending on the detected CPU features.
An interesting alternative would be to compile the code just in time, which
could improve compilation times and reduce binary size. But there are also
possible downsides that have to be weighed against these advantages, such as
an increased startup time while the code is optimized and assembled, as well
as the gap in maturity between ahead-of-time compilation (currently backed by
LLVM) and just-in-time compilation, for which the Rust ecosystem is still
developing.
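
The sketch below shows the general shape of this kind of runtime dispatch, using the standard library's `is_x86_feature_detected!` macro; it is a simplified illustration of the technique, not `faer`'s internal dispatch machinery.

```rust
/// Sums a slice of `f64`, dispatching at runtime to the widest
/// SIMD variant the current CPU supports.
pub fn sum(data: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: we just verified that AVX2 is available.
            return unsafe { sum_avx2(data) };
        }
    }
    sum_fallback(data)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[f64]) -> f64 {
    // `target_feature` lets the compiler emit AVX2 instructions for this
    // function alone, while the rest of the binary stays portable.
    data.iter().sum()
}

fn sum_fallback(data: &[f64]) -> f64 {
    data.iter().sum()
}
```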

The library then uses matrix multiplication as a building block to implement commonly used matrix
decompositions, based on state of the art algorithms in order to guarantee
numerical robustness:
- Cholesky (LLT, LDLT and Bunch-Kaufman LDLT),
- LU (with partial or full pivoting),
- QR (with or without column pivoting),
- SVD (with or without singular vectors),
- eigenvalue decomposition (with or without eigenvectors).

For algorithms that are memory-bound and don't make much use of matrix multiplication,
`faer` uses optimized fused kernels[^5]. This can immensely improve the performance of the
QR decomposition with column pivoting, the LU decomposition with full pivoting,
as well as the reduction to condensed form to prepare matrices for the SVD or
eigenvalue decomposition, as described by @10.1145/2382585.2382587.
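
As a concrete illustration of such a fused kernel, the sketch below computes both $A x$ and $A^\top y$ in a single pass over a column-major matrix, so every element of $A$ is read from memory only once. It is a simplified scalar version of the idea, not `faer`'s actual kernel.

```rust
/// Computes `ax = A * x` and `aty = Aᵀ * y` in one sweep over `a`,
/// where `a` holds an m×n matrix in column-major order.
fn fused_gemv(
    m: usize,
    n: usize,
    a: &[f64],       // column-major, length m * n
    x: &[f64],       // length n
    y: &[f64],       // length m
    ax: &mut [f64],  // length m, assumed zero-initialized
    aty: &mut [f64], // length n
) {
    assert_eq!(a.len(), m * n);
    for j in 0..n {
        let col = &a[j * m..(j + 1) * m];
        let mut dot = 0.0;
        for i in 0..m {
            // A single load of `col[i]` feeds both outputs.
            ax[i] += col[i] * x[j];
            dot += col[i] * y[i];
        }
        aty[j] = dot;
    }
}
```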

To achieve high performance parallelism, `faer` uses the Rayon library (@rayon) as a
backend, and has shown to be competitive with other frameworks such as OpenMP (@chandra2001parallel)
and Intel Thread Building Blocks (@tbb).
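
The degree of parallelism can be tuned through the lower level settings mentioned earlier. The following is a sketch assuming the `set_global_parallelism` function and `Parallelism` enum present in recent `faer` releases; the exact API may differ between versions.

```rust
use faer::Parallelism;

fn main() {
    // Disable multithreading, e.g. to measure single-threaded performance.
    faer::set_global_parallelism(Parallelism::None);

    // ... run decompositions here ...

    // Re-enable the Rayon backend; by convention, 0 threads means
    // "use Rayon's default thread count".
    faer::set_global_parallelism(Parallelism::Rayon(0));
}
```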

[^5]: For example, computing $A x$ and $A^\top y$ with a single pass over $A$, rather than two.
# Performance
Here we present the benchmarks for a representative subset of operations that
0 commit comments