# Summary

`faer` is a portable high-performance dense linear algebra library written in Rust.
The library offers a convenient high-level API for performing matrix
decompositions and solving linear systems. This API is built on top of
a lower-level API that gives the user more control over memory allocation
and multithreading settings.
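
As a minimal sketch of what the high-level API looks like in practice (the `mat!` macro and the `partial_piv_lu`/`solve` names follow one recent release of `faer` and may differ in others):

```rust
use faer::prelude::*;
use faer::mat;

fn main() {
    // Build a small linear system A x = b.
    let a = mat![
        [4.0, 2.0],
        [2.0, 3.0f64],
    ];
    let b = mat![[6.0], [5.0f64]];

    // Factor once, then solve. The lower-level API additionally lets the
    // caller provide workspace memory and control the thread count.
    let lu = a.partial_piv_lu();
    let x = lu.solve(&b);
    assert!((x.read(0, 0) - 1.0).abs() < 1e-12);
}
```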

Supported platforms include those supported by Rust.
Explicit SIMD instructions are currently used on x86-64 and AArch64 (NEON),
with plans for SVE/SME and RVV optimizations once intrinsics for those are stabilized in Rust,
or possibly earlier if we allow the use of a JIT backend[^1].

The library provides a `Mat` type, allowing for quick and simple construction
and manipulation of matrices, as well as lightweight view types `MatRef` and
`MatMut` for building memory views over existing data.
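
For illustration, a small sketch of constructing a matrix and borrowing views over it (the constructor and accessor names, such as `from_fn` and `read`/`write`, are taken from one version of the API and may vary across releases):

```rust
use faer::{Mat, MatMut, MatRef};

fn main() {
    // An owning 3×3 matrix, filled from a closure over (row, column).
    let mut m: Mat<f64> = Mat::from_fn(3, 3, |i, j| (i + 2 * j) as f64);

    // A lightweight, non-owning view over the same storage.
    let view: MatRef<'_, f64> = m.as_ref();
    let s = view.read(0, 0) + view.read(1, 1);

    // A mutable view allows in-place modification without reallocation.
    let mut view_mut: MatMut<'_, f64> = m.as_mut();
    view_mut.write(0, 0, s);
}
```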

These views are currently used to represent different kinds of matrices,
such as generic rectangular matrices, as well as symmetric, Hermitian, and
triangular square matrices (where only half of the matrix is stored).
In the future, we plan to make use of Rust's robust type system to better
express the properties of those matrices and prevent accidental misuse of the library's API.

Multiple scalar types are supported, and the library code is generic over the
data type. Native floating-point types `f32`, `f64`[^2], `c32`, and `c64` are
supported out of the box, as well as any user-defined types that satisfy the
requested interface, such as extended-precision real numbers (double-double or
multi-precision floats), complex numbers using the aforementioned types as the
base element, and dual/hyper-dual numbers[^3].
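
As a brief illustration of this genericity (import paths for the complex types have moved between releases, as noted in the comment):

```rust
use faer::Mat;
// In some releases these complex types live at `faer::complex_native::{c32, c64}`.
use faer::c64;

fn main() {
    // The same generic container, instantiated at a real and a complex scalar.
    let real = Mat::<f64>::zeros(4, 4);
    let cplx = Mat::<c64>::zeros(4, 4);
    assert_eq!(real.nrows(), cplx.nrows());
}
```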

[^1]: Inline assembly is not entirely appropriate for our use case, since it is hard to make it generic enough for all the operations and types that we wish to support.
[^2]: IEEE 754-2008, with no implicit `fusedMultiplyAdd` contractions and with slight differences around NaN handling. See the [float semantics](https://github.com/rust-lang/rfcs/pull/3514) RFC for more information.
[^3]: These are supported at least for the simpler matrix decompositions (Cholesky, LU, and QR). It is not yet clear how to handle iterative algorithms such as the SVD and eigendecomposition.

# Statement of need

Rust was chosen as the language for the library since it allows full control
over the memory layout of data and exposes low-level CPU intrinsics for
SIMD[^4] computations. Additionally, its memory safety features make it a
strong candidate for writing efficient and parallel code, since the compiler
statically checks for errors that are common in other low-level languages,
such as data races and use-after-free errors.

Rust is also compatible with the C ABI, allowing for simple interoperability
with C, and with most other languages by extension. Once a design has been properly fleshed out,
we plan to expose a C API, along with bindings to other languages (currently
planned: C, C++, Python, and Julia).
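
As a sketch of the general mechanism rather than the planned `faer` C API, whose design is not yet fixed, a Rust crate compiled as a `cdylib` can expose C-compatible entry points like the hypothetical function below:

```rust
// Hypothetical, illustrative export; `faer`'s actual C API is still being designed.
// Compiling the crate as a `cdylib` yields a shared library callable from C, and
// by extension from C++, Python (ctypes/cffi), Julia (ccall), and so on.

/// Scales the `n`-element vector `x` by `alpha` in place.
///
/// # Safety
/// `x` must point to `n` valid, initialized `f64` values.
#[no_mangle]
pub unsafe extern "C" fn example_scale_f64(x: *mut f64, n: usize, alpha: f64) {
    for i in 0..n {
        *x.add(i) *= alpha;
    }
}
```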

Aside from `faer`, the Rust ecosystem lacks high-performance matrix factorization
libraries that aren't C library wrappers, which presents a distribution
challenge and can impede generic programming.

[^4]: Single instruction, multiple data: operations that CPUs can use to parallelize data processing at the instruction level.

# Features

`faer` exposes a central `Entity` trait that allows users to describe how their
data should be laid out in memory. For example, native floating-point types are
laid out contiguously in memory to make use of SIMD instructions that prefer this layout,
while complex types have the option of either being laid out contiguously or in a split format.
The latter is also called a zomplex data type in CHOLMOD (@cholmod).
An example of a type that benefits immensely from this is the double-double type, which is
composed of two `f64` components, stored in separate containers. This separate
storage scheme allows us to load each chunk individually into a SIMD register,
opening new avenues for generic vectorization.
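
To make the split layout concrete, the following schematic (not `faer`'s actual `Entity` machinery) contrasts interleaved and split storage for complex data:

```rust
// Interleaved layout: real and imaginary parts alternate in one container.
struct InterleavedComplex {
    data: Vec<[f64; 2]>, // [re, im] per element
}

// Split ("zomplex"-style) layout: the parts live in separate containers, so
// each can be loaded straight into a SIMD register without a deinterleaving
// shuffle. A double-double type can store its high and low `f64` components
// in the same fashion.
struct SplitComplex {
    re: Vec<f64>,
    im: Vec<f64>,
}

// Pointwise accumulate `acc += a * b`, sweeping both component slices once.
fn mul_acc(acc: &mut SplitComplex, a: &SplitComplex, b: &SplitComplex) {
    for i in 0..a.re.len() {
        let (ar, ai) = (a.re[i], a.im[i]);
        let (br, bi) = (b.re[i], b.im[i]);
        acc.re[i] += ar * br - ai * bi;
        acc.im[i] += ar * bi + ai * br;
    }
}
```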

The library generically implements algorithms for matrix multiplication, based
on the approach of @BLIS1. For native types, `faer` uses explicit SIMD and
dispatches at runtime to one of several precompiled variants of each operation,
depending on the detected CPU features.
An interesting alternative would be to compile the code just-in-time, which could
improve compilation times and reduce binary size. But these advantages have to be
weighed against possible downsides, such as an increased startup time spent
optimizing and assembling the code, as well as the gap in maturity between
ahead-of-time compilation (currently backed by LLVM) and just-in-time compilation,
for which the Rust ecosystem is still developing.
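
The following is a minimal sketch of this precompiled-variant dispatch using the standard library's runtime feature detection; it shows the general pattern rather than `faer`'s internal machinery:

```rust
// Sum a slice, dispatching at runtime to a precompiled AVX2+FMA variant when
// the CPU supports it, with a portable fallback otherwise.

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn sum_avx2(x: &[f64]) -> f64 {
    // Compiled with AVX2/FMA enabled, so the loop can be vectorized accordingly.
    x.iter().sum()
}

fn sum_fallback(x: &[f64]) -> f64 {
    x.iter().sum()
}

pub fn sum(x: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        // SAFETY: the required CPU features were detected at runtime.
        return unsafe { sum_avx2(x) };
    }
    sum_fallback(x)
}

fn main() {
    println!("{}", sum(&[1.0, 2.0, 3.0]));
}
```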

The library then uses matrix multiplication as a building block to implement commonly
used matrix decompositions, based on state-of-the-art algorithms in order to guarantee
numerical robustness:
- Cholesky (LLT, LDLT, and Bunch-Kaufman LDLT),
- QR (with or without column pivoting),
- LU (with partial or full pivoting),
- SVD (with or without singular vectors),
- eigenvalue decomposition (with or without eigenvectors).

For algorithms that are memory-bound and don't make much use of matrix multiplication,
`faer` uses optimized fused kernels[^5]. This can immensely improve the performance of the
QR decomposition with column pivoting, the LU decomposition with full pivoting,
as well as the reduction to condensed form to prepare matrices for the SVD or
eigenvalue decomposition, as described by @10.1145/2382585.2382587.
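
To illustrate the fusion idea from the footnote with a schematic (not `faer`'s optimized kernel), the function below computes $A x$ and $A^T y$ while reading $A$ only once, which is the saving that matters for memory-bound operations:

```rust
// Fused kernel sketch: for a column-major n×n matrix `a`, compute
// u = A x and v = Aᵀ y in a single sweep over A's memory.
fn fused_gemv(n: usize, a: &[f64], x: &[f64], y: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let mut u = vec![0.0; n]; // accumulates A x
    let mut v = vec![0.0; n]; // accumulates Aᵀ y
    for j in 0..n {
        let col = &a[j * n..(j + 1) * n]; // column j, loaded once
        let xj = x[j];
        let mut dot = 0.0;
        for i in 0..n {
            u[i] += col[i] * xj;  // contribution to (A x)[i]
            dot += col[i] * y[i]; // contribution to (Aᵀ y)[j]
        }
        v[j] = dot;
    }
    (u, v)
}

fn main() {
    // Column-major 2×2 example: A = [[1, 3], [2, 4]].
    let a = [1.0, 2.0, 3.0, 4.0];
    let (u, v) = fused_gemv(2, &a, &[1.0, 1.0], &[1.0, 0.0]);
    assert_eq!(u, vec![4.0, 6.0]); // A x
    assert_eq!(v, vec![1.0, 3.0]); // Aᵀ y
}
```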

To achieve high-performance parallelism, `faer` uses the Rayon library (@rayon) as a
backend, and has been shown to be competitive with other frameworks such as OpenMP
(@chandra2001parallel) and Intel Threading Building Blocks (@tbb).
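
As a small illustration of the fork-join style of parallelism that Rayon provides (illustrative only, not `faer`'s internal task structure):

```rust
use rayon::prelude::*;

fn main() {
    // Scale each column of a column-major matrix in parallel;
    // `par_chunks_mut` splits the columns across Rayon's thread pool.
    let (nrows, ncols) = (4, 3);
    let mut a = vec![1.0f64; nrows * ncols];
    a.par_chunks_mut(nrows).enumerate().for_each(|(j, col)| {
        for v in col.iter_mut() {
            *v *= (j + 1) as f64;
        }
    });
    assert_eq!(a[nrows], 2.0); // first entry of column 1
}
```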

[^5]: For example, computing $A x$ and $A^T y$ in a single pass over $A$, rather than two.
# Performance

Here we present the benchmarks for a representative subset of operations that
