64-bit floats give you ~16 base-10 digits of precision; that is, the roundoff error is on the order of $10^{-16}$ relative to the value
there are as many floats in $[0, 1]$ as there are in $[1, \infty)$
(jhg: can you shift computations to be around 0?)
there are fixed-point numbers. why not use them? why is floating point designed this way?
floating point representation:
[sign] [mantissa] [exponent]
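A quick way to poke at this layout in Julia (in the stored bit pattern the exponent bits come before the mantissa bits: 1 sign bit, 11 exponent bits, 52 mantissa bits):

```julia
# Inspect the IEEE 754 bit layout of a Float64: sign | exponent | mantissa
x = 1.5
bits = bitstring(x)                              # 64-character string of 0s and 1s
println(bits[1], " | ", bits[2:12], " | ", bits[13:end])
```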
machine epsilon
e = eps(1.)
e, 1. + (e * .5), 1. + (e * .50001)
example: subtraction:
$$1.512832\ldots - 1.512848\ldots = -0.000016\ldots$$
you lose 5 digits of precision!
e = 1e-10 * rand()
@show e
@show 1 + e
computation of derivatives: look at finite differencing:
really neat plot of numerical error as epsilon shrinks
best you can ever do with finite differencing is ~7 base-10 digits of accuracy
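A small sketch of this tradeoff (the function and point are arbitrary): the forward-difference error shrinks with ε until the cancellation in the subtraction takes over, bottoming out near ε ≈ √eps ≈ 1e-8.

```julia
# Forward-difference derivative of sin at x = 1: truncation error falls with ε,
# but cancellation error grows, so the total error is minimized around ε ≈ 1e-8.
f(x) = sin(x)
x = 1.0
for ε in 10.0 .^ (-1:-1:-16)
    fd = (f(x + ε) - f(x)) / ε
    println("ε = ", ε, "   error = ", abs(fd - cos(x)))
end
```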
this is alarming! with jacobian + factoring, we’re down to 3 digits of accuracy for each step!
we need to never take a small number and add it to a large number. can we do differentiation that keeps big/small in separate dimensions?
simplest multi-dimensional number? a complex number!
okay, let's say we've got a function $f$ and we take a small step $h$ in the imaginary direction.

expand out a taylor series in complex numbers:

$$f(x + ih) = f(x) + ih f'(x) - \frac{h^2}{2} f''(x) - \frac{ih^3}{6} f'''(x) + \cdots$$

rearrange, divide $h$ out:

$$\frac{\operatorname{Im} f(x + ih)}{h} = f'(x) - \frac{h^2}{6} f'''(x) + \cdots$$

okay, take

$$f'(x) \approx \frac{\operatorname{Im} f(x + ih)}{h}$$

basically, this makes the big and small numbers never interact! (the real and imaginary parts are stored separately, so $f(x)$ never cancels against the derivative term)
why does this work? we take a step in a direction that’s orthogonal to the real numbers.
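Here's a minimal sketch of the complex-step trick in Julia (the function and step size are chosen just for illustration):

```julia
# Complex-step differentiation: take an imaginary step, read off the imaginary part.
# No subtraction of nearly-equal numbers, so h can be absurdly small without roundoff.
f(x) = x^2 * sin(x)
x = 1.3
h = 1e-20
deriv = imag(f(x + im*h)) / h
exact = 2x*sin(x) + x^2*cos(x)
@show deriv - exact          # ≈ machine epsilon, even though h = 1e-20
```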
okay, can we do this in higher dimensions?
this can be made rigorous with nonstandard analysis, there’s a good book about it
formally set $\epsilon^2 = 0$ (with $\epsilon \neq 0$)
represent a function and its derivative as a dual number: $f(a + b\epsilon) = f(a) + b f'(a)\,\epsilon$
now, because we're basing this on nonstandard analysis, just treat $\epsilon$ as an actual infinitesimal number rather than a "term to be dropped"
(jhg: basically a different way of approaching the other AD lecture, using nonstandard analysis instead of dropping higher-order terms of a taylor expansion)
can define operations on dual numbers; can then find derivatives via algebra, discarding the $\epsilon^2$ terms (which are formally zero)
break things down to primitives; can just use the known derivative rule for each primitive (sin, exp, *, +, …) and the chain rule falls out of the arithmetic
basically: you’re using the compiler to raise (/ lower?) your functions to dual numbers
(jhg: wait, how do you take derivatives of functions in julia again?)
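A minimal sketch of what this looks like in Julia, with a hand-rolled Dual type (a stripped-down stand-in for what ForwardDiff.jl does):

```julia
# Dual number: carries a value and the coefficient of ε, with ε² = 0 baked into *
struct Dual{T}
    val::T   # primal value
    der::T   # derivative part
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.val*b.der + a.der*b.val)  # drop the ε² term
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

g(x) = x*x + sin(x)
d = g(Dual(1.3, 1.0))    # seed the derivative part with 1.0
@show d.der              # ≈ 2*1.3 + cos(1.3)
# in practice: ForwardDiff.derivative(g, 1.3)
```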
For a function of several variables, e.g. $f(x, y) = x^2 \sin(y)$, we can compute the directional derivative $∇f(x_0, y_0) ⋅ (a, b)$, where $(a, b)$ is the direction of the perturbation. We can directly do this using the same simple Dual numbers as above, using the same ε in both arguments:
\begin{align*}
f(x_0 + aε, y_0 + bε) &= (x_0 + aε)^2 \sin(y_0 + bε) \\
&= x_0^2 \sin(y_0) + ε\left[2 a x_0 \sin(y_0) + b x_0^2 \cos(y_0)\right] + o(ε)
\end{align*}
…So we have indeed calculated the directional derivative $∇f(x_0, y_0) ⋅ (a, b)$: it appears as the coefficient of ε.
To calculate the derivative in a different direction: you could just do this multiple times, but a better way is to introduce multiple ε components (one per direction), e.g. compute $f(x_0 + aε_1 + cε_2,\; y_0 + bε_1 + dε_2)$ with $ε_i ε_j = 0$…
…then you get the derivatives in the directions $(a, b)$ and $(c, d)$ from a single evaluation.
In particular, if we wish to calculate the gradient itself, $∇f(x_0, y_0)$, we need to calculate both partial derivatives, which corresponds to two directional derivatives, in the directions $(1, 0)$ and $(0, 1)$, respectively.
Note that another representation of the directional derivative is $f'(x)v$, where $f'(x)$ is the Jacobian (for scalar $f$, the gradient as a row vector) and $v$ is the direction.
Written out in the standard basis, the $j$-th component is $\sum_i \frac{\partial f_j}{\partial x_i} v_i$.
Now write out what $f'(x)v$ computes: it is exactly the vector of directional derivatives of each output component.
**The primitive action of forward-mode AD is $f'(x)v$!**
This is also known as a Jacobian-vector product, or jvp for short.
We can thus represent vector calculus with multidimensional dual numbers as
follows. Let $d = x + ε v$, where $x ∈ \mathbb{R}^n$ is the point of evaluation and $v ∈ \mathbb{R}^n$ is the seed direction; i.e. evaluating $f$ at $d$ calculates the result of $f(x)$ together with the directional derivative in the direction $v$. For a function $f = (f_1, \ldots, f_m) : \mathbb{R}^n \to \mathbb{R}^m$, then
\begin{align}
f(x + ε v) &= (f_1(x + ε v), \ldots, f_m(x + ε v)) \\
&= (f_1(x) + ε[∇ f_1(x) ⋅ v], \ldots, f_m(x) + ε[∇ f_m(x) ⋅ v]) \\
&= f(x) + [f'(x) ⋅ v] ε,
\end{align}
To calculate the complete Jacobian, we calculate these directional derivatives
in the standard basis directions $e_i$ for $i = 1, \ldots, n$. Seeding a multidimensional dual number with all $n$ basis vectors at once (one ε component per direction)
computes all columns of the Jacobian simultaneously.
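In Julia this is essentially what ForwardDiff.jl's chunked duals do; a quick sketch (the function here is made up):

```julia
using ForwardDiff

f(x) = [x[1]^2 * sin(x[2]), x[1] + x[2]^3]
x0 = [1.0, 2.0]
J = ForwardDiff.jacobian(f, x0)   # seeds the basis directions as ε components
```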
higher-order derivatives: add more epsilons! can make hyperduals! woo! (it’s actually really hard to do algebra once you’re working with that many terms…)
that is the most general version of forward-mode AD
this is a formally correct result, we’re not thinking of this just in terms of “dropping higher order terms”
note: pushing things through to implementations of “primitive functions” can be done because julia has its own libm implementation! not as good as some proprietary ones (.5x performance), but makes things reproducible, even across hardware platforms like GPU
forward-mode AD at compile time is solved, reverse-mode AD is much harder
[we can think of this as selecting special seed directions $v$ for the dual parts]
We have previously shown how to solve non-stiff ODEs via optimized Runge–Kutta methods, but we ended by showing that there is a fundamental limitation of these methods when attempting to solve stiff ordinary differential equations. However, we can get around these limitations by using different types of methods, like implicit Euler. Let's now go down the path of understanding how to efficiently implement stiff ordinary differential equation solvers, and their interaction with other domains like automatic differentiation.
When one is solving a large-scale scientific computing problem with MPI, this is almost always the piece of code where all of the time is spent, so let's understand what it's doing.
To solve the implicit Euler step

$$u_{n+1} = u_n + \Delta t\, f(u_{n+1}, p, t + \Delta t),$$

move everything to one side: we now have a function $g(u_{n+1}) = u_{n+1} - u_n - \Delta t\, f(u_{n+1}, p, t + \Delta t)$ that we want to find roots for. classic.
how we find the roots affects stability; fixed point iteration is not a good choice. instead, use Anderson Acceleration or Newton’s Method. we focus on Newton’s.
say we want $g(x) = 0$.
iterate: $x_{k+1} = x_k - J(x_k)^{-1} g(x_k)$, where $J$ is the Jacobian of $g$.
but that’s not how we actually do it. numerically, this is two stages:
solve $J(x_k)\, a = g(x_k)$ for $a$
(this is the same as $a = J(x_k)^{-1} g(x_k)$, but without explicitly forming the inverse)
so we can update: $x_{k+1} = x_k - a$
Jacobian of $g$: for the implicit Euler residual, $J = I - \Delta t\, \frac{\partial f}{\partial u}$.
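A sketch of this two-stage Newton iteration for the implicit Euler residual (the function names and the use of ForwardDiff for the Jacobian are illustrative choices, not a reference implementation):

```julia
using LinearAlgebra, ForwardDiff

# One implicit Euler step u_{n+1} = u_n + dt*f(u_{n+1}, t+dt), solved by Newton's method.
function newton_implicit_euler(f, u_n, t, dt; iters = 10)
    g(u) = u - u_n - dt * f(u, t + dt)     # residual whose root is u_{n+1}
    u = copy(u_n)                           # initial guess: previous value
    for _ in 1:iters
        J = ForwardDiff.jacobian(g, u)      # J = I - dt * ∂f/∂u
        a = J \ g(u)                        # stage 1: solve J a = g(u) (factorize, don't invert)
        u -= a                              # stage 2: x_{k+1} = x_k - a
    end
    return u
end

f(u, t) = -50 .* u                          # a stiff linear test problem
unew = newton_implicit_euler(f, [1.0], 0.0, 0.1)
```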
*SDIRK: singly diagonally implicit Runge–Kutta methods for stiff ODEs. https://juliadiffeq.org/2017/08/13/SDIRK.html
Also, if solving a mass matrix ODE $M u' = f(u, p, t)$, the
Jacobian of the residual becomes $M - \Delta t\, \frac{\partial f}{\partial u}$ instead of $I - \Delta t\, \frac{\partial f}{\partial u}$.
the simplest way to generate the Jacobian is via finite differences: for each basis vector $e_i$, column $i$ is approximately $\frac{f(x + ε e_i) - f(x)}{ε}$.
thus, need $n + 1$ evaluations of $f$ for an $n$-dimensional system.
can be improved with dual numbers. formulate a multi-dimensional dual number seeded with all the basis directions: $d = x + \sum_{i=1}^{n} e_i ε_i$.
now with one computation of the primal $f(d)$, the ε parts give every column of the Jacobian.
columns of the Jacobian whose nonzero rows don't overlap can be combined: pack multiple basis vectors into a single ε seed and decompress afterwards (see the sketch below).
this can be considered a "matrix coloring" / graph coloring problem.
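A toy sketch of the compressed finite-difference Jacobian for a tridiagonal system (all names are illustrative; in practice packages like SparseDiffTools.jl automate the coloring):

```julia
# Columns j and j+3 of a tridiagonal Jacobian never share a nonzero row,
# so 3 perturbations ("colors") recover all n columns.
function tridiag_rhs(u)
    n = length(u)
    du = similar(u)
    for i in 1:n
        du[i] = -2u[i] + (i > 1 ? u[i-1] : 0.0) + (i < n ? u[i+1] : 0.0)
    end
    return du
end

n = 9
u = rand(n)
ε = sqrt(eps())
f0 = tridiag_rhs(u)
J = zeros(n, n)
for color in 1:3
    v = zeros(n); v[color:3:n] .= 1.0                  # perturb all same-colored columns at once
    fd = (tridiag_rhs(u .+ ε .* v) .- f0) ./ ε
    for j in color:3:n, i in max(1, j-1):min(n, j+1)
        J[i, j] = fd[i]                                # decompress: only column j touches row i
    end
end
```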
how do we solve the linear system $Ja = g$?
could invert $J$, but forming the inverse is expensive and numerically poor. instead:
factorize the Jacobian: $J = LU$.
lower triangular solve: solve $Lz = g$ for $z$ by forward substitution.
…can add more stuff
backsubstitution: solve $Ua = z$ for $a$.
finding $a = U^{-1}L^{-1}g = J^{-1}g$ this way costs $O(n^2)$ per solve once the $O(n^3)$ factorization is done.
ok, this isn't finding a true inverse but it's still an exact solve (up to roundoff), and the factorization can be reused for multiple right-hand sides.
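A quick sketch of the factor-then-solve workflow in Julia (a small dense example; `lu` uses partial pivoting, hence the permutation):

```julia
using LinearAlgebra

J = [4.0 1.0 0.0; 1.0 4.0 1.0; 0.0 1.0 4.0]
g = [1.0, 2.0, 3.0]

F = lu(J)              # O(n³): J[F.p, :] = F.L * F.U
z = F.L \ g[F.p]       # O(n²): forward substitution (lower triangular solve)
a = F.U \ z            # O(n²): backsubstitution (upper triangular solve)
@assert a ≈ J \ g
```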
in homework, we'll prove that there's a variant of Newton's method where you only need to use the Jacobian from an earlier iterate, so one factorization can be reused across iterations:
a "quasi-Newton method", which we can show will converge.
can also do “symbolic factorization” to generalize LU factorizations to sparse systems.
what if your factorizations don’t fit in memory?
we don't actually need to compute $J$ itself, only its action on vectors. can we get $Jv$ without forming $J$?
yes:

$$Jv \approx \frac{f(x + ε v) - f(x)}{ε}$$

that is, the directional derivative in the direction of $v$.
therefore, each Jacobian-vector product costs only one extra evaluation of $f$,
for a non-zero $ε$ chosen small (but not so small that roundoff dominates).
recall that in forward-mode automatic differentiation we can choose directions by seeding the dual part. so we can compute jvp using only a single forward-mode pass, with one partial.
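A small sketch of both flavors of matrix-free $Jv$, finite difference and a single dual partial via ForwardDiff.jl (the test function is made up):

```julia
using ForwardDiff

f(u) = [u[1]^2 + u[2], sin(u[1]) * u[2]]
u = [1.0, 2.0]
v = [0.3, -1.0]

# finite-difference jvp: one extra call to f
ε = sqrt(eps())
jvp_fd = (f(u .+ ε .* v) .- f(u)) ./ ε

# forward-mode jvp: seed the single dual partial with v, read it back off the output
d = ForwardDiff.Dual.(u, v)
jvp_ad = ForwardDiff.partials.(f(d), 1)

@show jvp_fd jvp_ad     # both ≈ J(u) * v, no Jacobian ever formed
```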
one way to solve the linear system $Jw = b$ without factorizing:
use iterative linear solvers. we want (you guessed it!) a discrete dynamical system whose steady state is the solution $w = J^{-1}b$.
so we want an iterative process $w_{k+1} = \Phi(w_k)$ whose fixed point is that solution.
split $J = A - B$, where $A$ is easy to invert (e.g. the diagonal of $J$, for the Jacobi method).
so we want $Aw = Bw + b$ at the fixed point; iterate:
$w_{k+1} = A^{-1}(Bw_k + b)$
now, for a fixed point $w^*$: $w^* = A^{-1}(Bw^* + b)$ gives $(A - B)w^* = b$, i.e. $Jw^* = b$.
nice.
is this stable?
check the eigenvalues of the Jacobian of the update map, $A^{-1}B$. if they're in the unit circle, the iteration is stable and converges.
(jhg: that means, check eigenvalues of the Jacobian of the update w.r.t. $w_k$, which is exactly $A^{-1}B$)
note that you can always do this by damping the update to pull the eigenvalues of $A^{-1}B$ inside the unit circle.
that always works, but is equivalent to small step size.
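A concrete sketch with the Jacobi splitting $A = \operatorname{diag}(J)$, $B = A - J$ (any diagonally dominant test matrix works):

```julia
using LinearAlgebra

function jacobi_iterate(J, b; iters = 50)
    A = Diagonal(J)                  # easy to invert
    B = A - J
    w = zero(b)
    for _ in 1:iters
        w = A \ (B*w + b)            # w_{k+1} = A⁻¹(B w_k + b)
    end
    return w
end

J = [4.0 1.0 0.0; 1.0 4.0 1.0; 0.0 1.0 4.0]   # diagonally dominant ⇒ Jacobi converges
b = [1.0, 2.0, 3.0]
w = jacobi_iterate(J, b)
@assert w ≈ J \ b
maximum(abs, eigvals(Diagonal(J) \ (Diagonal(J) - J)))   # spectral radius < 1 ⇒ stable iteration
```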
we can compute the solution using only $Jv$ products:
Krylov subspace methods.
nice properties: the Krylov subspace $\mathcal{K}_k = \operatorname{span}\{b, Jb, J^2b, \ldots, J^{k-1}b\}$ has dimension at most $n$, the size of the system.
therefore in at most $n$ iterations (in exact arithmetic) the method reaches the exact solution.
most common: GMRES method.
in step $k$, GMRES finds the element of the $k$-dimensional Krylov subspace that minimizes the residual $\|Jw_k - b\|$.
have a guaranteed bound on the number of jvps: the number of ODEs. that's not a good bound though; in high-dimensional sparse problems, you don't want to compute
100,000 jvps. so stop when the residual drops below a tolerance.
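A minimal sketch using IterativeSolvers.jl (a dense matrix stands in for the Jacobian; in the Jacobian-free setting you would pass a matrix-free operator that only implements $Jv$, e.g. via LinearMaps.jl):

```julia
using IterativeSolvers, LinearAlgebra

J = [4.0 1.0 0.0; 1.0 4.0 1.0; 0.0 1.0 4.0]
g = [1.0, 2.0, 3.0]

a = gmres(J, g)                 # only ever touches J through J*v products
@show norm(J*a - g)             # iteration stops once the residual is small enough
```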
Let's take a step back and see what our intermediate conclusion is. In order to solve for the implicit step, it just boils down to doing Newton's method on some residual function $g(u) = 0$.
When the matrix is too large, then one resorts to using a Krylov subspace method, since this only requires being able to do $Jv$ products, i.e. evaluations of $f$.
That’s the basic algorithm, but what are the other important details for getting this right?
However, the speed at which GMRES converges depends on the correlations between the vectors, which can be shown to be related to the condition number of the Jacobian matrix. A high condition number makes convergence slower (this is the case for the traditional iterative methods as well), which in turn is an issue because it is the high condition number of the Jacobian which leads to stiffness and causes one to have to use an implicit integrator in the first place!
Preconditioning is the process of using a semi-inverse to the matrix in order to split the matrix so that the iterative problem being solved is one that has a smaller condition number.
Mathematically, it involves choosing a preconditioner $P \approx J$ that is cheap to apply, and solving $P^{-1}Ja = P^{-1}b$ (left preconditioning), or $JP^{-1}(Pa) = b$ (right preconditioning).
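A tiny illustration of why this helps, using a diagonal (Jacobi) preconditioner on a badly row-scaled matrix (the numbers are made up; real codes pass the preconditioner to the Krylov solver rather than forming $P^{-1}J$ explicitly):

```julia
using IterativeSolvers, LinearAlgebra

J = [4000.0 1000.0 0.0; 1.0 4.0 1.0; 0.0 0.001 0.004]  # badly scaled ⇒ huge condition number
g = [1.0, 2.0, 3.0]

P = Diagonal(J)                      # cheap "semi-inverse": P ≈ J, trivial to invert
@show cond(J), cond(P \ J)           # conditioning drops from ~10⁶ to ~2
a = gmres(P \ J, P \ g)              # left-preconditioned system P⁻¹ J a = P⁻¹ g
@show norm(J*a - g)
```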
can re-use jacobian between steps until it diverges, then re-compute
sometimes newton’s method isn’t stable enough! then you need to vary time steps as well.
this needs to be combined with jacobian re-use, since jacobian depends on time step.
to be adaptive, usually use rejection sampling: get some estimate of error in a step.
this is done with an embedded method, which gives a cheap second approximation; use that to get an error estimate. when it gets too big, reduce $\Delta t$ and redo the step.
many schemes for changing $\Delta t$.
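A toy sketch of error-based step rejection (step doubling stands in for a real embedded error estimate; all names are illustrative):

```julia
# Accept a step only if the error estimate is below tol; otherwise shrink dt and retry.
function adaptive_step(step, u, t, dt; tol = 1e-6)
    while true
        u_full = step(u, t, dt)
        u_half = step(step(u, t, dt/2), t + dt/2, dt/2)
        err = maximum(abs, u_full .- u_half)          # cheap error estimate
        if err ≤ tol
            dt_next = dt * min(2.0, 0.9 * sqrt(tol / err))
            return u_half, t + dt, dt_next
        end
        dt /= 2                                       # reject the step, halve dt
    end
end

euler(u, t, dt) = u .+ dt .* (-u)                     # explicit Euler on u' = -u, for illustration
u, t, dt = adaptive_step(euler, [1.0], 0.0, 0.5)
```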
Here’s a quick summary of the methodologies in a hierarchical sense:
At the lowest level is the linear solve, either done by JFNK or (sparse) factorization. For large enough systems, this is the brunt of the work. This is thus the piece to computationally optimize as much as possible, and parallelize. For sparse factorizations, this can be done with a distributed sparse library implementation. For JFNK, the efficiency is simply due to the efficiency of your ODE function `f`.
An optional level for JFNK is the preconditioning level, where preconditioners can be used to decrease the total number of iterations required for Krylov subspace methods like GMRES to converge, and thus reduce the total number of f calls.
At the nonlinear solver level, different Newton-like techniques are utilized to minimize the number of factorizations/linear solves required, and maximize the stability of the Newton method.
At the ODE solver level, more efficient integrators and adaptive methods for stiff ODEs are used to reduce the cost by affecting the linear solves. Most of these calculations are dominated by the linear solve portion when it’s in the regime of large stiff systems. Jacobian reuse techniques, partial factorizations, and IMEX methods come into play as ways to reduce the cost per factorization and reduce the total number of factorizations.