
Hard: Implement Implicit QR algorithm with Hessenberg decomposition and shift/deflate tricks. #81

Open

AlexanderMath opened this issue Sep 11, 2023 · 6 comments

AlexanderMath (Contributor) commented Sep 11, 2023

Our current ipu_eigh uses the Jacobi algorithm. For other hardware accelerators, the QR algorithm is believed to become faster than Jacobi for larger matrices, d >= 512. Since we are targeting d >= 512, we are considering implementing the QR algorithm.

Main blocker: in Alex's (and possibly Paul's) experience on CPU, a naive QR algorithm empirically [1] needs roughly ~d^1.5 iterations to converge. This makes it hard for the QR algorithm to compete with Jacobi, which in Alex's experience [1] empirically converges in ~d^0.5 iterations. Mature QR implementations reduce the number of iterations using shift/deflate tricks; Alex has never managed to get these to work. The difficulty could be alleviated if we found a working shift/deflate implementation under an OS license that we could port to the IPU.

Tasks:

  • Implement a naive QR algorithm without Hessenberg decomposition. This takes O(d^4) time; mature implementations reduce it to O(d^3) time by first computing a Hessenberg decomposition. (A minimal sketch follows this list.)
  • Implement the Hessenberg decomposition (Alex calls this "QR the matrix from both sides").
  • Use the Hessenberg decomposition to implement the implicit QR algorithm.
  • Potentially improve parallelization of the Hessenberg algorithm by adapting tricks from blocked QR to Hessenberg (only needed if the Hessenberg step takes longer than the subsequent sparse QR steps).
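
For reference, a minimal NumPy sketch of the first task, the naive unshifted QR iteration (the `iters` count is a placeholder; per the blocker above, roughly ~d^1.5 iterations are needed without shifts):

```python
import numpy as np

def naive_qr_eigh(M, iters=1000):
    """Unshifted QR iteration: factor A_k = Q_k R_k, set A_{k+1} = R_k Q_k.
    Since A_{k+1} = Q_k^T A_k Q_k the spectrum is preserved, and for
    symmetric M the iterates converge towards a diagonal matrix."""
    A = np.array(M, dtype=float)
    V = np.eye(A.shape[0])
    for _ in range(iters):
        Q, R = np.linalg.qr(A)  # O(d^3) per iteration on a dense matrix
        A = R @ Q               # similarity transform, same eigenvalues
        V = V @ Q               # accumulate eigenvectors
    return np.diag(A), V        # eigenvalue and eigenvector estimates
```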

Notes:

  • The implicit QR algorithm on a Hessenberg matrix is "a lot of small unstructured compute."
  • Alternatively, we could consider "non-exact" algorithms: power iteration and its variants (a minimal sketch follows the footnote below).

[1] Using matrices M=np.random.normal(0,1, (d,d)); M=(M+M.T)/2. This may be a non-issue for other matrices.
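
On the "non-exact" alternative in the notes above, a minimal sketch of plain power iteration, the simplest of the variants mentioned; it recovers only the dominant eigenpair, so a full eigh would need deflation or simultaneous iteration on top:

```python
import numpy as np

def power_iteration(M, iters=100, seed=0):
    """Repeatedly apply M and renormalize; converges to the eigenvector of
    the largest-magnitude eigenvalue at a rate set by |lambda_2/lambda_1|."""
    v = np.random.default_rng(seed).normal(size=M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v @ (M @ v), v  # Rayleigh quotient estimates the eigenvalue
```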

@AlexanderMath AlexanderMath changed the title Use QR algorithm instead of Jacobi algorithm Hard: Use QR algorithm instead of Jacobi algorithm Sep 11, 2023
@AlexanderMath AlexanderMath changed the title Hard: Use QR algorithm instead of Jacobi algorithm Hard: Implement Implicit QR algorithm with Hessenberg decomposition Sep 11, 2023
@AlexanderMath AlexanderMath changed the title Hard: Implement Implicit QR algorithm with Hessenberg decomposition Hard: Implement Implicit QR algorithm with Hessenberg decomposition and shift/deflate tricks. Sep 11, 2023
AlexanderMath (Contributor, Author) commented:

Initial implementation takes 76M cycles; we aim for ~1M cycles. Code is on this branch: https://github.com/graphcore-research/pyscf-ipu/tree/hessenberg

Note: the algorithm is almost identical to tesselate_ipu.linalg.qr; it just multiplies with another H from the other side.
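
To illustrate the "from the other side" point, a dense NumPy sketch of Householder tridiagonalization (not the branch's actual kernel, just the reference algorithm it implements):

```python
import numpy as np

def hessenberg_tridiag(M):
    """Householder reduction of a symmetric matrix to tridiagonal form.
    Each reflector H is applied from BOTH sides (H @ A @ H), which keeps
    the spectrum: M = Q @ T @ Q.T with T tridiagonal."""
    A = np.array(M, dtype=float)
    d = A.shape[0]
    Q = np.eye(d)
    for k in range(d - 2):
        x = A[k+1:, k]
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        v = x.copy()
        v[0] -= alpha
        norm_v = np.linalg.norm(v)
        if norm_v < 1e-30:  # column already zero below the subdiagonal
            continue
        v /= norm_v
        H = np.eye(d)
        H[k+1:, k+1:] -= 2.0 * np.outer(v, v)
        A = H @ A @ H       # the two-sided update, unlike plain QR
        Q = Q @ H
    return A, Q             # T (tridiagonal up to round-off), and Q
```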


AlexanderMath (Contributor, Author) commented:

Profile of a single iteration @balancap

AlexanderMath (Contributor, Author) commented Sep 22, 2023

@balancap @paolot-gc

Context: for M.shape=(1024,1024) with M.T=M we want eigh(M). We use the classic Hessenberg reduction hessenberg(M)=tri_diagonal to turn the problem into eigh(tri_diagonal).

Problem: the literature claims eigvals(tri_diagonal) are easy but eigvects(tri_diagonal) are hard (e.g. jax.lax.eigh_tridiagonal only supports eigvals, not eigvects).

Here's an algorithm (credit to fhvilshoj): compute eigvals(tri_diagonal), which are claimed to be easy. Replicate tri_diagonal onto every tile and perform 1024 inverse power iterations in parallel on tri_diagonal-(eps+eigval[tile_i])*I for eps~0.

Correctness: inverse power iteration converges to the eigenvector whose eigenvalue has the smallest magnitude. The shift tri_diagonal-(eps+eigval[tile_i])*I makes the tile_i'th eigenvalue have magnitude eps~0.

Convergence: since eigval(inv(A))=1/eigval(A), we can make eigval(inv(tri_diagonal-(eps+eigval[tile_i])*I))~1/eps arbitrarily large. The convergence of power iteration (on this inverse matrix) depends on the gap between the largest and second-largest eigenvalue magnitudes, which we can make arbitrarily large as eps->0. This is great in theory; I have no idea what happens in float32.

Memory: since d=1024, we get tri_diagonal.nbytes ~ 12.2kB.

Time: we can compute inv(tri_diagonal-c*I)@v with O(n) operations using Gaussian elimination.
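
A minimal NumPy sketch of the per-tile work, assuming an O(d) Thomas-style solve stands in for the Gaussian elimination step (no pivoting, so the float32 question above applies in full; thomas_solve and inverse_iteration are hypothetical names):

```python
import numpy as np

def thomas_solve(sub, main, sup, rhs):
    """O(n) solve of a tridiagonal system; sub/sup have length n-1, main
    and rhs length n. No pivoting, so accuracy on the nearly singular
    shifted systems below is exactly the open float32 question."""
    n = len(main)
    cp, dp = np.zeros(n), np.zeros(n)
    cp[0], dp[0] = sup[0] / main[0], rhs[0] / main[0]
    for i in range(1, n):
        denom = main[i] - sub[i-1] * cp[i-1]
        if i < n - 1:
            cp[i] = sup[i] / denom
        dp[i] = (rhs[i] - sub[i-1] * dp[i-1]) / denom
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i+1]
    return x

def inverse_iteration(diag, off, lam, eps=1e-6, iters=10, seed=0):
    """One tile's work: inverse power iteration on T - (lam + eps) * I for a
    symmetric tridiagonal T (main diagonal `diag`, off-diagonal `off`)."""
    shifted = diag - (lam + eps)
    v = np.random.default_rng(seed).normal(size=len(diag))
    for _ in range(iters):
        v = thomas_solve(off, shifted, off, v)  # applies inv(T - c*I) in O(d)
        v /= np.linalg.norm(v)
    return v
```

The eigenvalues lam could come from e.g. scipy.linalg.eigvalsh_tridiagonal(diag, off), matching the claim above that eigvals of a tridiagonal matrix are the easy part.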

awf (Collaborator) commented Sep 22, 2023

The above is called "Simultaneous Iteration" in https://courses.engr.illinois.edu/cs554/fa2015/notes/12_eigenvalue_8up.pdf. You can't make the eigenvalue gap arbitrarily large if lambda_i = lambda_{i+1}, so in practice you can't make it arbitrarily large if they are close to equal.

In general, I guess we should re-title this issue "Speed up eigh computation", and the first task is to gather potential implementation strategies, e.g. by just grabbing the slide headings from the lecture above.

As all of these approaches end up with provisos such as "Algorithm is complicated to implement and difficult questions of numerical stability, eigenvector orthogonality, and load balancing must be addressed", it's probably a good idea to see if existing code such as scalapack (or ARPACK for online-computed ERI) has been ported to e.g. numpy.

AlexanderMath (Contributor, Author) commented Sep 22, 2023

> You can't make the eigenvalue gap arbitrarily large if lambda_i = lambda_{i+1}, so in practice you can't make it arbitrarily large if they are close to equal.

Agree.

> The above is called "Simultaneous Iteration" in https://courses.engr.illinois.edu/cs554/fa2015/notes/12_eigenvalue_8up.pdf.

Do you mean "parallel inverse iteration"? Simultaneous iteration doesn't use the inverse, and it requires normalization (i.e., orthogonalizing the q simultaneous eigenvectors?).

