In some applications, we may require that the LD matrix (the SNP-by-SNP correlation matrix) be positive definite (PD) or positive semi-definite (PSD). Very often, this is not the case, which may result in instabilities for some downstream models. Therefore, it would be great to augment our LD functionalities to check for positive semi-definiteness and to find the nearest PSD matrix to the sample LD matrix.
A good starting point may be the code snippet in this StackOverflow thread (https://stackoverflow.com/a/43244194). I modified this function to find the nearest PSD matrix while retaining the sparsity structure of the original matrix:
```python
import numpy as np
from numpy import linalg as la


def nearestPD(A):
    """Find the nearest positive-definite matrix to input

    A Python/Numpy port of John D'Errico's `nearestSPD` MATLAB code [1],
    which credits [2].

    Credit: https://stackoverflow.com/a/43244194
    Modified by Shadi Zabad to retain the sparsity structure in the
    original matrix.

    [1] https://www.mathworks.com/matlabcentral/fileexchange/42885-nearestspd
    [2] N.J. Higham, "Computing a nearest symmetric positive semidefinite
    matrix" (1988): https://doi.org/10.1016/0024-3795(88)90223-6
    """

    np.fill_diagonal(A, 1.)

    B = (A + A.T) / 2
    _, s, V = la.svd(B)

    H = np.dot(V.T, np.dot(np.diag(s), V))

    A2 = (B + H) / 2
    A3 = (A2 + A2.T) / 2

    # Retain the sparsity structure: entries that are zero in the
    # original matrix stay zero.
    A3[A == 0.] = 0.

    if isPD(A3):
        return A3

    spacing = np.spacing(la.norm(A))
    # The above is different from [1]. It appears that MATLAB's `chol` Cholesky
    # decomposition will accept matrixes with exactly 0-eigenvalue, whereas
    # Numpy's will not. So where [1] uses `eps(mineig)` (where `eps` is Matlab
    # for `np.spacing`), we use the above definition. CAVEAT: our `spacing`
    # will be much larger than [1]'s `eps(mineig)`, since `mineig` is usually
    # on the order of 1e-16, and `eps(1e-16)` is on the order of 1e-34, whereas
    # `spacing` will, for Gaussian random matrixes of small dimension, be on
    # the order of 1e-16. In practice, both ways converge.

    I = np.eye(A.shape[0])
    k = 1
    while not isPD(A3):
        # Nudge the diagonal up until the smallest eigenvalue is positive.
        mineig = np.min(np.real(la.eigvals(A3)))
        A3 += I * (-mineig * k ** 2 + spacing)
        A3[A == 0.] = 0.
        k += 1

    return A3


def isPD(B):
    """Returns true when input is positive-definite, via Cholesky"""
    try:
        _ = la.cholesky(B)
        return True
    except la.LinAlgError:
        return False
```
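As a quick sanity check, here's a toy usage sketch (the 3x3 matrix below is a made-up example, not real LD data): it has unit diagonal and looks like a correlation matrix, but its smallest eigenvalue is about -0.27, so it is not PD. `nearestPD` repairs it while keeping the zero entries at zero:

```python
import numpy as np

# Made-up indefinite "correlation" matrix (smallest eigenvalue ~ -0.27).
R = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.9],
              [0.0, 0.9, 1.0]])

assert not isPD(R)

# nearestPD modifies its input in place, so pass a copy.
R_psd = nearestPD(R.copy())

assert isPD(R_psd)
assert R_psd[0, 2] == 0.  # zeros in the original matrix are preserved
```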
The main bottleneck here is that we need to perform SVD on the LD matrix, which may be computationally expensive, even with sparse arrays. One way to get around this difficulty is to deploy this within LD blocks, which may be more manageable; a rough sketch of the block-wise application follows the task list below.
To experiment with this, we can try to tackle the following tasks:
- Extract blocks from `LDMatrix` corresponding to the LD blocks defined by the block estimator.
- Find the nearest PSD matrix for each block separately.
- Check if this alleviates computational or numerical instabilities for some downstream tasks (e.g. `viprs`).
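Here's what the block-wise application might look like (everything here is a hypothetical sketch: the helper name, the `(start, end)` block boundaries, and the assumption of a dense numpy array are for illustration only, not the actual `LDMatrix` API):

```python
import numpy as np

def nearest_psd_blockwise(ld_mat, block_boundaries):
    """
    Hypothetical sketch: repair each LD block separately.

    `ld_mat` is assumed to be a dense numpy array for the full region, and
    `block_boundaries` a list of (start, end) index pairs delimiting the
    approximately independent LD blocks along the diagonal.
    """
    result = np.array(ld_mat, dtype=float, copy=True)

    for start, end in block_boundaries:
        # nearestPD modifies its input in place, so pass a copy of the block.
        block = result[start:end, start:end].copy()
        result[start:end, start:end] = nearestPD(block)

    return result
```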
Another option that seems to work well in some settings is to use the shrinkage algorithm of Higham et al. (2014). Here's a rough implementation of the idea:
```python
from scipy.sparse import identity
from scipy.sparse.linalg import eigsh, ArpackNoConvergence


def isPSD_sparse(B):
    """
    Check for positive semi-definiteness using the ARPACK library
    by examining the smallest eigenvalue.
    """
    try:
        return eigsh(B, 1, which='SA', return_eigenvectors=False)[0] >= 0.
    except ArpackNoConvergence:
        # Fall back to shift-invert mode if ARPACK fails to converge.
        return eigsh(B, 1, which='SA', sigma=0,
                     return_eigenvectors=False)[0] >= 0.


def nearcorr_bisection(m0, tol=1e-5):
    """
    Find the nearest positive semi-definite correlation matrix using the
    bisection method (Algorithm 3.1) of Higham et al. (2014). The method
    finds an intermediate matrix between the original `m0` and the identity
    matrix of the same shape. The benefit of this is that it retains the
    sparsity structure in the original matrix, which is important for large
    covariance matrices in genomics.

    One major difference from Higham et al. is that we use scipy's and
    ARPACK's sparse matrix operations to find the smallest eigenvalues
    efficiently, instead of performing Cholesky decomposition to check for
    positive semi-definiteness.

    The method returns the `alpha` value from the shrinkage procedure,
    instead of the updated matrix. This is because, under this formulation,
    we'd only need to multiply the off-diagonal elements by (1 - alpha) to
    obtain the shrunk correlation matrix.
    """
    a_l = 0.
    a_r = 1.

    m1 = identity(m0.shape[0])

    while a_r - a_l > tol:
        a_m = .5 * (a_l + a_r)

        # If the blend toward the identity is still not PSD, we need more
        # shrinkage; otherwise, try less.
        if not isPSD_sparse(a_m * m1 + (1. - a_m) * m0):
            a_l = a_m
        else:
            a_r = a_m

    return a_r
```
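For reference, here's a minimal usage sketch (the input matrix is made up for illustration): once `alpha` is returned, the shrunk correlation matrix is obtained by scaling the off-diagonal entries by `(1 - alpha)` and keeping the unit diagonal:

```python
import numpy as np
import scipy.sparse as sp

# Made-up input: a sparse, symmetric matrix with unit diagonal that is
# not necessarily PSD.
M = sp.random(200, 200, density=0.02, random_state=42)
R = ((M + M.T) / 2).tolil()
R.setdiag(1.)
R = R.tocsr()

alpha = nearcorr_bisection(R)

# Apply the shrinkage: scaling the whole matrix by (1 - alpha) also shrinks
# the diagonal, so reset it to 1 afterwards.
R_shrunk = ((1. - alpha) * R).tolil()
R_shrunk.setdiag(1.)
R_shrunk = R_shrunk.tocsr()
```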
Needs more testing, but looks like a promising approach that can be very fast.