
Question Regarding Convergence of Approximation and Original Method for Correlation Computation #4

Open
qianzach opened this issue Mar 28, 2024 · 3 comments


@qianzach

qianzach commented Mar 28, 2024

Hi @mingzehuang,

I hope this finds you well. I have a quick question about the estimation of the latent correlation matrices.

Description

I would like to get an accurate estimate of the latent correlation matrix. Nothing errors out, but with both the original and approx methods I hit a maximum-iteration warning, likely because my data is sparse. I am wondering whether this compromises the accuracy of the estimate.

I would like to know which estimate is better, and whether there is a way to improve them, because some of the results differ noticeably. I've tried adjusting the tol parameter, but the results for the original method stay essentially the same. I have also experimented with different shrinkage (nu) and lower-boundary values.

  • latentcor version: 0.2.5
  • Python version: Python 3.11.7
  • Operating System: macOS Ventura

What I Did

Consider the n×k matrix mat with n > k. The latent correlations we want to measure are between the columns (covariates). Accordingly, the tps argument is just an array filled with "tru" (the data are single-cell gene expression values, so we assume a truncated Gaussian copula).
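For context, the type vector is just the "tru" code repeated once per column (a minimal sketch; tps_arr and mat are the names used in the calls below):

# mat: n x k single-cell expression matrix (one column per gene).
# Every column is treated as truncated, so the type vector repeats "tru" k times.
tps_arr = ["tru"] * mat.shape[1]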

latentcor(mat, tps=tps_arr, tol=1e-17, method='original', use_nearPD=True)['R']  # original method

latentcor(mat, tps=tps_arr, method='approx', nu=0.01, ratio=0.9, use_nearPD=True)['R']  # approx method

The end result is that I get larger correlation magnitudes (roughly 0.1 more in absolute value, depending on the sign) when using the approximation. However, since both runs finish by hitting the maximum-iteration termination, I'm not sure which is the better estimate.
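To make the gap concrete, this is roughly how I compare the two estimates (a sketch; R_orig and R_approx stand for the two ['R'] outputs above):

import numpy as np
# R_orig, R_approx: the 'R' matrices returned by the two calls above.
diff = np.abs(np.asarray(R_orig) - np.asarray(R_approx))
print("max element-wise gap:", diff.max())   # on the order of 0.1 for my data
print("mean element-wise gap:", diff.mean())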

After looking at some of the underlying code, I suspect this may be related to nearest_corr(), but the n_fact parameter you set is already quite high, so I'm a bit confused by the difference in results. Thank you!

@mingzehuang
Owner

Hi, @qianzach; sorry for the late reply! Generally speaking, 'original' is more accurate than 'approx'. However, I'm investigating the convergence problem you mentioned. I'll get back to you ASAP!

@qianzach
Author

qianzach commented Apr 6, 2024

No worries, and thank you so much! Just to provide an additional detail: the warning comes from statsmodels.stats.correlation_tools.corr_nearest. It reaches the maximum number of iterations (likely due to very sparse data), and it seems this might be the source of the differences between the approx and original latent correlation estimates.
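A sketch of how I checked this directly (R_raw stands for the unadjusted estimate, e.g. the 'R' output with use_nearPD=False, and the larger n_fact is my own choice):

import warnings
import numpy as np
from statsmodels.stats.correlation_tools import corr_nearest
from statsmodels.tools.sm_exceptions import IterationLimitWarning
# Run the positive-definite adjustment on the unadjusted estimate and
# record whether the iteration limit is still hit with more iterations allowed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", IterationLimitWarning)
    R_psd = corr_nearest(np.asarray(R_raw), threshold=1e-15, n_fact=1000)
hit_limit = any(issubclass(w.category, IterationLimitWarning) for w in caught)
print("iteration limit reached:", hit_limit)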

Thanks!

@mingzehuang
Owner

Hi, @qianzach. We use corr_nearest() only to post-process the output and guarantee that the returned matrix is positive definite. If it doesn't converge properly, you can turn it off by setting use_nearPD=False. You will then get the (possibly only positive semi-definite) output matrix, which you can adjust yourself; that is the raw result of our algorithm :)
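For example (a sketch, not part of latentcor; corr_clipped and the threshold are my own substitutions for the built-in adjustment), the iterative step could be replaced by a one-shot eigenvalue clip:

import numpy as np
from latentcor import latentcor
from statsmodels.stats.correlation_tools import corr_clipped
# Skip the built-in corr_nearest adjustment and take the raw estimate.
R_raw = latentcor(mat, tps=tps_arr, method='original', use_nearPD=False)['R']
# Clip eigenvalues below the threshold in a single pass (no iteration limit);
# the result is renormalized to have a unit diagonal.
R_pd = corr_clipped(np.asarray(R_raw), threshold=1e-6)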
