
Parallelisation issues with snp_ldsc and SLURM system #534

Open · Sabor117 opened this issue Jan 14, 2025 · 17 comments

@Sabor117 commented Jan 14, 2025

Hi Florian,

This is the follow-up to the previous issue I mentioned. I will preface this by saying that I'm not sure whether this is something you will be able to help with or whether it is an issue with my specific HPC, but you may have seen similar issues and be able to help out.

Essentially, I am submitting LDpred2 jobs on a SLURM scheduler as follows:

#SBATCH --job-name=ldpredPRS_01_alt_weights_calc_%a
#SBATCH --time=3-00:00:00
#SBATCH --partition=small
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem-per-cpu=60G
#SBATCH --account=project_2007428
#SBATCH --array=1-9
#SBATCH --output=/<LOGDIR>/ldpredPRS_01_alt_weights_calc_%a.log
#SBATCH --error=/<LOGDIR>/ldpredPRS_01_alt_weights_calc_%a.log

(And just a quick note here, I have tried both --cpus-per-task=6 and --ntasks=6 for this).

When I run LDpred2 (with the pre-computed LD reference panel), everything goes smoothly up until the snp_ldsc function:

ldsc = with(df_beta, bigsnpr::snp_ldsc(ld, ld_size = nrow(hapmap3_snps),
                                chi2 = (beta / beta_se)^2,
                                sample_size = n_eff,
                                ncores = NCORES))

When I submit the jobs as described above, the script runs smoothly until it reaches this function and then it crashes with the following error:

Error in { : task 1 failed - "could not find function "snp_ldsc""
Calls: with ... eval -> eval -> <Anonymous> -> %dopar% -> <Anonymous>
Execution halted

Then when I change it to --ntasks=1 and remove the --cpus-per-task argument, it actually runs without throwing the error. Essentially, it seems as if the parallel workers aren't loading the function, which causes the crash, despite everything working correctly with just one worker...

HOWEVER, without parallelisation it then ran for 3 days straight (the maximum time for this sort of job) and ran out of time.

Have you ever seen anything like this before? Is it something you think I might be able to solve? I am also in touch with our IT support, as this seems like an issue with the HPC rather than with LDpred2, but I thought I would ask here just in case.

@privefl (Owner) commented Jan 14, 2025

I would try to reinstall bigsnpr, it is odd that it doesn’t find the function.
You can also try loading bigsnpr before.

You can also skip using parallelism for this function as it is only used to get confidence intervals; use blocks = NULL and ncores = 1.
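For clarity, a sketch of that call, keeping the other arguments from your snippet unchanged:

ldsc = with(df_beta, bigsnpr::snp_ldsc(ld, ld_size = nrow(hapmap3_snps),
                                chi2 = (beta / beta_se)^2,
                                sample_size = n_eff,
                                blocks = NULL,  # no blocks, so no SEs/CIs are computed
                                ncores = 1))    # no parallel workers are spawned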

@Sabor117 (Author) commented Jan 14, 2025

Sorry, to be clear, at the very start of this script I do:

library(data.table)
library(bigsnpr, lib.loc = "/RPackages_421/")
library(bigreadr, lib.loc = "/RPackages_421/")
library(rmio, lib.loc = "/RPackages_421/")

The reason I wrote bigsnpr::snp_ldsc instead of just snp_ldsc was to see if I could force the script to find the function (but that didn't work). If you aren't familiar with this issue, then it must be a problem with our HPC...
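(A note for anyone with the same setup: the parallel workers are fresh R sessions, so a lib.loc passed to library() on the master does not carry over to them, and the bigsnpr:: prefix cannot help if the workers cannot see the library folder at all. A minimal workaround sketch, assuming the workers are spawned locally and inherit the master's environment variables; the path is the one from the snippet above:)

# Sketch: one way to make a custom library folder visible to spawned workers.
# R reads R_LIBS_USER at startup, and locally spawned workers inherit it,
# whereas .libPaths() changes on the master do not carry over to workers.
Sys.setenv(R_LIBS_USER = "/RPackages_421/")
.libPaths(c("/RPackages_421/", .libPaths()))  # for the current session too
library(bigsnpr)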

Oh, so instead just do:

ldsc = with(df_beta, bigsnpr::snp_ldsc(ld, ld_size = nrow(hapmap3_snps),
                                chi2 = (beta / beta_se)^2,
                                sample_size = n_eff,
                                blocks = NULL))

Would this be much quicker as well? And it wouldn't impact calculating either of these?

h2_est = ldsc[["h2"]]

### or

multi_auto = snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
                               vec_p_init = seq_log(1e-4, 0.2, length.out = 30),
                               allow_jump_sign = FALSE, shrink_corr = 0.95,
                               ncores = NCORES) 

Also, thanks so much for responding while on vacation, Florian; there's no huge rush on this, so don't push yourself here!

@privefl (Owner) commented Jan 16, 2025

Do you need the lib.loc parameter?

Yes, it would be faster to avoid computing the CIs, which you don’t really need for running LDpred2.

@Sabor117 (Author)

Yeah, that's just a local quirk of our HPC: some common R packages like dplyr/data.table are available centrally for everyone, but more niche packages need to be installed separately. In practice it changes nothing, as it just specifies where to load the package from.

Ah! Okay, that's perfect news then! I'm giving this a try as we speak. That might solve the parallelisation issue and the speed issue simultaneously!

@privefl (Owner) commented Jan 17, 2025

bigsnpr is a CRAN package, so it might be updated frequently on your cluster. Make sure to update it regularly.

Yeah, but that avoids the issue rather than really solving it. You might run into the same problem with other functions as well.

From what I remember, it should take less than 5 min to run snp_ldsc with 200 blocks over 15 cores.

@Sabor117 (Author)

I'll try uninstalling and updating if this doesn't work!

Wait, what?! As little as 5 minutes?? So, in theory even with only 1 core, surely it should take much less than 72 hours, right?

I submitted jobs with this yesterday after we spoke:

ldsc = with(df_beta, bigsnpr::snp_ldsc(ld, ld_size = nrow(hapmap3_snps),
                                chi2 = (beta / beta_se)^2,
                                sample_size = n_eff,
                                blocks = NULL, # skip parallelisation BUT does not calculate LDSC SEs - default is 200
                                ncores = NCORES))

This has now been running for >24 hours so far (on only 1 core). This is using the pre-computed HapMap3+ SNPs and input files. Is this concerning?

@privefl (Owner) commented Jan 20, 2025

Yes, this is concerning.

On my old laptop with only 4 cores, it takes:

  • 93 seconds to run with the blocks
  • < 1 sec to run without the blocks

@Sabor117 (Author)

Can I ask how many SNPs this was? Was it the full set of 1.4M?

@privefl (Owner) commented Jan 20, 2025

Yes, 1.4M.

@Sabor117 (Author) commented Jan 20, 2025

Okay, I've now done more tests and think I made an error in my previous messages.

The parallelisation issue I mentioned at the start was indeed a problem with snp_ldsc, but using only one core fixed that, and the function then worked correctly.

The problem now is actually with snp_ldpred2_auto. That is the step which was running for 72 hours. Apologies for the confusion.

@privefl (Owner) commented Jan 20, 2025

Again, with 15 cores, LDpred2-auto should run in less than 12 hours.
Are you asking for enough memory? (as in https://github.com/privefl/paper-ldpred2/blob/master/batchtools.slurm.tmpl; we talked about that, right?)

@helijuottonen

Hello,

I provide R support for the HPC cluster mentioned above. We have been trying to get the bigsnpr package working, for example by trying different installation methods, different R versions, and different Linux computers, but with no luck so far. We have been using this script for testing: https://privefl.github.io/bigsnpr/articles/LDpred2.html#computing-ldpred2-scores-genome-wide.

So far, the only computer where we have been able to run the script without errors using multiple cores is my Mac laptop. Linux machines (our HPC cluster, a virtual machine, a laptop) give errors:

Case 1: running on our cluster Puhti in its container-based R environment (R 4.4.0, https://docs.csc.fi/apps/r-env/) with multiple cores gives errors of the type mentioned above (Error in { : task 1 failed, with some object or function not found) for the steps using snp_ldpred2_grid and snp_ldpred2_auto (and also snp_ldsc if ncores > 1). For example:

beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)

Error in { : 
  task 1 failed - "object '_bigsnpr_ldpred2_gibbs_one' not found"

SessionInfo:

R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Rocky Linux 8.9 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /opt/intel/oneapi/mkl/2024.1/lib/libmkl_gf_lp64.so.2;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Helsinki
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bigsnpr_1.12.18  bigstatsr_1.5.12

loaded via a namespace (and not attached):
 [1] bit_4.0.5          Matrix_1.7-0       gtable_0.3.5       dplyr_1.1.4       
 [5] compiler_4.4.0     tidyselect_1.2.1   Rcpp_1.0.12        parallel_4.4.0    
 [9] doRNG_1.8.6        scales_1.3.0       lattice_0.22-6     bigreadr_0.2.5    
[13] ggplot2_3.5.1      R6_2.5.1           generics_0.1.3     ff_4.0.12         
[17] iterators_1.0.14   tibble_3.2.1       bigparallelr_0.3.2 flock_0.7         
[21] munsell_0.5.1      bigassertr_0.1.6   pillar_1.9.0       bigsparser_0.6.1  
[25] rlang_1.1.3        utf8_1.2.4         doParallel_1.0.17  cli_3.6.2         
[29] magrittr_2.0.3     digest_0.6.35      foreach_1.5.2      grid_4.4.0        
[33] rmio_0.4.0         cowplot_1.1.3      lifecycle_1.0.4    vctrs_0.6.5       
[37] glue_1.7.0         data.table_1.15.4  codetools_0.2-20   rngtools_1.5.2    
[41] parallelly_1.37.1  fansi_1.0.6        colorspace_2.1-0   tools_4.4.0       
[45] pkgconfig_2.0.3

Case 2: I set up a new Rocker-based R container (https://hub.docker.com/r/rocker/tidyverse) with the latest R and package versions on our cluster. This gives the following error for anything using ncores=NCORES from the step 'on-disk sparse genome-wide correlation matrix on-the-fly' onwards:

Error in checkForRemoteErrors(lapply(cl, recvResult)) : 
  one node produced an error: there is no package called ‘RhpcBLASctl’

RhpcBLASctl is installed. I have tried several installation methods for bigsnpr (install.packages(), install_github(), cloning the GitHub repo and running devtools::install() there), but all lead to the above error with multiple cores.
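(For reference, a quick way to check whether freshly spawned workers can see a given package at all; a sketch using base parallel, with RhpcBLASctl as the example:)

library(parallel)
cl <- makeCluster(2)
# Can each worker load the package, and which library paths does it search?
clusterEvalQ(cl, requireNamespace("RhpcBLASctl", quietly = TRUE))
clusterEvalQ(cl, .libPaths())
stopCluster(cl)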

SessionInfo:

R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RhpcBLASctl_0.23-42 bigsnpr_1.12.18     bigstatsr_1.6.1    

loaded via a namespace (and not attached):
 [1] Matrix_1.7-1       gtable_0.3.6       dplyr_1.1.4        compiler_4.4.2    
 [5] tidyselect_1.2.1   Rcpp_1.0.14        parallel_4.4.2     doRNG_1.8.6.1     
 [9] scales_1.3.0       lattice_0.22-6     bigreadr_0.2.5     ggplot2_3.5.1     
[13] R6_2.5.1           generics_0.1.3     iterators_1.0.14   tibble_3.2.1      
[17] bigparallelr_0.3.2 flock_0.7          munsell_0.5.1      bigassertr_0.1.6  
[21] pillar_1.10.1      bigsparser_0.7.3   rlang_1.1.5        doParallel_1.0.17 
[25] cli_3.6.3          magrittr_2.0.3     digest_0.6.37      foreach_1.5.2     
[29] grid_4.4.2         rmio_0.4.0         cowplot_1.1.3      lifecycle_1.0.4   
[33] vctrs_0.6.5        glue_1.8.0         data.table_1.16.4  codetools_0.2-20  
[37] rngtools_1.5.2     parallelly_1.41.0  colorspace_2.1-1   tools_4.4.2       
[41] pkgconfig_2.0.3 

We would be grateful for any tips on how to get the bigsnpr package working on our cluster! I'm happy to provide additional information if needed.

@privefl (Owner) commented Jan 24, 2025

It sounds like the parallel R processes that are spawned do not use the same installation as the master R process.

What do you get for .libPaths()?

And for

library(doParallel)
registerDoParallel(cl <- makeCluster(3))
foreach(i = 1:3) %dopar% { .libPaths() }

@helijuottonen

Ahhh, I see the problem now; I hadn't come across a case like this before. I tried one more thing (that I obviously should have tried earlier): installing the package to our central package installation folder, which users themselves cannot install into. And now bigsnpr works. For R packages installed by users, the .libPaths() output would look like this, where the first folder is an example of the user's package installation directory:

.libPaths()
[1] "/projappl/project_xxxxxx/username/packages"
[2] "/usr/lib64/R/library"                          
[3] "/appl/soft/math/r-env/440/440-rpackages"    

But if I run foreach on .libPaths(), this folder is missing from the output:

library(doParallel)
registerDoParallel(cl <- makeCluster(3))
foreach(i = 1:3) %dopar% { .libPaths() }

[[1]]
[1] "/usr/lib64/R/library"                   
[2] "/appl/soft/math/r-env/440/440-rpackages"

[[2]]
[1] "/usr/lib64/R/library"                   
[2] "/appl/soft/math/r-env/440/440-rpackages"

[[3]]
[1] "/usr/lib64/R/library"                   
[2] "/appl/soft/math/r-env/440/440-rpackages"

So everything should be good now and the package should work for our users. Thank you for the help!

@privefl (Owner) commented Jan 24, 2025

But it means that, when running in parallel, your users cannot properly access the packages they have installed themselves.

I would encourage you to find a fix for that ;)
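(For readers hitting the same problem, one possible workaround sketch: for a PSOCK cluster created in your own code, push the master's library paths to each worker before any %dopar% call. Note this only helps for clusters you create yourself; for worker clusters spawned inside package functions, an environment-level fix such as setting R_LIBS in ~/.Renviron would be needed.)

library(doParallel)
cl <- makeCluster(3)
# Send the master's library paths (including user folders) to every worker:
parallel::clusterCall(cl, function(paths) .libPaths(paths), .libPaths())
registerDoParallel(cl)
foreach(i = 1:3) %dopar% { .libPaths() }  # should now include the user folder
stopCluster(cl)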

@helijuottonen

Yes, we will definitely find a fix for this issue! I guess it hadn't come up for a while because many packages use future for parallelisation and because of the large number of pre-installed packages we have.

@privefl (Owner) commented Jan 28, 2025

@Sabor117 If there is no remaining issue around this, please close it.
