Error in sjSDM_cv #131

Open
YJ781 opened this issue Sep 25, 2023 · 8 comments

@YJ781

YJ781 commented Sep 25, 2023

Hi all,

I am running the model on a GPU on a supercomputer and get the following error:

Error in cbind(nodes, (n_gpu - 1):0) : object 'nodes' not found
Calls: sjSDM_cv -> tune_func -> cbind

This is my sjSDM_cv call:
sjSDM_cv(env = as.matrix(train_X), Y = as.matrix(train_Y),
         learning_rate = 0.01, iter = 100L, CV = 10,
         tune_steps = 40,
         lambda_cov = seq(0, 0.1, 0.001),
         lambda_coef = seq(0, 0.1, 0.001),
         alpha_cov = seq(0, 1, 0.05),
         alpha_coef = seq(0, 1, 0.05),
         device = "gpu",
         n_gpu = 1,
         sampling = 100L,
         biotic = bioticStruct(df = dim(train_Y)[2]),
         blocks = 6L,
         step_size = 4L,
         family = gaussian("identity")
)

Do you know how I can figure this out?
Thanks!

@MaximilianPi
Member

Hi @YJ781,

If you set n_gpu = 1L, sjSDM_cv expects CPU parallelization, which you didn't specify. Either turn on CPU parallelization with n_cores = X, or set n_gpu = NULL (n_gpu = NULL and n_gpu = 1 are effectively the same: in both cases only one GPU is used).

Thanks for reporting, I will implement a check for that.
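
As a rough illustration of the rule described above, here is a minimal sketch of such a check. It is not the actual sjSDM implementation: `check_parallel_args` is a hypothetical helper, and the only assumption it encodes is the one stated in this comment, namely that a non-NULL n_gpu needs n_cores to be set as well. The values 4L and 2L are purely illustrative.

```r
# Hypothetical pre-flight check; NOT part of the sjSDM API.
# Assumption (taken from the explanation above): a non-NULL n_gpu only
# works together with CPU parallelization via n_cores.
check_parallel_args <- function(n_cores = NULL, n_gpu = NULL) {
  if (!is.null(n_gpu) && is.null(n_cores)) {
    stop("n_gpu is set but n_cores is NULL: set n_cores, or use ",
         "n_gpu = NULL to run on a single GPU.", call. = FALSE)
  }
  invisible(TRUE)
}

# Single GPU, no CPU parallelization: passes silently.
check_parallel_args(n_cores = NULL, n_gpu = NULL)

# n_cores given together with n_gpu: passes silently.
check_parallel_args(n_cores = 4L, n_gpu = 2L)

# The configuration from the original post: caught with a readable
# message instead of "object 'nodes' not found".
res <- tryCatch(
  check_parallel_args(n_cores = NULL, n_gpu = 1L),
  error = function(e) conditionMessage(e)
)
print(res)
```

Calling a helper like this before sjSDM_cv would surface the misconfiguration early, before any model fitting starts.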

@YJ781
Author

YJ781 commented Oct 4, 2023

Hi @MaximilianPi ,
Thank you for the explanation. I tried two settings but got errors with both. Here are the error messages.

n_gpu = NULL
Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Run reticulate::py_last_error() for details.
Calls: sjSDM_cv ... sjSDM -> system.time -> <Anonymous> -> py_call_impl
Timing stopped at: 0.494 0.223 0.719
Execution halted

n_gpu=2
Error in cbind(nodes, (n_gpu - 1):0) : object 'nodes' not found
Calls: sjSDM_cv -> tune_func -> cbind
Execution halted

@MaximilianPi
Member

Hi @YJ781,

This is a bug: I forgot to move a tensor to the right device.

I pushed a fix to the development branch (the master branch of the GitHub repository). You can install the development version by running the following:
devtools::install_github("https://github.com/TheoreticalEcology/s-jSDM", subdir = "sjSDM", ref = "master")

@YJ781
Author

YJ781 commented Oct 6, 2023

Hi @MaximilianPi ,

Thank you so much. It worked.

But with n_gpu = 2, I still get the same error:

Error in cbind(nodes, (n_gpu - 1):0) : object 'nodes' not found
Calls: sjSDM_cv -> tune_func -> cbind
Execution halted

How can I deal with it?

@MaximilianPi
Member

Hi @YJ781,

Do you still have this error?

@YJ781
Author

YJ781 commented Nov 14, 2023 via email

@YJ781
Author

YJ781 commented Dec 14, 2023

Hi @MaximilianPi ,

I'm sorry, I forgot to @ you in my last comment. I'm still getting the same error. Thanks!

@MaximilianPi
Member

Hi @YJ781,
I will look into this!
