Remote errors in interactive mode when larger workflows are run #174
Comments
Hi @JaGeo,
I think it could be possible to enable multiple connections by inserting the OTP multiple times. I will investigate how to do that.
@gpetretto Thank you for your response. I will run a few additional tests and then answer your questions in more detail. Regarding the size of the workflow: I was referring to one with many jobs. The data per job would not be bigger than a PhononDos object from the phonon workflow or standard VASP outputs. Restarting the job solves the issue; I don't need to restart the runner. I mostly get REMOTE_ERROR, and sometimes a process stops in the middle (e.g., it gets stuck while downloading the data). An additional suspicion is that there could be a connectivity error within the flow.
I looked closer into the errors: it sometimes seems to pick up an old project and the paths of the outputs. Maybe related to #177.
Thanks for the updates. When you mention an "old project", do you really mean a different project with a different configuration file that is present in the Do you maybe have the stack trace reported when the jobs got into the REMOTE_ERROR state?
I am really referring to an old project. I will check if there is still an old jf runner running on a different computer and get back to you...
Thanks for the clarification. Do they use the same queue DB?
I think we can close this. I think there was simply a leftover jf remote running in the background on the other cluster since mid-August, even after logging out of the cluster.
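For anyone hitting the same symptom, a leftover runner can be spotted by scanning the process table on each machine that has access to the project. The following is only a minimal sketch: the helper names are hypothetical, and the search string `"jf runner"` is an assumption about how the runner appears in the command line on your system.

```python
import subprocess


def find_leftover_processes(ps_output, needle):
    """Return (pid, elapsed, command) tuples whose command contains `needle`.

    `ps_output` is the text produced by `ps -eo pid,etime,args`.
    """
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, elapsed time, full command line
        if len(parts) < 3:
            continue
        pid, elapsed, command = parts
        if needle in command:
            matches.append((int(pid), elapsed, command))
    return matches


def read_process_table():
    """Run ps and return its raw text output (POSIX systems only)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etime,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `find_leftover_processes(read_process_table(), "jf runner")` on the other cluster lists candidate processes together with their elapsed time, which makes a runner started weeks earlier easy to recognize.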
Thanks for the update.
I have been running into many remote errors when I start a larger workflow in the runner's interactive mode, but this does not happen when I start, for example, the Phonon workflow. I currently suspect that this might be related to the one connection to the remote cluster that is only established in the interactive mode. If I rerun the jobs, they will eventually run through.
Sometimes downloads also fail, and restarting lets the run proceed.
Could we do anything about this? For example, could we add the possibility to use more than one connection in interactive mode to make it more stable? I would be fine with entering more than one OTP if it helps with execution.
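The rerun-until-it-succeeds workaround described above can be sketched generically as a retry loop with exponential backoff. This is an illustration only, not part of the project's API: `run_with_retries` and its parameters are hypothetical names, and `operation` stands for whatever flaky remote step (for example a download) you want to repeat.

```python
import time


def run_with_retries(operation, max_attempts=3, base_delay=1.0,
                     retry_on=(OSError,), sleep=time.sleep):
    """Call `operation()` until it succeeds or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Only exceptions listed in `retry_on` trigger another attempt, so unrelated bugs still fail immediately. The `sleep` parameter is injectable so the waiting can be replaced in tests.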