Implementation of SGE interface #43
base: develop
Conversation
Hi @ml-evs 😃 Because I relied a lot on ChatGPT, I would be really glad about some feedback! 😁 And as none of our clusters use SGE, I wanted to at least check that the code produces the correct submission script before I pass it on for testing, because when I make a submission test using:

from jobflow_remote.utils.examples import add
from jobflow_remote import submit_flow
from jobflow import Flow

job1 = add(1, 2)
job2 = add(job1.output, 2)
flow = Flow([job1, job2])

resources = {
    "cwd": "",
    "model": "smp 128",
    "job_name": "Autoplex_jf_test",
    "qout_path": "$JOB_ID.log",
    "qerr_path": "$JOB_ID.err",
    "device": "cpu",
    "soft_walltime": "05:00:00",
}

print(submit_flow(flow, worker="auto_worker", resources=resources, project="auto"))

I run into the following problem:

Thank you a lot for your help and feedback!
Hi @QuantumChemist, nice work! I'll try to find the time to give some feedback before the end of the week. Thanks!
Fantastic work @QuantumChemist! Sorry for the delay giving you some feedback. We'll have to wait for @gpetretto to have a proper look I think but this is already in great shape.
Only one request (and a few minor comments below): could you add a test for submission script generation, similar to what I did for slurm in #45? Ideally this would use the same "maximalist" qresources but will probably support different features.
I guess (or hope!) you are adding this because you would like to use jobflow-remote with SGE, are you able to test this version of QToolkit with your current jobflow setup or do we need some patches in jobflow remote too? If you are, I can carry on the work I started to test hopefully both PBS and SGE directly inside the jobflow-remote CI.
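Not the actual test from #45, just a rough sketch of what the requested submission-script test could look like. The SGEIO/QResources usage (get_submission_script, the exact QResources fields) and the expected header lines are assumptions, not confirmed qtoolkit API:

```python
# Hypothetical sketch of an SGE submission-script test; adapt the field names and
# expected directives to the real qtoolkit API and templates.
from qtoolkit.core.data_objects import QResources
from qtoolkit.io.sge import SGEIO


def test_sge_submission_script_generation():
    sge_io = SGEIO()
    # assumed QResources fields; a "maximalist" version would set many more
    resources = QResources(job_name="test_job", processes=8, time_limit=3600)
    script = sge_io.get_submission_script(
        commands=["echo 'hello world'"], options=resources
    )
    assert script.startswith("#!/bin/bash")  # assumed shebang in the template
    assert "#$ -N test_job" in script        # SGE job-name directive
    assert "echo 'hello world'" in script
```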
src/qtoolkit/io/sge.py
Outdated
supported += [
    "njobs",
    "memory_per_thread",
    "time_limit",
    "processes",
    "processes_per_node",
    "process_placement",
    "nodes",
    "threads_per_process",
    "email_address",
    "scheduler_kwargs",
    "gpus_per_job",
]
Not a problem for this PR, but since #44 I've been wondering if we can programmatically enforce that the fields listed here are definitely QResources members, e.g., by specifying an abstract reference to QResources.njobs and so on, rather than unchecked strings.
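One possible way to do this (a sketch only; qtoolkit may prefer a different mechanism) is to validate the declared names against the QResources dataclass, assuming QResources is indeed a dataclass:

```python
# Illustrative sketch: fail loudly on typos instead of carrying unchecked strings.
from dataclasses import fields

from qtoolkit.core.data_objects import QResources

_QRESOURCES_FIELDS = {f.name for f in fields(QResources)}


def checked_supported(names: list[str]) -> list[str]:
    """Return `names` unchanged, raising if any entry is not a QResources field."""
    unknown = set(names) - _QRESOURCES_FIELDS
    if unknown:
        raise ValueError(f"Not QResources fields: {sorted(unknown)}")
    return names


# usage inside an IO class body (hypothetical):
# supported += checked_supported(["njobs", "memory_per_thread", "time_limit"])
```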
I'm happy to implement it that way :)
Good point! I agree that it can be for another PR, since all the objects should be adapted.
@ml-evs @gpetretto so should I rather leave this for another PR?
I added this now :)
Sorry, I am not sure, was something changed here?
Sorry, that was a bit confusing. I clicked through the PRs and have added this for now: https://github.com/QuantumChemist/qtoolkit/blob/573641ec772508c71f1f02101902976bf0e1ab80/tests/io/test_sge.py#L240
Hey @ml-evs! 😃 So far, I couldn't test it with jobflow-remote, although I put sge as the scheduler type. I also have a few days off and will try to make the changes when I'm back!
Co-authored-by: Matthew Evans <[email protected]>
Thanks a lot for adding SGE support.
I have left some comments below that might need some updates.
The main issue is that if SGE really does not allow querying jobs by a list of ids, we will need to update jobflow-remote, since it currently expects to be able to use that functionality. If you can confirm that this is the case, I will start thinking about how to adapt it. It would either have to ask for all the jobs (which would likely be inefficient), or I would need to add an option in jobflow-remote to query the jobs based on the user name.
Unfortunately I don't have access to a cluster with SGE, so I will not be able to run tests easily. @ml-evs do you have a docker with SGE?
Concerning your initial error in jobflow-remote, it is not obvious to guess what could go wrong. I would first check whether it is really using the SGEIO object.
Hi @gpetretto.
It's taking a bit longer than expected to generalise the test docker container in jobflow-remote for other queues, but I'll keep you posted here too.
Thanks both.
Sorry @gpetretto, this message kind of slipped my attention and I have just seen it now! I will ask someone to try this! Thank you!
Hi @gpetretto 👋🏻 so I asked someone whether it's possible to query with a list of job-IDs, and it's not directly possible, only with a bash script as a workaround.
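For reference, a rough sketch of the kind of client-side workaround mentioned above: since qstat apparently cannot take a list of job IDs, one listing call is made (e.g. qstat -u for the relevant user or all users) and the rows are filtered locally. The output parsing here is deliberately simplified and assumed:

```python
# Hypothetical workaround: list jobs once, then filter by job ID on the client side.
import subprocess


def qstat_lines_for_ids(job_ids: list[str], user: str = "*") -> list[str]:
    """Return the qstat listing lines whose first column matches one of job_ids."""
    result = subprocess.run(
        ["qstat", "-u", user], capture_output=True, text=True, check=True
    )
    wanted = set(job_ids)
    return [
        line
        for line in result.stdout.splitlines()
        if line.split() and line.split()[0] in wanted  # first column: job id
    ]
```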
Thanks @QuantumChemist.
Thank you a lot! :)
… a FINISHED job state and added another link for SGE documentation
Thanks @QuantumChemist for updating the code.
I left a few more comments.
Concerning #43 (comment) I think we can indeed address this in another PR. I will mark it as resolved.
For the TestSlurmIO::test_submission_script, I think the problem is that the maximalist_qresources fixture is defined with scope="session" but is then modified inside the tests (both SGE and SLURM). Maybe just remove the scope argument and let it use the default function scope. I suppose this should fix the issue.
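A minimal sketch of that suggestion, assuming maximalist_qresources simply builds a QResources instance (the fields shown are illustrative): with the default function scope each test gets a fresh object, so in-test mutations can no longer leak between the SGE and SLURM tests.

```python
import pytest

from qtoolkit.core.data_objects import QResources


@pytest.fixture  # no scope="session": the default function scope rebuilds it per test
def maximalist_qresources():
    # illustrative subset of fields; the real fixture sets "maximalist" values
    return QResources(job_name="test_job", nodes=1, processes=4, time_limit=3600)
```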
src/qtoolkit/io/pbs_base.py
Outdated
raise ValueError(f"Cannot query by user and job(s) in {self.get_system_name()}")

@abc.abstractmethod
def _get_base_command(self) -> list[str]:
Maybe this could have a more explicit name? Something like _get_status_base_command or _get_list_base_command. Any other option that clarifies the usage would be fine, because at first it is not clear what the "base" refers to.
Sure, I will rename it!
src/qtoolkit/io/pbs_base.py
Outdated
return v * (1024 ** power_labels[units.lower()])

_qresources_mapping = {
In moving this here, I see that the line
"account": "account"
was lost.
If the list of keys is different between PBS and SGE, move _qresources_mapping to the subclasses, each with its own list.
Thank you for catching that, that was not intended. I will add the account key back to PBS.
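A rough sketch (not the actual qtoolkit code) of the per-subclass option: each scheduler keeps its own _qresources_mapping, so the PBS dict can retain the "account" entry that was lost when the mapping moved into the shared base. The class names and mapping keys below are illustrative only.

```python
class PBSIOBase:
    # mapping from QResources field name to the scheduler-specific header key
    _qresources_mapping: dict[str, str] = {}


class PBSIO(PBSIOBase):
    _qresources_mapping = {
        "queue_name": "queue",
        "job_name": "job_name",
        "account": "account",  # restored for PBS
        "priority": "priority",
    }


class SGEIO(PBSIOBase):
    _qresources_mapping = {
        "queue_name": "queue",
        "job_name": "job_name",
        "priority": "priority",
    }
```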
src/qtoolkit/io/pbs_base.py
Outdated
"""Add soft_walltime if required by child classes (SGE).""" | ||
|
||
@abc.abstractmethod | ||
def get_system_name(self) -> str: |
Maybe this could be a class attribute instead of a method?
src/qtoolkit/io/pbs_base.py
Outdated
power_labels = self.get_power_labels()

if not units:
    units = self.get_default_unit()
It looks like this is used in the base class, but there is no abstract method for it. Maybe it can be a class attribute?
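A small sketch of the class-attribute approach suggested in the two comments above, with get_system_name(), get_default_unit() and get_power_labels() replaced by plain class attributes that subclasses override; the attribute names and values here are assumptions, not the final API:

```python
from __future__ import annotations

from abc import ABC


class PBSIOBase(ABC):
    system_name: str = "PBS-like"
    default_unit: str = "mb"
    # exponent n used as 1024**n when converting memory values (values illustrative)
    power_labels: dict[str, int] = {"kb": 0, "mb": 1, "gb": 2, "tb": 3}

    def _convert_memory(self, v: float, units: str | None = None) -> float:
        # mirrors the snippet above, but reads the class attributes directly
        if not units:
            units = self.default_unit
        return v * (1024 ** self.power_labels[units.lower()])


class SGEIO(PBSIOBase):
    system_name = "SGE"
    default_unit = "gb"  # illustrative
```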
Are you using the ml-evs/generalize-integration-tests branch because you are using the SGE docker to test? Otherwise, if you needed to filter the SGE jobs by username, you probably wanted to use this branch: Matgenix/jobflow-remote#187 (I actually just merged it, so you can take the develop one if you want). In any case, I would say that the error is not related to any update that was done in these branches of jobflow-remote. The error seems to suggest that it is trying to use …
I think I simply used the wrong jobflow-remote branch. We will try again with the one you just mentioned. Thank you a lot!
It seems we also forgot to change the scheduler type. Now when we want to switch the scheduler type to sge, we cannot restart the daemon. Do you have an idea what we could try?
@gpetretto I have now implemented the last changes 😃
Hi @QuantumChemist, thanks for the updates. Still, I have seen that you have moved the …
If it is not a problem with the configuration file (i.e. you can run the …
Hi @gpetretto! Sorry for confusing those two things (I'm still learning a lot about Python object-oriented programming).
Thanks @QuantumChemist. It should be fine now.
Thank you, too!
I might need to check more closely with my colleague whether there is a problem with the output writing.
Thanks, there are almost certainly still problems at my end, but it would be good to verify that it's working for you before I dive into debugging again!
I will let you know when I have checked this with my colleague. I haven't had the time yet to continue this with her.
Hi @gpetretto and @ml-evs 😃, I have some good news! The strange thing about this is that the VASP job is finishing fine and there is also no output in queue.err. There is also no other file that would indicate a segmentation fault etc. when executing VASP. We were not able to find out why exactly the job was killed. All the jobflow-remote-related json files look perfectly fine from what I can tell.
Thanks for this, I might be seeing the same issue in my testing PR. Let me push a bit further with your latest changes and I'll try to get back to you!
Alright 👍🏻
Hi @QuantumChemist, thanks for testing the new implementation.
Hi @gpetretto, thank you for your feedback! I will check the things you pointed out together with my colleague!
Hi @gpetretto,
From what I saw when we tested it together, it looked to me like there was some delay in writing the files. I can't say whether this comes from SGE or the cluster Epsilon. Do you have an idea how to handle the delay?
Hi @QuantumChemist, thanks a lot for the detailed reply.
If that is the problem, I can propose 2 solutions:
I would ask you to try option 1 first. This should confirm that the issue is indeed the FS delay and probably give you a rough estimate of the waiting time required. If this turns out to be a real use case, we can then consider implementing the option. Would that be fine? In any case, this definitely seems to be a jobflow-remote issue and not related to your SGE implementation. If you confirm that the problem is the delay, I will open an issue there to discuss it further. This should not be a reason to hold off merging this PR.
Will close Matgenix/jobflow-remote#159 in jobflow-remote.