
Implementation of SGE interface #43

Open
wants to merge 37 commits into base: develop

Conversation

QuantumChemist

Will close Matgenix/jobflow-remote#159 in jobflow-remote.

@QuantumChemist QuantumChemist marked this pull request as ready for review August 7, 2024 15:33
@QuantumChemist
Author

QuantumChemist commented Aug 7, 2024

Hi @ml-evs 😃
I have finished the first version of the implementation, but it's very hacky because I relied a lot on ChatGPT, so I would be really glad about some feedback! 😁

And as none of our clusters use SGE, I wanted to at least check that the code produces the correct submission script before I pass it on for testing.
So I wanted to ask whether there is something else I have to adjust in jobflow-remote,

because when I run the following submission test:

from jobflow_remote.utils.examples import add
from jobflow_remote import submit_flow
from jobflow import Flow

job1 = add(1, 2)
job2 = add(job1.output, 2)

flow = Flow([job1, job2])

resources = {"cwd": "",
            "model": "smp 128",
            "job_name": "Autoplex_jf_test",
            "qout_path": "$JOB_ID.log",
            "qerr_path": "$JOB_ID.err",
            "device": "cpu",
            "soft_walltime": "05:00:00"}

print(submit_flow(flow, worker="auto_worker", resources=resources, project="auto"))

I run into the following problem:

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│     created_on = '2024-08-07 16:45'                                                                                                                        │
│          db_id = '2632'                                                                                                                                    │
│          index = 1                                                                                                                                         │
│       metadata = {}                                                                                                                                        │
│           name = 'add'                                                                                                                                     │
│        parents = []                                                                                                                                        │
│ previous_state = 'UPLOADED'                                                                                                                                │
│       priority = 0                                                                                                                                         │
│         remote = {                                                                                                                                         │
│                      'step_attempts': 3,                                                                                                                   │
│                      'retry_time_limit': '2024-08-07 17:11',                                                                                               │
│                      'error': Traceback (most recent call last):                                                                                           │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/jobs/jobcontroller.py", line 3333, in             │
│                  lock_job_for_update                                                                                                                       │
│                      **kwargs,                                                                                                                             │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/jobs/runner.py", line 538, in advance_state       │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/jobs/runner.py", line 685, in submit              │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/remote/queue.py", line 155, in submit             │
│                      options=options,                                                                                                                      │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/remote/queue.py", line 186, in                    │
│                  write_submission_script                                                                                                                   │
│                      options=options,                                                                                                                      │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/jobflow_remote/remote/queue.py", line 104, in                    │
│                  get_submission_script                                                                                                                     │
│                      return ""                                                                                                                             │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/qtoolkit/io/base.py", line 57, in get_submission_script          │
│                      if header := self.generate_header(options):                                                                                           │
│                    File "/home/certural/miniconda3/envs/auto/lib/python3.10/site-packages/qtoolkit/io/base.py", line 88, in generate_header                │
│                      raise ValueError(msg)                                                                                                                 │
│                  ValueError: The following keys are not present in the template: cwd, device, model, soft_walltime                                              │
│                                                                                                                                                            │
│                  }                                                                                                                                         │
│        run_dir = '/home/certural/Calcs_new/67/c5/8f/67c58fcb-d847-4fa1-8d29-9262ce0efa92_1'                                                                │
│          state = 'REMOTE_ERROR'                                                                                                                            │
│     updated_on = '2024-08-07 17:11'                                                                                                                        │
│           uuid = '67c58fcb-d847-4fa1-8d29-9262ce0efa92'                                                                                                    │
│         worker = 'bam_worker'                                                                                                                              │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Thank you a lot for your help and feedback!
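For context on the ValueError: qtoolkit builds the submission header by substituting the resource keys into the selected scheduler's header template, and generate_header (qtoolkit/io/base.py in the traceback above) rejects any key that has no matching placeholder in that template. A minimal sketch of that kind of check, using a deliberately incomplete, hypothetical template rather than qtoolkit's real one:

import re

# Hypothetical SGE-style header template; the real templates live in qtoolkit/io/.
header_template = """#$ -N ${job_name}
#$ -o ${qout_path}
#$ -e ${qerr_path}
"""

# The resources dict from the example above.
options = {
    "cwd": "",
    "model": "smp 128",
    "job_name": "Autoplex_jf_test",
    "qout_path": "$JOB_ID.log",
    "qerr_path": "$JOB_ID.err",
    "device": "cpu",
    "soft_walltime": "05:00:00",
}

# Only keys that appear as ${...} placeholders in the template are accepted.
template_keys = set(re.findall(r"\$\{(\w+)\}", header_template))
unknown = sorted(set(options) - template_keys)
if unknown:
    print("The following keys are not present in the template:", ", ".join(unknown))
    # -> cwd, device, model, soft_walltime

In other words, the error means the template actually used for header generation (apparently not the new SGE one in this case) has no placeholders for cwd, device, model and soft_walltime.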

@ml-evs
Member

ml-evs commented Aug 7, 2024

Hi @QuantumChemist, nice work! I'll try to find the time to give some feedback before the end of the week. Thanks!

@ml-evs ml-evs self-requested a review August 15, 2024 22:58
Member

@ml-evs ml-evs left a comment

Fantastic work @QuantumChemist! Sorry for the delay in giving you some feedback. We'll have to wait for @gpetretto to have a proper look, I think, but this is already in great shape.

Only one request (and a few minor comments below): could you add a test for submission script generation, similar to what I did for slurm in #45? Ideally this would use the same "maximalist" qresources, though it will probably support different features.

I guess (or hope!) you are adding this because you would like to use jobflow-remote with SGE. Are you able to test this version of QToolkit with your current jobflow setup, or do we need some patches in jobflow-remote too? If you are, I can carry on the work I started to hopefully test both PBS and SGE directly inside the jobflow-remote CI.
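A minimal sketch of what such a test could look like (field names and assertions are illustrative only; the real test should mirror the slurm test and the shared "maximalist" QResources fixture from #45):

from qtoolkit.core.data_objects import QResources
from qtoolkit.io.sge import SGEIO


def test_submission_script():
    # Illustrative resources; the real test should use the shared maximalist fixture.
    resources = QResources(
        queue_name="main",
        job_name="test_job",
        processes=8,
        time_limit=3600,
    )
    script = SGEIO().get_submission_script("echo 'test'", options=resources)
    # The exact directives depend on the SGE header template in this PR; at minimum
    # the job name and the command should appear in the generated script.
    assert "test_job" in script
    assert "echo 'test'" in script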

src/qtoolkit/io/sge.py: 2 outdated review comments (resolved)
Comment on lines 513 to 525
supported += [
    "njobs",
    "memory_per_thread",
    "time_limit",
    "processes",
    "processes_per_node",
    "process_placement",
    "nodes",
    "threads_per_process",
    "email_address",
    "scheduler_kwargs",
    "gpus_per_job",
]
Member

Not a problem for this PR, but since #44 I've been wondering if we can programmatically enforce that the fields listed here are definitely QResources members, e.g. by specifying an abstract reference to QResources.njobs and so on, rather than unchecked strings.
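One way this could look, as a sketch only (it assumes QResources is a dataclass, and the helper name is made up, not part of this PR):

from dataclasses import fields

from qtoolkit.core.data_objects import QResources


def check_supported_qresources_keys(supported: list[str]) -> None:
    """Hypothetical guard: fail fast if a listed name is not an actual QResources field."""
    valid = {f.name for f in fields(QResources)}
    unknown = set(supported) - valid
    if unknown:
        raise ValueError(f"Not QResources fields: {sorted(unknown)}")


check_supported_qresources_keys(["njobs", "time_limit", "processes_per_node"])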

Author

I'm happy to implement it that way :)

Contributor

Good point! I agree that it can be for another PR, since all the objects should be adapted.

Author

@ml-evs @gpetretto so shall I rather leave this for another PR?

Author

I added this now :)

Contributor

Sorry, I am not sure: was something changed here?

Author

Sorry, that was a bit confusing. I clicked through the PRs and have added this
https://github.com/QuantumChemist/qtoolkit/blob/573641ec772508c71f1f02101902976bf0e1ab80/tests/io/test_sge.py#L240
for now.

@ml-evs ml-evs requested a review from gpetretto August 16, 2024 10:21
@QuantumChemist
Author

Fantastic work @QuantumChemist! Sorry for the delay in giving you some feedback. We'll have to wait for @gpetretto to have a proper look, I think, but this is already in great shape.

Only one request (and a few minor comments below): could you add a test for submission script generation, similar to what I did for slurm in #45? Ideally this would use the same "maximalist" qresources, though it will probably support different features.

I guess (or hope!) you are adding this because you would like to use jobflow-remote with SGE. Are you able to test this version of QToolkit with your current jobflow setup, or do we need some patches in jobflow-remote too? If you are, I can carry on the work I started to hopefully test both PBS and SGE directly inside the jobflow-remote CI.

Hey @ml-evs! 😃
Thank you so much for your feedback.
I'm all for adding a test as in #45.

So far, I couldn't test it with jobflow-remote: although I put sge as the scheduler_type, I still get the problem described above, and I haven't had time yet to investigate it further.

I also have a few days off and will try to make the changes when I'm back!

Co-authored-by: Matthew Evans <[email protected]>
Contributor

@gpetretto gpetretto left a comment

Thanks a lot for adding SGE support.
I have left some comments below on points that might need updating.
The main issue is that if SGE really does not allow querying jobs by a list of ids, we will need to update jobflow-remote, since it currently expects to be able to use that functionality. If you can confirm that this is the case, I will start thinking about how to adapt it. It would either ask for all the jobs (which would likely be inefficient), or I would need to add an option in jobflow-remote to query the jobs based on the user name.
Unfortunately I don't have access to a cluster with SGE, so I will not be able to run tests easily. @ml-evs do you have a docker with SGE?

Concerning your initial error in jobflow-remote, it is not obvious what could be going wrong. I would first check whether it is really using the SGEIO object.
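For illustration, a rough sketch of what a username-based listing could look like on the SGE side (not the PR's actual code; SGE's qstat accepts -u to filter by user and -xml for machine-readable output):

def sge_jobs_list_cmd(user: str | None = None) -> str:
    """Build a qstat command that lists jobs, optionally filtered by user."""
    cmd = "qstat -xml"  # XML output is easier to parse than the default table
    if user:
        cmd += f" -u {user}"
    return cmd


print(sge_jobs_list_cmd("certural"))  # qstat -xml -u certural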

src/qtoolkit/io/sge.py: 8 outdated review comments (resolved)
Comment on lines 513 to 525
supported += [
    "njobs",
    "memory_per_thread",
    "time_limit",
    "processes",
    "processes_per_node",
    "process_placement",
    "nodes",
    "threads_per_process",
    "email_address",
    "scheduler_kwargs",
    "gpus_per_job",
]
Contributor

Good point! I agree that it can be for another PR, since all the objects should be adapted.

src/qtoolkit/io/sge.py: outdated review comment (resolved)
@QuantumChemist
Author

Hi @gpetretto,
Thank you a lot for having a look at my PR. I think I will need quite a while to properly work through everything.
One problem I have is that I don't have access to a system with SGE (as I'm not implementing this for myself), but I can ask people to test it. Therefore I just started from what I found in the aiida and SGE manuals.

@ml-evs
Member

ml-evs commented Aug 21, 2024

Hi @gpetretto . Thank you a lot for having a look at my PR. I think I will need quite a while to properly work through everything. One problem that I have is that I don't have access to a system with SGE (as I'm not implementing this for myself) but I can ask people to test it. Therefore I just started with what I have found in aiida and SGE manuals.

It's taking a bit longer than expected to generalise the test docker container in jobflow-remote for other queues, but I'll keep you posted here too.

@gpetretto
Contributor

Thanks both.
@QuantumChemist in the meantime, can you check with someone who has access to SGE whether it is actually possible to query with qstat using a list of job ids? I have sketched an implementation in jobflow-remote that relies on the username, and it seems feasible. If it is not possible to query by job ids, I will finalize it.

@QuantumChemist
Author

Thanks both. @QuantumChemist in the meantime, can you check with someone who has access to SGE whether it is actually possible to query with qstat using a list of job ids? I have sketched an implementation in jobflow-remote that relies on the username, and it seems feasible. If it is not possible to query by job ids, I will finalize it.

Sorry @gpetretto, this message slipped my attention and I have only just seen it! I will ask someone to try this. Thank you!

@QuantumChemist
Author

Thanks both. @QuantumChemist in the meantime, can you check with someone who has access to SGE whether it is actually possible to query with qstat using a list of job ids? I have sketched an implementation in jobflow-remote that relies on the username, and it seems feasible. If it is not possible to query by job ids, I will finalize it.

Hi @gpetretto 👋🏻

So I asked someone whether it's possible to query with a list of job IDs, and it's not directly possible, only with a bash workaround (job_ids=(202354 202353); for job_id in "${job_ids[@]}"; do qstat -j $job_id; done). Querying a single job ID looks like:

[screenshot: qstat -j output for a single job ID]

@gpetretto
Contributor

Thanks @QuantumChemist.
I will then add the option based on the username to jobflow-remote. Otherwise even with this implementation, SGE would not be usable.

@QuantumChemist
Author

Thanks @QuantumChemist. I will then add the option based on the username to jobflow-remote. Otherwise even with this implementation, SGE would not be usable.

Thank you a lot! :)

Contributor

@gpetretto gpetretto left a comment

Thanks @QuantumChemist for updating the code.
I left a few more comments.

Concerning #43 (comment) I think we can indeed address this in another PR. I will mark it as resolved.

For TestSlurmIO::test_submission_script, I think the problem is that the maximalist_qresources fixture is defined with scope="session" but is then modified inside the tests (both SGE and SLURM). Maybe just remove the scope and let it use the default function scope (see the sketch below). I suppose this should fix the issue.
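A sketch of the suggested fixture change (the field values are illustrative, not the actual fixture from #45):

import pytest

from qtoolkit.core.data_objects import QResources


@pytest.fixture  # default function scope: each test gets a fresh object to mutate
def maximalist_qresources():
    return QResources(
        queue_name="test_queue",
        job_name="test_job",
        nodes=1,
        processes=4,
        time_limit=3600,
    )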

raise ValueError(f"Cannot query by user and job(s) in {self.get_system_name()}")

@abc.abstractmethod
def _get_base_command(self) -> list[str]:
Contributor

Maybe this could have a more explicit name? Something like _get_status_base_command or _get_list_base_command. Any other option that clarifies the usage would be fine, because at first it is not clear what the "base" is for.

Author

sure, I will rename it!


return v * (1024 ** power_labels[units.lower()])

_qresources_mapping = {
Contributor

In moving this here I see that the line
"account": "account"
was lost.
If the list of keys is different between PBS and SGE, move _qresources_mapping to the subclasses with their own lists.

Author

@QuantumChemist QuantumChemist Sep 25, 2024

Thank you for catching that, that was not intended. I will add the account key back to PBS.

"""Add soft_walltime if required by child classes (SGE)."""

@abc.abstractmethod
def get_system_name(self) -> str:
Contributor

Maybe this can be a class attribute instead of a method?

power_labels = self.get_power_labels()

if not units:
    units = self.get_default_unit()
Contributor

It looks like this is used in the base class, but there is no abstract method for it.
Maybe it can be a class attribute?
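A standalone sketch of what the class-attribute variant could look like (illustrative names and values, not the PR's actual classes):

class SGEMemorySettings:
    """Plain class attributes replacing the abstract getter methods."""

    SYSTEM_NAME = "SGE"
    DEFAULT_UNIT = "M"
    POWER_LABELS = {"k": 0, "m": 1, "g": 2, "t": 3}  # exponent of 1024 per unit letter

    def convert_memory(self, value: float, units: str | None = None) -> float:
        units = units or self.DEFAULT_UNIT
        return value * (1024 ** self.POWER_LABELS[units.lower()])


print(SGEMemorySettings().convert_memory(2, "G"))  # 2 * 1024**2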

@QuantumChemist
Author

QuantumChemist commented Sep 25, 2024

Hey!
So this is what it looks like when my colleague uses my qtoolkit repo and the ml-evs/generalize-integration-tests jobflow-remote branch:
[screenshot: error output from the test run]

Did we maybe take the wrong jobflow-remote branch?

@gpetretto
Contributor

Are you using the ml-evs/generalize-integration-tests branch because you are using the SGE docker to test? Otherwise, if you needed to filter the SGE jobs by username you probably wanted to use this branch Matgenix/jobflow-remote#187 (I actually just merged it, so you can take the develop one if you want)

In any case, I would say that the error is not related to any update that was done in these branches of jobflow-remote. The error seems to suggest that it is trying to use PBSIO rather than SGEIO. Although I admit it is trivial, the first thing I would suggest checking is that the scheduler_type value in the configuration file has been switched to SGE.

@QuantumChemist
Author

Are you using the ml-evs/generalize-integration-tests branch because you are using the SGE docker to test? Otherwise, if you needed to filter the SGE jobs by username you probably wanted to use this branch Matgenix/jobflow-remote#187 (I actually just merged it, so you can take the develop one if you want)

In any case, I would say that the error is not related to any update that was done in these branches of jobflow-remote. The error seems to suggest that it is trying to use PBSIO rather than SGEIO. Although I admit it is trivial, the first thing I would suggest checking is that the scheduler_type value in the configuration file has been switched to SGE.

I think I simply used the wrong jobflow-remote branch. We will try with the one you just mentioned again. Thank you a lot!

@QuantumChemist
Author

Are you using the ml-evs/generalize-integration-tests branch because you are using the SGE docker to test? Otherwise, if you needed to filter the SGE jobs by username you probably wanted to use this branch Matgenix/jobflow-remote#187 (I actually just merged it, so you can take the develop one if you want)

In any case, I would say that the error is not related to any update that was done in these branches of jobflow-remote. The error seems to suggest that it is trying to use PBSIO rather than SGEIO. Although I admit it is trivial, the first thing I would suggest checking is that the scheduler_type value in the configuration file has been switched to SGE.

It seems we also forgot to change the scheduler type.

Now, when we want to switch the scheduler type to sge, we cannot restart the daemon. Do you have an idea what we could try?

@QuantumChemist
Author

@gpetretto I have now implemented the last changes 😃
Let me know if more adjustments are needed!

@gpetretto
Contributor

Hi @QuantumChemist, thanks for the updates. However, I have seen that you have moved the _qresources_mapping and the other methods to be attributes of the instances rather than of the classes. I know it will not change much from a practical point of view, but can you make them class attributes, for uniformity with the other ones?

Now, when we want to switch the scheduler type to sge, we cannot restart the daemon. Do you have an idea what we could try?

If it is not a problem with the configuration file (i.e. jf project check runs successfully), I would suggest looking at the logs in ~/.jfremote/PROJECT_NAME/log/runner.log to see if you can find any error, and/or running the runner with jf runner run. This will start the runner in the foreground. It is not really meant for standard execution, but it is a quick way to get the stacktrace if the runner fails for whatever reason.

@QuantumChemist
Author

Hi @gpetretto! Sorry for confusing those two things (I'm still learning a lot about Python object-oriented programming).
Is it OK like this?

@gpetretto
Contributor

Thanks @QuantumChemist. It should be fine now.
Did you manage to test this successfully together with jobflow-remote? Or are you still experiencing issues?

@QuantumChemist
Author

Thanks @QuantumChemist. It should be fine now. Did you manage to test this successfully together with jobflow-remote? Or are you still experiencing issues?

Thank you, too!
My colleague's server (with SGE) is down this week and I have some days off until next week, so I haven't been able to test it yet.

@ml-evs
Member

ml-evs commented Oct 11, 2024

Matgenix/jobflow-remote#160 (comment)

@QuantumChemist
Author

Matgenix/jobflow-remote#160 (comment)

I might check more closely with my colleague whether there is a problem with output writing.

@ml-evs
Member

ml-evs commented Oct 11, 2024

Thanks, there's almost certainly still problems at my end, but it would be good to verify that it's working for you before I dive into debugging again!

@QuantumChemist
Author

Thanks, there's almost certainly still problems at my end, but it would be good to verify that it's working for you before I dive into debugging again!

I will let you know when I have checked this with my colleague. I haven't had the time yet to continue this with her.

@QuantumChemist
Author

QuantumChemist commented Oct 24, 2024

Hi @gpetretto and @ml-evs 😃 ,

I have some good news!
Today I was finally able to extensively check my implementation with my colleague via videocall, and in principle the SGE implementation works. We are able to successfully run an "add" workflow for testing and everything completes fine (i.e. all job states are "COMPLETED" in the end).
But when we want to run a DoubleRelax VASP workflow, we are stuck at the following step:
[screenshot: jobflow-remote error for the stuck VASP job]

The strange thing about this is that the VASP job finishes fine and there is also no output in queue.err. There is also no other file that would indicate a segmentation fault etc. when executing VASP. We were not able to find out why exactly the job was killed. All the jobflow-remote related JSON files look perfectly fine from what I can tell.
It's a bit hard to say where this issue is rooted: in the cluster, qtoolkit, or jobflow-remote.
Do you have any ideas? Thank you a lot!

@ml-evs
Member

ml-evs commented Oct 24, 2024

I have some good news! Today I was finally able to extensively check my implementation with my colleague via videocall and in principle the SGE implementation works. We are able to successfully run a "add" workflow for testing and everything completes fine

Thanks for this, I might be seeing the same issue in my testing PR. Let me push a bit further with your latest changes and I'll try to get back to you!

@QuantumChemist
Author

I have some good news! Today I was finally able to extensively check my implementation with my colleague via videocall and in principle the SGE implementation works. We are able to successfully run a "add" workflow for testing and everything completes fine

Thanks for this, I might be seeing the same issue in my testing PR. Let me push a bit further with your latest changes and I'll try to get back to you!

Alright 👍🏻

@gpetretto
Contributor

gpetretto commented Oct 25, 2024

Hi @QuantumChemist, thanks for testing the new implementation.
I should say I have no idea why such an error would happen only when switching to atomate2 jobs. Does SGE have any command that provides details on a terminated job? Just to know whether it is indeed possible that SGE killed the job.
Just a few words to clarify the meaning of the error: jobflow-remote writes the output to a file Store in the run folder; for atomate2 you would expect to have a remote_job_data.json and an additional_store_data.json file in the folder. The rest of the response is written to the jfremote_out.json file. However, the jfremote_out.json is written first at the beginning, with just the starting time of the execution of the job: https://github.com/Matgenix/jobflow-remote/blob/82457776ed3ffa5fabca74daa4ee7af24cbaab4e/src/jobflow_remote/jobs/run.py#L40. The response is written at the end: https://github.com/Matgenix/jobflow-remote/blob/82457776ed3ffa5fabca74daa4ee7af24cbaab4e/src/jobflow_remote/jobs/run.py#L82.
If the VASP calculation completed, but the jfremote_out.json does not contain the response, the problem must have happened between the end of the simulation and the dumping of the file.
A few things that could be checked are:

  • does the remote_job_data.json contain the TaskDoc with the output of the VASP job?
  • does the jfremote_out.json indeed contain only the start_time? Or also something else, but not the response?
  • could it be that the final jfremote_out.json was written somewhere else, for example in the home directory? (I hardly believe so, but just to check)
  • have files like queue.err, queue.out, and the JSONs been zipped by the atomate2 job?
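As a quick way to check the second point, something along these lines could be run against the job folder (a sketch only: start_time and response are the keys described above, and the path is a placeholder):

import json
from pathlib import Path

run_dir = Path("/path/to/run_dir")  # placeholder: the job's run_dir on the cluster
data = json.loads((run_dir / "jfremote_out.json").read_text())

print("keys present:", sorted(data))
print("only start_time so far:", set(data) == {"start_time"})
print("response present:", "response" in data)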

@QuantumChemist
Author

Hi @gpetretto,

Thank you for your feedback! I will check the things you pointed out together with my colleague!

@QuantumChemist
Author

Hi @gpetretto,
My colleague has investigated the problem further with your tips, and here is what she says:

Hi Chris,

Just gone through the comments:

  • SGE command to get detailed job information: qstat -explain c -j JOB_ID. I used this to track the jobs whilst running, but it showed no weird signs as far as I can tell, and once the job is completed it no longer shows any information.
  • remote_job_data.json does seem to contain the TaskDoc he mentions, with the outputs e.g. free energy, time elapsed etc within it
  • I’m pretty sure the jfremote_out.json has not been written anywhere else, not sure how I’d extensively check this but can’t see any files anywhere else
  • jfremote_out.json does contain the start_time and the response
  • The queue.err, queue.out, and json files have not been zipped. All the VASP outputs have been zipped. (In the successful “add” workflows the err, out and json files were not zipped, so I assume this is a good sign)

However, I have been doing some more experimenting this evening and have managed to get both Python and VASP test runs working. I kept getting different errors today, looking like this:

[screenshot: REMOTE_ERROR raised while trying to download the output files]

And I think the issue here is that it was taking a very long time for jfremote_out.json to be written to the output folder after a calculation was completed, so the job would be marked as a REMOTE_ERROR, and then the out file would sometimes appear half an hour later. Running jf job retry -s REMOTE_ERROR after that seemed to fix the workflow.
It all seems a bit inconsistent though, and maybe dependent on how heavily Epsilon is being used at the time and so how long it takes to write the outputs.

From what I have seen when we tested it together, it looked to me like there was some delay in writing the files. I can't say if this comes from SGE or the cluster Epsilon. Do you have an idea how to handle the delay?

@gpetretto
Contributor

Hi @QuantumChemist,

thanks a lot for the detailed reply.
From what you report, it indeed seems related to a delay between the end of the SGE job and the availability of the files on the file system of the front end of the cluster. I suppose this might be caused by a slow/overloaded NFS for example.
The output files like jfremote_out.json are written by jobflow inside the SGE job. Jobflow-remote's runner will just check if the SGE job is completed and then proceed with the download of the files. So here you are hitting two different kinds of issues:

  1. if the job is very short (like in the last error), the jfremote_out.json is not present yet, which gives you that error while trying to download the files
  2. for a more normal job, the file is created at the beginning of the job and, since the job takes some time to execute, it gets synchronized. But if, when the job ends, it takes some time to synchronize the updated version, the stale file will be downloaded, so the "download" phase completes correctly, and the job then fails at the "complete" step because the response is missing.

If that is the problem, I can propose 2 solutions:

  1. in the project configuration you can add a post_run for the worker, and put a sleep there (a sketch follows after this list). This will be executed after the jfremote_out.json has been created, so it will give the system time to synchronize it. I have no clue how long this sleep needs to be, as it should probably be based on a reasonable estimate of how long the system takes. This will waste some resources, since the sleep will be inside the SGE job. However, if it is just a few seconds and that seems acceptable, then it could be the solution.
  2. If the synchronization could instead take some time and adding a sleep for all the jobs would end up wasting a lot of computational resources, I think we can consider adding a delay_download option to the worker configuration that postpones the start of the download by a specified number of seconds. I have an idea about how to do that, but I will need to check that it really works fine.
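As a sketch of option 1, the worker section of the project YAML could gain something like this (worker name and sleep duration are placeholders):

workers:
  sge_worker:
    post_run: sleep 60   # give the shared filesystem time to sync jfremote_out.json before download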

I would ask you to try option 1 first. This should confirm that the issue is indeed the FS delay and probably give you a rough estimate of the waiting time required. If this is a real use case, we can then consider implementing the option. Would that be fine?

In any case, this definitely seems to be a jobflow-remote issue and not related to your SGE implementation. If you confirm that the problem is the delay, I will open an issue there to discuss it further. This should not be a reason to hold off merging this PR.
@ml-evs any update on your side about the tests?

Successfully merging this pull request may close these issues.

How could I use SGE for job submission?
3 participants