[Feature Idea] Fake worker #50

Open
jmmshn opened this issue Jan 14, 2024 · 8 comments

Comments

@jmmshn

jmmshn commented Jan 14, 2024

Hi @gpetretto et al,
Thanks so much for open-sourcing this code, this is fantastic!
I have a couple of questions / (possibly too-specific) problems.

  • My main clusters do not support SSH private keys and only support multiplexing/ControlMaster, which does not seem to work with paramiko:
    Implement support for ControlMaster / ControlPath / etc paramiko/paramiko#852
  • A lot of people using atomate1 have a nice setup with a cron job that keeps resubmitting the same job, which simply calls fireworks. This lets them keep accumulating priority and move ahead in the queue, so this would be a nice feature to have depending on how your cluster determines priority.

In the current setup, I don't see anything about setting up just a USER on the local computer and having RUNNER + WORKER on the cluster.
Having something like this would effectively solve both problems for me, since the local machine just has to talk to the DBs without needing to access anything via SSH. Also, the cluster could then just keep submitting a job that says "get a job and run it."

I poked around the code and it looks like this should be possible, but it would be nice to have a "dummy WORKER" that just acts as a placeholder, so that USER and RUNNER talk to each other locally while REMOTE and WORKER talk inside each SLURM job. I guess you could have USER+RUNNER+REMOTE on the cluster and USER+RUNNER+(fake worker) locally. Then you could perform all the JFR operations on the cluster but still manipulate data and submit jobs locally.

Let me know if this makes sense; I'm willing to start working on this as a PR.

@gpetretto
Contributor

Hi @jmmshn, thanks for the interest in the code and for mentioning this use case. I have a few comments:

  1. I don't see a particular problem in using a configuration with USER and RUNNER+WORKER. I should point out that the original idea was to use a configuration with two machines, USER+RUNNER and WORKER, since this was the use case we had to deal with. For configurations that split USER and RUNNER there is a minor downside: some CLI operations (e.g. a reset of the database and a few others) are blocked while the daemon is active, to avoid the risk of inconsistencies in the DB. Of course this check becomes meaningless if USER and RUNNER are split, but this would likely not be an issue as long as a user does not perform such operations while the runner is active. I should add this point to the documentation.

  2. I am not sure if this is what you were implying, but we have basically constructed jobflow-remote in such a way that the compute node never accesses the DB. While it would be possible to change this to also allow a connection from a SLURM job to the DB, I think this could open a series of issues, given that the interactions with the DB were not designed with this in mind. At this stage I would thus refrain from introducing such a change.

  3. I think you can still achieve something similar to what you need using a feature that is not yet mentioned in the documentation: batch submission. This is intended to offer functionality similar to the one in fireworks, where you can submit a job containing rlaunch rapidfire rather than rlaunch singleshot. A worker is a batch worker if its batch section is filled in (see the batch: Optional[BatchConfig] = Field(...) attribute of the worker configuration). You can specify a maximum number of jobs for the worker in the worker.max_jobs value, and the Runner will take care of keeping a given number of jobs in the SLURM queue (given by min(number of current jobs assigned to the worker, max_jobs in the settings)). You can also specify the maximum number of jobflow Jobs executed in each batch SLURM job using the worker.batch.max_jobs setting (this should be equivalent to the rlaunch rapidfire --nlaunches setting in fireworks). Would this fit your use case? A rough sketch of what such a worker entry might look like is given below.
    I consider this still an experimental feature, as it has not been well tested, but I would be happy to hear your feedback if you are willing to try it.
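To make that concrete, here is a rough, hypothetical sketch of a batch worker entry, written as the Python dict a project file might deserialize into. Apart from max_jobs and the nested batch.max_jobs mentioned above, the field names and values are assumptions rather than jobflow-remote's actual schema.

```python
# Hypothetical sketch of a batch worker configuration, expressed as the dict a
# project file might deserialize into. Only `max_jobs` and `batch.max_jobs`
# come from the comment above; every other key is an assumption.
batch_worker = {
    "type": "remote",              # assumed: a worker reached over SSH
    "host": "my-cluster.example.edu",
    "scheduler_type": "slurm",
    "work_dir": "/scratch/me/jfr",
    "max_jobs": 10,                # Runner keeps up to this many SLURM jobs queued
    "batch": {                     # presence of this section marks a batch worker
        "max_jobs": 5,             # jobflow Jobs executed inside each batch SLURM job
    },
}
```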

@jmmshn
Author

jmmshn commented Jan 15, 2024

Thanks for the info!

I am not sure if this is what you were implying, but we have basically constructed jobflow-remote in such a way that the compute node never accesses the DB. While it would be possible to change this to also allow a connection from a SLURM job to the DB, I think this could open a series of issues, given that the interactions with the DB were not designed with this in mind. At this stage I would thus refrain from introducing such a change.

OK, I understand things better now. I guess from my POV, whether a SLURM job can access the DB is not super critical as long as the RUNNER and WORKER are on the same side of the SSH login (I really don't know how common a problem this is, since many clusters are moving to heavy 2FA configurations). The code change I was suggesting was just some simple options for the WORKER and RUNNER types to make sure the config and check functionality works; it should not affect how the code functions at all.

@jmmshn
Author

jmmshn commented Jan 17, 2024

I am not sure if this is what you were implying, but we have basically constructed jobflow-remote in such a way that the compute node never accesses the DB.

This brings up another usage-scenario question: I have another cluster that is not able to talk to my main DB, so this seems like exactly the situation you had in mind. But often the atomate2 jobs require access to the DB to pass data between each other. The atomate2 config will just call a VASP_CMD https://github.com/materialsproject/atomate2/blob/4869a352e65b9b68fae5283c5f94d9ed36c09207/src/atomate2/settings.py#L45
and expect the job to have access to a DB before writing the data.

I've skimmed the docs and code but cannot figure out how this is supposed to work with atomate2. Is there a basic configuration somewhere that I can look at?

@gpetretto
Contributor

I am not entirely sure if I understand correctly what you mean. If you are referring to the fact that some atomate2 Jobs have references to the outputs of other Jobs already completed (e.g. an NSCF calculation needs the outputs of a previous SCF calculation), this is already handled automatically. All the references are resolved by the Runner before uploading the files to the worker. In the same way, the outputs are written to a file, retrieved by the Runner, and inserted into the database. In this case, there is nothing to set up.
If instead there are atomate2 Jobs that actively access the DB during the job execution (i.e. inside a @job function or in the make method of a Maker), this would indeed not be possible. However, if the cluster does not have access to the DB there is no way around it. The only option would be to refactor the Job so that the data that needs to be fetched from the DB is given as a reference in the input of the Job. In this way jobflow-remote will retrieve the data from the DB as in the previous case.
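To make the reference pattern concrete, here is a minimal, generic jobflow example (not taken from atomate2; the function names and returned dicts are made up for illustration). The second Job receives an OutputReference rather than actual data, and the reference is resolved before the Job runs:

```python
from jobflow import Flow, job


@job
def scf():
    # Stand-in for an SCF step; in atomate2 this would be a full Maker/Job.
    return {"energy": -10.0, "dir_name": "/path/to/scf"}


@job
def nscf(scf_output):
    # scf_output is passed as a reference and resolved before this runs,
    # so the running Job never needs to query the database itself.
    return {"scf_energy": scf_output["energy"], "gap": 1.2}


scf_job = scf()
nscf_job = nscf(scf_job.output)  # an OutputReference, not the actual data yet
flow = Flow([scf_job, nscf_job])
```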

Does this answer your question?

@jmmshn
Author

jmmshn commented Jan 17, 2024

OK I think I'm almost at the eureka moment :)

I am not entirely sure if I understand correctly what you mean. If you are referring to the fact that some atomate2 Jobs have references to the outputs of other Jobs already completed (e.g. an NSCF calculation needs the outputs of a previous SCF calculation), this is already handled automatically. All the references are resolved by the Runner before uploading the files to the worker. In the same way, the outputs are written to a file, retrieved by the Runner, and inserted into the database. In this case, there is nothing to set up.

While most of the base VASP workflows use something like prev_vasp_dir to get the data from the last job, some other jobs might require something like a dynamically created response.output.chgcar. But it sounds like the files will be accessed by the RUNNER again to do the processing. The points I'm still confused about are:

  • How does output retrieval work with the RUNNER?
  • Is the output serialized first and copied over to the RUNNER? If so, what happens to large objects meant for the additional stores? Or is the entire output directory copied over?

Since some of my workflows rely heavily on the S3Store, I want to make sure I understand this really well so I can figure out whether everything is feasible.

@gpetretto
Contributor

Thanks for clarifying the issue. Your questions highlighted that these details really need to be addressed in the documentation.
I will try to give a more detailed explanation here.
Let me refer to the state evolution schema: https://matgenix.github.io/jobflow-remote/user/states.html#evolution.
The Runner process is started as a daemon on the RUNNER machine. The Runner loops over the different actions that it can perform, carrying them out and updating the state of the Jobs in the DB accordingly.

  • After a Job has been CHECKED_OUT, the Runner will proceed to upload the information required to run the Job. This includes:

    • resolving all the references of the Job from the DB (including everything in additional stores)
    • using that to generate a JSON representation of the Job without external references
    • uploading a JSON file with this information to the WORKER

    At this point the state of the Job is UPLOADED.

  • The Runner generates a SLURM (or PBS) submission script, uploads it, and submits the job. The Job is now SUBMITTED.

  • When the SLURM job starts running, the code deserializes the Job and executes its run method. Since all references are already resolved, no access to the DB is needed. The Store passed to the run method is a JSONStore, so the outputs are also stored as JSON files and do not touch the DB.

  • The Runner monitors the state of the SLURM job. When it is completed, it marks the Job as TERMINATED.

  • In the next step, the Runner fetches the JSON file containing the outputs locally (this step is skipped if the Worker is a local worker).

  • Finally, if everything went fine, it will use the downloaded output file to insert the data into the real output Store.

So basically, whichever maggma Store you use in your configuration, it is never accessed from the running Job, but only from the Runner. I hope this better clarifies how the code works; a rough sketch of the cycle is given below. Let me know if you have more specific questions.
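As a purely illustrative pseudo-sketch of this cycle: the state names follow the description above, while the db and worker objects and all of their methods are hypothetical stand-ins, not jobflow-remote's actual internals.

```python
# Illustrative pseudo-sketch of the per-Job cycle described above.
# `db` and `worker` and their methods are hypothetical stand-ins.
from enum import Enum


class JobState(Enum):
    CHECKED_OUT = "CHECKED_OUT"
    UPLOADED = "UPLOADED"
    SUBMITTED = "SUBMITTED"
    TERMINATED = "TERMINATED"
    COMPLETED = "COMPLETED"  # assumed terminal state, not named in the text above


def advance_job(job_doc, db, worker):
    """Advance a single Job by one step; the real Runner loops over many Jobs."""
    if job_doc.state == JobState.CHECKED_OUT:
        resolved = db.resolve_references(job_doc)   # pull inputs, incl. additional stores
        worker.upload_job_json(job_doc, resolved)   # self-contained JSON file on the worker
        job_doc.state = JobState.UPLOADED
    elif job_doc.state == JobState.UPLOADED:
        worker.submit_queue_script(job_doc)         # generate + submit the SLURM/PBS script
        job_doc.state = JobState.SUBMITTED
    elif job_doc.state == JobState.SUBMITTED and worker.queue_job_finished(job_doc):
        job_doc.state = JobState.TERMINATED
    elif job_doc.state == JobState.TERMINATED:
        outputs = worker.download_output_json(job_doc)  # skipped for a local worker
        db.insert_outputs(job_doc, outputs)             # only the Runner touches the real Store
        job_doc.state = JobState.COMPLETED
```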

To better understand your use case: when you refer to response.output.chgcar, do you mean that the full CHGCAR file is stored in the S3Store and needs to be retrieved as well?

@jmmshn
Author

jmmshn commented Jan 17, 2024

Great, I think I get it now!
So all the WORKER does is receive JSON files and produce new JSON files.

To better understand your use case: when you refer to response.output.chgcar, do you mean that the full CHGCAR file is stored in the S3Store and needs to be retrieved as well?

Correct. Using the charge density for subsequent steps is a kind of niche application, but many other workflows store the charge density, so the need to copy the serialized charge density either to or from the WORKER should be universal.
I don't know how many other people use the S3Store extensively in their workflows, but since I wrote it I use it quite a bit in my work. So I just want to make sure there are no restrictions when moving that kind of nested, large data blob back and forth.

From what I gather, as long as the data serializes properly to a JSON file, everything should be fine. I will test it and see.
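As a quick, illustrative sanity check of that, one can round-trip an object through JSON using monty's MSON helpers, which jobflow's serialization builds on. The pymatgen Structure here is just an arbitrary MSONable example object.

```python
# Illustrative round-trip test: if an object survives MontyEncoder -> JSON ->
# MontyDecoder, it should also survive being written to the JSON files that
# jobflow-remote moves between the RUNNER and the WORKER.
import json

from monty.json import MontyDecoder, MontyEncoder
from pymatgen.core import Lattice, Structure

obj = Structure(Lattice.cubic(3.0), ["Na"], [[0, 0, 0]])  # any MSONable object
blob = json.dumps(obj, cls=MontyEncoder)
roundtrip = json.loads(blob, cls=MontyDecoder)
assert roundtrip == obj
```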

@gpetretto
Contributor

To add a bit more information, we have used this together with maggma's AzureBlobStore, which we implemented by modeling it on the S3Store, and I can confirm that it works fine for basic atomate2 workflows. However, these typically use the additional store for data like band structures, which are limited in size. I would be very interested in the outcome of your test with larger data.
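For context, here is a minimal, unverified sketch of what wiring an S3Store as an additional store into the JobStore used by the Runner might look like; the database names, bucket, and "data" store name are placeholders.

```python
# Hypothetical sketch of a JobStore with an S3-backed additional store.
# Database names, the bucket, and the "data" store name are placeholders;
# with jobflow-remote only the Runner ever connects to these stores.
from jobflow import JobStore
from maggma.stores import MongoStore, S3Store

docs_store = MongoStore(database="jobflow_db", collection_name="outputs")
s3_index = MongoStore(database="jobflow_db", collection_name="data_index")
blob_store = S3Store(index=s3_index, bucket="my-jfr-bucket")

store = JobStore(docs_store, additional_stores={"data": blob_store})
```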
