feat: Added support for PyTorch Lightning in the DDP backend. #162
base: master
Conversation
Seems legit, just a question from my own lack of knowledge about DDP.
How is the user expected to actually launch a DDP run? Seems everything here is set up with the expectation that they launch NePS several times, where the rank0 run will pick up the trials and the rest...?
Given this, can you actually run two DDP runs in tandem?
@eddiebergman In the current approach, without any modifications, each NePS launch would create and register a new worker for the same pipeline, which is undesirable. Instead, we try to register a worker only for the NePS launch that is on rank 0. However, I currently believe this method is not compatible with multiple DDP runs in tandem: if we poll the current evaluating config from the shared state, a non-rank-zero process cannot tell which of the evaluating configs belongs to its own run.
There are some filesystem/worker changes coming from #161, but in the meantime I think the most manageable approach would be the following. I'm not sure how the DDP sub-process spawning is handled by Lightning, but if it's possible to pass env vars to the spawned workers, then this could be the easiest way to make this work. For example, all spawned workers could do:

```python
import os

if (config_id := os.getenv("NEPS_DDP_PROCESS_TRIAL")) is not None:
    ...  # do what you are doing for ddp workers
else:
    ...  # regular flow
```

Do you think this could work? It's definitely possible with the default …
@eddiebergman, the approach you mention may work, but do you believe it would be possible for DDP-spawned workers to pick up the config from their own parent worker only, and not from the parent worker of another DDP run? I have made a new commit (3af8969) that now allows the non-rank-zero processes to pick up the config only from the same parent worker at all times. However, this approach might fail if one of the spawned workers picks up the config from the wrong parent at the beginning. That seems to be the only failure case of the approach I implemented. Please let me know if you have any suggestions.
Assuming you can pass env vars, I don't see why it wouldn't work, and it doesn't risk that any worker might pick up the wrong config, i.e. "hey, I am the rank0 process and I have sampled config_14, and you are my spawned DDP worker with your env var telling you to evaluate config_14."

Seems much easier than having to read the entire state, figure out which config is evaluating, and make sure to hopefully pick up the correct one. Might be missing something here, but this eliminates all guesswork?
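A minimal sketch of the env-var handshake described above (the variable name `NEPS_DDP_PROCESS_TRIAL` is taken from the snippet earlier in this thread; the `launch_trial` / `run_lightning_training` helpers are hypothetical and not NePS or Lightning API):

```python
import os

# Sketch only: the rank-0 NePS worker writes the sampled trial id into an env var
# *before* the DDP sub-processes are spawned, so every child inherits it and knows
# exactly which config it belongs to.

def launch_trial(trial_id: str) -> None:
    os.environ["NEPS_DDP_PROCESS_TRIAL"] = trial_id  # inherited by spawned children
    run_lightning_training()  # placeholder for the actual Trainer.fit(...) call


def run_lightning_training() -> None:
    config_id = os.getenv("NEPS_DDP_PROCESS_TRIAL")
    if config_id is not None:
        print(f"DDP worker evaluating trial {config_id}")  # no guesswork needed
    else:
        print("Regular (non-DDP) flow")
```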
@eddiebergman, I tried to implement the approach as you described, but the issue is with the way the DDP process groups get initialized.
I get your point, makes sense. I would need to see how the DDP workers get spawned. But if the spawned workers share the exact same environment as their parent, then passing the trial id through an env var should just work. If they don't share the same environment, we would need some other channel to pass that information.

But at any rate, we need to transfer information from the main worker to the spawned workers that is unique to those spawned workers, i.e. imagine workers main1 and main2 spawn 8 sub-workers each. The sub-workers need some information to know which is their parent in order to do the right thing. Guessing which config to pick up is just going to lead to more issues later on, and even worse, it might be a silent issue that we would never know about, as there's no explicit error that would be raised.

For example, in the case above, we could have 14/16 sub-workers pick up config1 from main1 and only 2/16 sub-workers pick up config2 from main2. You might start concluding that somehow the hyperparameters are affecting runtime, where really it's just a silent bug on our part.
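One way to make the parent → sub-worker relationship explicit (purely illustrative; the env var `NEPS_PARENT_WORKER_ID` is hypothetical and not part of NePS) would be for each main worker to stamp its own id into the environment before spawning:

```python
import os
import uuid

# Illustrative sketch: main1 and main2 each export a unique id before spawning,
# so their sub-workers can tell which parent they belong to instead of guessing
# from the shared state.

def tag_children_with_parent_id() -> str:
    # Called in the main worker before any sub-processes are created.
    return os.environ.setdefault("NEPS_PARENT_WORKER_ID", uuid.uuid4().hex)

def parent_id_of_this_worker() -> str | None:
    # In a spawned sub-worker this returns the id inherited from its parent.
    return os.getenv("NEPS_PARENT_WORKER_ID")
```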
@eddiebergman, I have combined both your approach and mine to make the config sharing among workers free of any risks. It's easy to share the environment variable in the beginning (before process-group initialization) because the child processes inherit the parent's environment. However, for subsequently sampled trials, we would have to manually update the environment variables in the children's environments with the id of the newly sampled trial, which requires tensor broadcasting or something similar and makes the process even more complex. Here's how the current approach works:

I have tested this approach for multiple DDP runs and it seems to work each time. Please let me know what you think. If this looks fine, I can make another commit with a NePS example for PyTorch Lightning with DDP.
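For reference, a hedged sketch of what the "tensor broadcasting or something similar" for subsequently sampled trials could look like, using `torch.distributed.broadcast_object_list` (an illustration under the assumption that the process group is already initialized, not necessarily what this PR implements):

```python
import torch.distributed as dist

# Sketch only: once the DDP process group exists, child environments can no longer
# be updated via inheritance, so rank 0 broadcasts the newly sampled trial id.

def share_trial_id(trial_id: str | None) -> str | None:
    payload = [trial_id]  # filled on rank 0, overwritten on the other ranks
    dist.broadcast_object_list(payload, src=0)
    return payload[0]
```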
This pull request includes several changes to improve the handling of distributed data parallel (DDP) setups and trial evaluation in the `neps` runtime. The changes focus on adding support for evaluating trials in a DDP context and ensuring proper state management.

DDP and Trial Evaluation Enhancements:

- Added a `_is_ddp_and_not_rank_zero` function to check whether the current process is part of a DDP setup and is not the rank-zero process. (`neps/runtime.py`, R49-R66)
- Added a `_launch_ddp_runtime` function to handle the evaluation of trials in a DDP setup. This function ensures that only the rank-zero process launches a new worker. (`neps/runtime.py`, R512-R531)
- Updated the `_launch_runtime` function to use `_launch_ddp_runtime` when in a DDP setup and not rank zero. This prevents non-rank-zero processes from launching new workers. (`neps/runtime.py`, R550-R556)

State Management Improvements:

- Added an `evaluating` method to the `FileBasedTrialStore` class to retrieve all evaluating trials. (`neps/state/filebased.py`, R212-R220)
- Added a `get_current_evaluating_trial` method to the `NepsState` class to get the current trial being evaluated. (`neps/state/neps_state.py`, R217-R222)
- Added an `evaluating` method to the `TrialStore` protocol to standardize the retrieval of evaluating trials across different implementations. (`neps/state/protocols.py`, R141-R144)

These changes collectively enhance the `neps` runtime's ability to manage and evaluate trials efficiently, especially in distributed computing environments.
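For readers unfamiliar with the rank check, here is a rough sketch of the kind of test `_is_ddp_and_not_rank_zero` performs, based on the environment variables DDP launchers commonly set (`WORLD_SIZE`, `RANK`, `LOCAL_RANK`); the actual implementation in `neps/runtime.py` may differ:

```python
import os

# Sketch only: torchrun / Lightning typically export these variables for DDP runs.
def is_ddp_and_not_rank_zero() -> bool:
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size <= 1:
        return False  # not a distributed run
    rank = int(os.environ.get("RANK", os.environ.get("LOCAL_RANK", "0")))
    return rank != 0
```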