Some thoughts for improvements
Regarding oat/oat/learners/base.py, lines 582 to 608 at commit 4540740: instead of calling fut.result() for each param, we would save the dispatch latency if we called update_weight and broadcast for every single param first, and then waited on all the futs at the end. My understanding is that they will be dispatched as a series of NCCL calls and will respect the order in which they are dispatched.
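Roughly what I have in mind, as a minimal sketch with illustrative names (actor.update_weight is assumed here to return a future with .result(); this is not necessarily the exact oat API):

```python
import torch.distributed as dist

def push_all_params(model, actors, group):
    futs = []
    for name, param in model.named_parameters():
        # 1) Tell every actor a tensor is coming (non-blocking futures).
        futs += [
            actor.update_weight(name, dtype=param.dtype, shape=param.shape)
            for actor in actors
        ]
        # 2) Enqueue the matching NCCL broadcast from the learner side.
        dist.broadcast(param.data, src=0, group=group)

    # 3) Block only once, after every pair has been dispatched. NCCL executes
    # collectives in enqueue order, so each broadcast stays matched with its
    # update_weight.
    for fut in futs:
        fut.result()
```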
It may be possible to broadcast different params from different learners, so that the communication bandwidth is maximally used, but with some caveats: (1) we may need different communication groups; (2) we need some coordination mechanism to make sure the broadcast / update_weight pairs are in the right order. Is it possible to add all the actors to the DeepSpeed communication group so that they get parameter updates, but without having them participate in the training?
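A rough sketch of the sharding coordination, under the assumption that we can build one small process group per owning learner and that the actors know the static name-to-owner schedule (all names here are hypothetical, not oat's implementation):

```python
import torch.distributed as dist

def push_params_sharded(model, actors, owner_groups, learner_ranks, my_rank):
    n_owners = len(learner_ranks)
    futs = []
    for i, (name, param) in enumerate(model.named_parameters()):
        # Static schedule: every learner and actor derives the same mapping,
        # so each update_weight is paired with the broadcast on the right group.
        owner = i % n_owners
        if my_rank == learner_ranks[owner]:
            for actor in actors:
                futs.append(actor.update_weight(
                    name, dtype=param.dtype, shape=param.shape, group_id=owner))
            dist.broadcast(param.data, src=learner_ranks[owner],
                           group=owner_groups[owner])
    for fut in futs:
        fut.result()
```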
Regarding oat/oat/learners/base.py, lines 575 to 579 at commit 4540740: can we avoid this polling? An idea is to create an NCCL broadcast independent of the vLLM update_weight call. The actors would receive weight updates in another thread and cache them; then, in the actor's step function, we check this cache and apply the weights when they are available. In this way we maximize the communication/computation overlap.
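A minimal sketch of the receive-and-cache idea on the actor side; the parameter metadata list, source rank, and apply_weights hook are placeholders rather than oat's actual interfaces:

```python
import threading
import queue
import torch
import torch.distributed as dist

class WeightReceiver:
    def __init__(self, param_meta, src_rank, group):
        # param_meta: list of (name, shape, dtype) announced by the learner.
        # Assumes this weight-update group is used only from the receiver thread.
        self.param_meta, self.src, self.group = param_meta, src_rank, group
        self.cache = queue.Queue()  # received snapshots waiting to be applied
        threading.Thread(target=self._recv_loop, daemon=True).start()

    def _recv_loop(self):
        while True:
            snapshot = {}
            for name, shape, dtype in self.param_meta:
                buf = torch.empty(shape, dtype=dtype, device="cuda")
                dist.broadcast(buf, src=self.src, group=self.group)
                snapshot[name] = buf
            self.cache.put(snapshot)  # overlaps with ongoing generation

    def maybe_apply(self, apply_weights):
        # Called from the actor's step(): apply only the most recent snapshot,
        # and only if one has arrived; never block the step on communication.
        snapshot = None
        while True:
            try:
                snapshot = self.cache.get_nowait()
            except queue.Empty:
                break
        if snapshot is None:
            return False
        apply_weights(snapshot)  # e.g. load the tensors into the vLLM engine
        return True
```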