
Revisit and maybe optimize Collectors #1069

Open
MischaPanch opened this issue Mar 4, 2024 · 0 comments
Labels: optimization (Performance optimization: throughput, memory, processing speed), tentative (Up to discussion, may be dismissed)

@MischaPanch (Collaborator) commented:

The main assumption Tianshou holds is that batch-style data transfer can cut a lot of overhead: by sending data to the GPU in batches, we improve GPU utilization and hence overall system throughput. That's why the initial version of the collector is batch-style.

This assumption rests on several constraints:

  1. We cannot easily achieve the same throughput by sending data to the GPU sequentially as we can with batching
  2. The model is relatively small and not memory-bound
  3. The environment's step function takes little time (including reward calculation), at least less than a policy forward pass
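Under these constraints, the batch-style pattern looks roughly like the following. This is a minimal sketch with toy stand-ins (`ToyVecEnv`, `toy_policy`, `batch_collect` are illustrative names, not Tianshou's actual Collector API):

```python
import numpy as np

# Toy stand-ins for illustration only; Tianshou's real Collector,
# policy, and vectorized-env interfaces differ.
class ToyVecEnv:
    """Minimal lockstep vectorized env: all sub-envs step together."""
    def __init__(self, num_envs: int, obs_dim: int):
        self.num_envs, self.obs_dim = num_envs, obs_dim

    def reset(self):
        return np.zeros((self.num_envs, self.obs_dim)), {}

    def step(self, actions):
        obs = np.random.randn(self.num_envs, self.obs_dim)
        rew = np.ones(self.num_envs)
        done = np.zeros(self.num_envs, dtype=bool)
        return obs, rew, done, done, {}

def toy_policy(obs_batch):
    # One batched "forward" amortizes per-call overhead (and, on a GPU,
    # the host-to-device transfer) over all environments.
    return obs_batch.sum(axis=1, keepdims=True)

def batch_collect(policy, vec_env, n_steps):
    """Batch-style rollout: one policy call per step for all envs.
    Every step, the whole batch waits for the slowest env, which is
    acceptable while constraint (3) holds."""
    obs, _ = vec_env.reset()
    transitions = []
    for _ in range(n_steps):
        act = policy(obs)
        obs_next, rew, term, trunc, _ = vec_env.step(act)
        transitions.append((obs, act, rew, obs_next))
        obs = obs_next
    return transitions

traj = batch_collect(toy_policy, ToyVecEnv(num_envs=4, obs_dim=3), n_steps=5)
```

The key property is the synchronization barrier each step: every environment and the policy advance in lockstep, which is only efficient while env steps are cheap relative to the policy forward pass.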

These are very strong constraints. If any of them fails, we can switch to a fully async rollout implementation to get better throughput, i.e., a shorter wall-clock `collector.collect` time. For example, in the RLHF case:

  • An LLM's completion function can be implemented in a fully async style and achieve the same throughput as batch completion, as long as you provide enough threads/processes to handle each request. That invalidates (1) and (2);
  • The environment needs a reward model to calculate rewards. In batch style, we must finish all policy sampling first, synchronize, and only then run the reward calculation; the system may become environment-throughput-bound because not enough compute is invested in the reward side. But if policy and reward calculation run fully asynchronously, all those pipeline bubbles disappear. That invalidates (3).

Originally posted by @Trinkle23897 in #1058 (comment)
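The fully-async alternative described in the quoted comment can be sketched with `asyncio`. Here `complete` and `score` are hypothetical stand-ins for an async LLM completion call and an async reward-model call (simulated with sleeps); this is not a real RLHF API:

```python
import asyncio
import random

# Hypothetical async stand-ins: in an RLHF setting, `complete` would
# wrap an LLM completion backend and `score` a reward-model service.
async def complete(prompt: str) -> str:
    await asyncio.sleep(random.uniform(0.0, 0.01))  # simulated latency
    return prompt + " <completion>"

async def score(completion: str) -> float:
    await asyncio.sleep(random.uniform(0.0, 0.01))  # simulated latency
    return float(len(completion))

async def rollout_one(prompt: str):
    # Policy sampling and reward scoring are chained per request, with
    # no global sync barrier: each trajectory continues as soon as its
    # own completion is ready, removing the batch "bubbles".
    completion = await complete(prompt)
    reward = await score(completion)
    return prompt, completion, reward

async def collect(prompts):
    # All rollouts are in flight concurrently; throughput is limited by
    # backend capacity, not by the slowest member of a batch.
    return await asyncio.gather(*(rollout_one(p) for p in prompts))

results = asyncio.run(collect([f"prompt-{i}" for i in range(8)]))
```

Compared to the batch-style loop, no trajectory ever waits for a global synchronization point between policy sampling and reward calculation.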

@MischaPanch MischaPanch added tentative Up to discussion, may be dismissed optimization Performance optimization (throughout, memory, processing speed) labels Mar 4, 2024
@MischaPanch MischaPanch added this to the Release 2.0.0 milestone Mar 20, 2024