Late-stage collapse of critic/score/mean in multihopqa-grpo-group4-qwen2.5_7b #5

@e3trange

Hi, thanks for releasing the Tree-GRPO code and experiments.

I noticed a possible late-stage training instability in the provided
multihopqa-grpo-group4-qwen2.5_7b run. Specifically, the logged metric
critic/score/mean rises normally and plateaus around 0.6 for a long
period, but then drops sharply near the end of training.

In addition, I ran a comparable experiment with Qwen2.5-3B-Instruct and observed
the same pattern: the reward/score collapses, and in my run the final
scores are all zeros.
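
To make "drops sharply" concrete, here is a minimal sketch of how the collapse point could be located in an exported critic/score/mean series. The function name, the window size, and the 50% drop threshold are illustrative assumptions, not anything taken from the Tree-GRPO code.

```python
from typing import Optional, Sequence


def find_collapse_step(
    scores: Sequence[float],
    window: int = 5,
    drop_ratio: float = 0.5,
) -> Optional[int]:
    """Return the first index at which the trailing-window mean falls below
    drop_ratio times the best trailing-window mean seen so far, else None.

    Hypothetical helper: `scores` is assumed to be the logged
    critic/score/mean values in step order.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    best = float("-inf")
    for end in range(window, len(scores) + 1):
        current = sum(scores[end - window:end]) / window
        # Only compare once a positive peak exists, so early near-zero
        # scores are not reported as a collapse.
        if best > 0 and current < drop_ratio * best:
            return end - 1  # index of the last point in the collapsed window
        best = max(best, current)
    return None


if __name__ == "__main__":
    # Made-up series for illustration only, not values from the actual run.
    example = [0.2, 0.4, 0.6, 0.6, 0.6, 0.6, 0.6, 0.1, 0.0, 0.0]
    print(find_collapse_step(example, window=3))
```

The same check could serve as a simple early-stopping trigger: keep the last checkpoint saved before the returned index.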

Could you help clarify whether this behavior is expected? In particular:

  1. Is the drop in critic/score/mean caused by policy degradation,
    reward parsing failures, or some evaluation/logging issue?
  2. Are there recommended hyperparameters to avoid this collapse, e.g.,
    smaller learning rate, stronger KL regularization, early stopping, or a
    smaller rollout/tree expansion budget?
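
For question 1, one way to separate reward parsing failures from genuine policy degradation might be to inspect dumped rollouts directly. The sketch below assumes that final answers are wrapped in `<answer>...</answer>` tags and that rollouts are available as (response text, score) pairs; both are assumptions on my side and may not match the actual Tree-GRPO output format.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumption: the final answer is wrapped in <answer>...</answer>.
# Adjust the pattern to whatever the real reward function expects.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def zero_score_breakdown(rollouts: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Split rollouts into 'no parseable answer' and 'parseable but scored zero'.

    Hypothetical helper: `rollouts` yields (response_text, score) pairs.
    """
    total = unparsed = parsed_zero = 0
    for text, score in rollouts:
        total += 1
        match = ANSWER_RE.search(text)
        if match is None or not match.group(1).strip():
            unparsed += 1  # missing or empty answer span: likely a format failure
        elif score == 0:
            parsed_zero += 1  # well-formed answer that is simply wrong
    if total == 0:
        return {"unparsed_frac": 0.0, "parsed_zero_frac": 0.0}
    return {
        "unparsed_frac": unparsed / total,
        "parsed_zero_frac": parsed_zero / total,
    }
```

If `unparsed_frac` jumps at the collapse point, the policy has probably stopped emitting the expected format; if `parsed_zero_frac` dominates instead, the answers are well-formed but wrong, which points more toward policy degradation.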

Thanks!
