Add TextArena and Connect4 rubric examples (RFC 004)#341
Darktex wants to merge 1 commit into `feat/rubrics-core`
Conversation
Demonstrates rubric integration patterns with two environments:

TextArena (Wordle):
- WordleRubric composite with greens, yellows, repetitions, correct
- Migrates from legacy RewardProvider to Rubric pattern
- Full backwards compatibility via get_reward_signals()

Connect4:
- Connect4WinLossRubric trajectory rubric for terminal games
- Demonstrates exponential discounting for credit assignment
- Shows reset() integration with environment lifecycle

30 tests covering both environment rubrics.
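The Wordle side of the PR is built around a composite rubric. The following is a minimal, hypothetical sketch of that pattern, not the PR's actual implementation: the `Rubric` base class is a stand-in for the RFC 004 interface, the observation shape (a `feedback` string of `G`/`Y`/`X` marks) is assumed, and the repetitions sub-rubric is omitted for brevity.

```python
class Rubric:
    """Stand-in for the RFC 004 Rubric interface (assumed shape)."""

    def __call__(self, action, observation):
        raise NotImplementedError


class GreensRubric(Rubric):
    def __call__(self, action, observation):
        # Fraction of letters marked green, e.g. feedback "GYXXG" -> 0.4.
        feedback = observation["feedback"]
        return feedback.count("G") / len(feedback)


class YellowsRubric(Rubric):
    def __call__(self, action, observation):
        feedback = observation["feedback"]
        return feedback.count("Y") / len(feedback)


class CorrectRubric(Rubric):
    def __call__(self, action, observation):
        # 1.0 only when the entire word was guessed correctly.
        feedback = observation["feedback"]
        return 1.0 if feedback == "G" * len(feedback) else 0.0


class WordleRubric(Rubric):
    """Composite rubric: weighted sum of named sub-rubric scores."""

    def __init__(self, weights=None):
        self.subs = {
            "greens": GreensRubric(),
            "yellows": YellowsRubric(),
            "correct": CorrectRubric(),
        }
        self.weights = weights or {name: 1.0 for name in self.subs}

    def __call__(self, action, observation):
        return sum(self.weights[name] * sub(action, observation)
                   for name, sub in self.subs.items())

    def get_reward_signals(self, action, observation):
        # Per-signal dict, mirroring what the legacy RewardProvider exposed,
        # which is how backwards compatibility is preserved.
        return {name: sub(action, observation) for name, sub in self.subs.items()}
```

The key design point is that `get_reward_signals()` keeps the old multi-signal surface while `__call__` collapses the same sub-rubrics into a single scalar for training.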
Greptile Overview

Greptile Summary

This PR demonstrates successful rubric integration patterns for two environments per RFC 004.

Key Changes
Architecture Alignment

The implementation correctly follows RFC 004 patterns:
Both implementations respect the "rewards inside environment" principle (INVARIANTS.md), keeping reward computation server-side within the environment boundary.

Confidence Score: 5/5
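The "rewards inside environment" principle means the training loop never calls the rubric directly; the environment invokes it during `step()` and wires `reset()` into its own lifecycle. A hypothetical sketch of that wiring (class and method names here are illustrative, not the PR's actual API):

```python
class Connect4Environment:
    """Illustrative environment with optional rubric support (assumed API)."""

    def __init__(self, rubric=None):
        # Rubric is optional: the environment works with or without one.
        self.rubric = rubric
        self.board = None

    def reset(self):
        if self.rubric is not None:
            self.rubric.reset()  # clear any buffered trajectory state
        self.board = [[None] * 7 for _ in range(6)]
        return {"board": self.board, "done": False}

    def step(self, action):
        observation = self.apply_move(action)
        # Reward computation stays inside the environment boundary:
        # the rubric is called server-side, never by the training loop.
        reward = self.rubric(action, observation) if self.rubric else 0.0
        return observation, reward

    def apply_move(self, action):
        # Game logic elided; returns a terminal observation for illustration.
        return {"board": self.board, "done": True, "outcome": "draw"}
```

Making the rubric optional keeps existing callers of the Connect4 environment working unchanged, which matches the "Add optional rubric support" change listed below.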
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Training as Training Loop
    participant Env as Environment
    participant Rubric as Rubric
    participant Game as Game Logic
    Note over Training,Game: Episode Start
    Training->>Env: reset()
    Env->>Rubric: reset()
    Note over Rubric: Clear trajectory buffer
    Env->>Game: Initialize game state
    Env-->>Training: Initial observation
    Note over Training,Game: Game Loop
    loop Until done
        Training->>Env: step(action)
        Env->>Game: Apply action
        Game-->>Env: New game state
        Env->>Rubric: __call__(action, observation)
        alt Not Done (intermediate step)
            Rubric->>Rubric: Append to trajectory
            Rubric-->>Env: 0.0 (intermediate reward)
        else Done (terminal step)
            Rubric->>Rubric: Append to trajectory
            Rubric->>Rubric: score_trajectory()
            Note over Rubric: Compute final score<br/>(win=1.0, loss=0.0, draw=0.5)
            Rubric-->>Env: Final score
        end
        Env-->>Training: Observation with reward
    end
    Note over Training,Game: Episode Complete
    Training->>Rubric: compute_step_rewards()
    Note over Rubric: Apply discounting:<br/>r_t = gamma^(T-1-t) * final_score
    Rubric-->>Training: Per-step rewards for training
```
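The flow in the diagram can be sketched as a small trajectory rubric. This is a hypothetical reconstruction from the diagram, not the code in `envs/connect4_env/rubrics.py`: the observation keys (`done`, `outcome`) and the default `gamma` are assumptions.

```python
class Connect4WinLossRubric:
    """Trajectory rubric for terminal games (sketch based on the diagram).

    Buffers steps during the episode, scores the full trajectory at the
    terminal step (win=1.0, loss=0.0, draw=0.5), then exposes discounted
    per-step rewards for credit assignment.
    """

    def __init__(self, gamma=0.95):
        self.gamma = gamma
        self.trajectory = []
        self.final_score = 0.0

    def reset(self):
        # Called from the environment's reset(): clear the trajectory buffer.
        self.trajectory = []
        self.final_score = 0.0

    def __call__(self, action, observation):
        self.trajectory.append((action, observation))
        if not observation.get("done"):
            return 0.0  # intermediate steps carry no immediate reward
        self.final_score = self.score_trajectory(observation)
        return self.final_score

    def score_trajectory(self, terminal_observation):
        # "win" | "loss" | "draw" keys are assumed for illustration.
        outcome = terminal_observation["outcome"]
        return {"win": 1.0, "loss": 0.0, "draw": 0.5}[outcome]

    def compute_step_rewards(self):
        # Exponential discounting back from the terminal step:
        #   r_t = gamma^(T-1-t) * final_score
        T = len(self.trajectory)
        return [self.gamma ** (T - 1 - t) * self.final_score for t in range(T)]
```

For example, with `gamma=0.5` and a two-step winning episode, the per-step rewards come out as `[0.5, 1.0]`: later moves receive more credit for the terminal win.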
@kashif @sergiopaniego this is relevant to TRL examples
Summary
Demonstrates rubric integration patterns with two environments per RFC 004:
TextArena (Wordle)
- `WordleRubric` composite rubric with greens, yellows, repetitions, and correct sub-rubrics
- Migrates from the legacy `RewardProvider` to the `Rubric` pattern
- Full backwards compatibility via `get_reward_signals()`

Connect4

- `Connect4WinLossRubric` trajectory rubric for terminal games
- Demonstrates exponential discounting for credit assignment
- `reset()` integration with the environment lifecycle

Changes
New files
- `envs/textarena_env/rubrics.py` - Wordle rubric implementation
- `envs/connect4_env/rubrics.py` - Connect4 trajectory rubric
- `tests/envs/test_textarena_rubrics.py` - 18 tests
- `tests/envs/test_connect4_rubrics.py` - 12 tests

Modified files
- `envs/textarena_env/server/environment.py` - Use rubric instead of RewardProvider
- `envs/connect4_env/server/connect4_environment.py` - Add optional rubric support

Test plan
Dependencies
This PR depends on #340 (Rubric base system).