Implement TrajectoryRubric and ExponentialDiscountingTrajectoryRubric by Darktex · Pull Request #338 · meta-pytorch/OpenEnv

Darktex · 2026-01-28T22:01:57Z

Summary

Initial implementation of trajectory-based rubrics for delayed rewards, as specified in RFC 004's "Delayed Rewards" section (see #337).

What's Implemented

New Files

| File | Description |
| --- | --- |
| `src/openenv/core/rubrics/__init__.py` | Package exports |
| `src/openenv/core/rubrics/base.py` | `Rubric` base class with nn.Module-like API |
| `src/openenv/core/rubrics/trajectory.py` | `TrajectoryRubric` and `ExponentialDiscountingTrajectoryRubric` |
| `tests/core/test_rubrics/test_base_rubric.py` | Base class tests (15 tests) |
| `tests/core/test_rubrics/test_trajectory_rubric.py` | Trajectory rubric tests (23 tests) |

Rubric Base Class

forward(action, observation) -> float: Abstract method to implement
__call__(): Synchronous evaluation with pre/post hooks
Child auto-registration when rubrics are assigned as attributes
children(), named_children(), rubrics(), named_rubrics(): Iteration
get_rubric(path): Access nested rubrics by dot-separated path
state_dict() / load_state_dict(): Serialization support
last_score: Tracks most recent evaluation result
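
To make the nn.Module-like behaviour listed above concrete, here is a minimal sketch of how attribute-based child registration, post hooks, `last_score`, and dot-path lookup could fit together. It is not the code from `base.py`: the class name `SketchRubric`, the method `register_post_hook`, and the example subclasses are hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Iterator, List, Tuple


class SketchRubric(ABC):
    """Illustrative stand-in for a Rubric base class (assumed design, not the PR's code)."""

    def __init__(self) -> None:
        # Create bookkeeping containers without triggering our own __setattr__.
        object.__setattr__(self, "_children", {})
        object.__setattr__(self, "_post_hooks", [])
        self.last_score = None

    def __setattr__(self, name: str, value: Any) -> None:
        # Auto-register child rubrics when they are assigned as attributes.
        if isinstance(value, SketchRubric):
            self._children[name] = value
        object.__setattr__(self, name, value)

    @abstractmethod
    def forward(self, action: Any, observation: Any) -> float:
        """Subclasses compute the score here."""

    def register_post_hook(self, hook: Callable[..., None]) -> None:
        # Hypothetical hook API: called as hook(rubric, action, observation, score).
        self._post_hooks.append(hook)

    def __call__(self, action: Any, observation: Any) -> float:
        score = self.forward(action, observation)
        self.last_score = score
        for hook in self._post_hooks:
            hook(self, action, observation, score)
        return score

    def named_children(self) -> Iterator[Tuple[str, "SketchRubric"]]:
        return iter(self._children.items())

    def get_rubric(self, path: str) -> "SketchRubric":
        # Walk a dot-separated path such as "outer.inner".
        node = self
        for part in path.split("."):
            node = node._children[part]
        return node


class LengthRubric(SketchRubric):
    def forward(self, action: Any, observation: Any) -> float:
        # Toy score: action length scaled into [0, 1].
        return min(len(str(action)) / 10.0, 1.0)


class Composite(SketchRubric):
    def __init__(self) -> None:
        super().__init__()
        self.length = LengthRubric()  # registered as a child automatically

    def forward(self, action: Any, observation: Any) -> float:
        return self.length(action, observation)


composite = Composite()
score = composite("hello", None)
child = composite.get_rubric("length")
```

In this sketch, calling the parent also updates `last_score` on the child, because the parent's `forward` goes through the child's `__call__`.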

TrajectoryRubric

Accumulates (action, observation) pairs internally
Returns intermediate_reward until observation.done=True
Abstract score_trajectory(trajectory): Compute final score
Abstract compute_step_rewards(): Define credit assignment strategy
reset(): Clear trajectory on env.reset()
trajectory: Read-only property for current trajectory
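
The accumulate-then-score behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of `trajectory.py`: `SketchTrajectoryRubric`, `Obs`, and `WinRubric` are hypothetical names, the `winner` field on the observation is assumed, and `compute_step_rewards()` is left out.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Obs:
    """Hypothetical observation type with the `done` flag the rubric inspects."""
    done: bool = False
    winner: Optional[str] = None


class SketchTrajectoryRubric:
    """Accumulates (action, observation) pairs and scores only at episode end."""

    def __init__(self, intermediate_reward: float = 0.0) -> None:
        self.intermediate_reward = intermediate_reward
        self._trajectory: List[Tuple[Any, Any]] = []

    @property
    def trajectory(self) -> List[Tuple[Any, Any]]:
        # Return a copy so callers cannot mutate the internal buffer.
        return list(self._trajectory)

    def reset(self) -> None:
        self._trajectory.clear()

    def score_trajectory(self, trajectory: List[Tuple[Any, Any]]) -> float:
        raise NotImplementedError

    def __call__(self, action: Any, observation: Any) -> float:
        self._trajectory.append((action, observation))
        if getattr(observation, "done", False):
            return self.score_trajectory(self._trajectory)
        return self.intermediate_reward


class WinRubric(SketchTrajectoryRubric):
    def score_trajectory(self, trajectory: List[Tuple[Any, Any]]) -> float:
        _, final_obs = trajectory[-1]
        return 1.0 if final_obs.winner == "agent" else 0.0


rubric = WinRubric()
first = rubric("move-1", Obs())                             # episode still running
last = rubric("move-2", Obs(done=True, winner="agent"))     # episode finished
steps_seen = len(rubric.trajectory)
rubric.reset()
```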

ExponentialDiscountingTrajectoryRubric

Standard gamma-based discounting: r_t = gamma^(T-1-t) * R_final, where T is the number of steps in the episode and t is the zero-based step index
gamma=1.0: Equal credit to all steps
gamma=0.0: Only the final step gets reward
gamma=0.99: Standard RL discounting (later steps receive more credit)
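
As a worked example of the formula above, the helper below computes the per-step rewards for a three-step episode with a final score of 1.0. The function name `discounted_step_rewards` is hypothetical and stands apart from the PR's `compute_step_rewards()`; only the formula itself is taken from the description.

```python
from typing import List


def discounted_step_rewards(final_score: float, num_steps: int, gamma: float) -> List[float]:
    """Apply r_t = gamma^(T-1-t) * R_final for t = 0 .. T-1."""
    last_index = num_steps - 1
    return [gamma ** (last_index - t) * final_score for t in range(num_steps)]


equal_credit = discounted_step_rewards(1.0, 3, gamma=1.0)   # [1.0, 1.0, 1.0]
final_only = discounted_step_rewards(1.0, 3, gamma=0.0)     # [0.0, 0.0, 1.0]
halved = discounted_step_rewards(1.0, 3, gamma=0.5)         # [0.25, 0.5, 1.0]
```

The gamma=0.0 case works because Python evaluates `0.0 ** 0` as `1.0`, so the final step keeps the full score while every earlier step is zeroed.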

Current Status

This PR provides the core infrastructure for trajectory-based rubrics. The classes are fully functional and tested, but not yet integrated with environments.

What Works

Creating custom trajectory rubrics by subclassing
Accumulating trajectories during episodes
Computing discounted per-step rewards
State serialization/deserialization
Hook-based observability
All 38 tests passing

What's Missing (Follow-up PRs)

| PR | Description | Dependencies |
| --- | --- | --- |
| Environment Integration | Update `Environment` base class to require `rubric` attribute and call rubric during `step()` | This PR |
| Container Rubrics | `Sequential`, `Gate`, `WeightedSum`, `RubricList`, `RubricDict` | This PR |
| LLMJudge | Rubric that calls LLM via MCP for evaluation | Container Rubrics |
| Example Migration | Add trajectory rubric to `connect4_env` or `openspiel_env` | Environment Integration |

Follow-up Plan

PR 3: Container Rubrics (next)

# New containers for rubric composition
Sequential(*rubrics)      # Fail-fast chain
Gate(rubric, threshold)   # Threshold gating  
WeightedSum(rubrics, weights)  # Weighted combination
RubricList(rubrics)       # Dynamic list container
RubricDict({name: rubric})  # Named rubric dispatch
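
The containers above are planned rather than implemented, and this PR does not pin down their semantics. The sketch below shows one possible reading of threshold gating and weighted combination using plain functions. The names `gate` and `weighted_sum`, and the choice to return 0.0 when a gate fails, are assumptions made for illustration only.

```python
from typing import Any, Callable, Sequence

# Any callable taking (action, observation) and returning a float score.
ScoreFn = Callable[[Any, Any], float]


def gate(rubric: ScoreFn, threshold: float) -> ScoreFn:
    """Pass the score through only if it reaches the threshold (assumed: else 0.0)."""
    def gated(action: Any, observation: Any) -> float:
        score = rubric(action, observation)
        return score if score >= threshold else 0.0
    return gated


def weighted_sum(rubrics: Sequence[ScoreFn], weights: Sequence[float]) -> ScoreFn:
    """Combine several scores as sum(w_i * score_i)."""
    if len(rubrics) != len(weights):
        raise ValueError("rubrics and weights must have the same length")

    def combined(action: Any, observation: Any) -> float:
        return sum(w * r(action, observation) for r, w in zip(rubrics, weights))
    return combined


def always_one(action: Any, observation: Any) -> float:
    return 1.0


def always_half(action: Any, observation: Any) -> float:
    return 0.5


blended = weighted_sum([always_one, always_half], [0.5, 0.5])(None, None)
blocked = gate(always_half, threshold=0.8)(None, None)
passed = gate(always_one, threshold=0.8)(None, None)
```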

PR 4: Environment Integration

class Environment(Generic[ActT, ObsT, StateT]):
    rubric: Rubric  # Required - must be set in __init__

    def step(self, action: ActT) -> ObsT:
        # ... execute action ...
        reward = self.rubric(action, observation)
        return observation.with_reward(reward)
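
A self-contained toy version of the planned integration might look like the following. It assumes an immutable observation with a `reward` field and uses `dataclasses.replace` in place of the `with_reward` helper shown above; `ToyObs`, `ToyEnv`, and `terminal_rubric` are hypothetical and do not reflect the real `Environment` base class.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable


@dataclass(frozen=True)
class ToyObs:
    """Hypothetical immutable observation carrying its own reward."""
    step_count: int
    done: bool
    reward: float = 0.0


class ToyEnv:
    """Toy environment that ends after `max_steps` and scores each step via a rubric."""

    def __init__(self, rubric: Callable[[Any, ToyObs], float], max_steps: int = 3) -> None:
        self.rubric = rubric  # required, set in __init__ as in the plan above
        self.max_steps = max_steps
        self._steps = 0

    def reset(self) -> ToyObs:
        self._steps = 0
        return ToyObs(step_count=0, done=False)

    def step(self, action: Any) -> ToyObs:
        # Execute the action (here: just advance a counter).
        self._steps += 1
        observation = ToyObs(step_count=self._steps, done=self._steps >= self.max_steps)
        # The rubric is evaluated inside the environment boundary.
        reward = self.rubric(action, observation)
        return replace(observation, reward=reward)


def terminal_rubric(action: Any, observation: ToyObs) -> float:
    # Reward only on the terminal step.
    return 1.0 if observation.done else 0.0


env = ToyEnv(terminal_rubric, max_steps=2)
env.reset()
first_obs = env.step("a")
second_obs = env.step("b")
```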

PR 5: Example Migration

Migrate an existing game environment to use ExponentialDiscountingTrajectoryRubric:

class Connect4Rubric(ExponentialDiscountingTrajectoryRubric):
    def score_trajectory(self, trajectory):
        _, final_obs = trajectory[-1]
        if final_obs.winner == 'agent':
            return 1.0
        elif final_obs.winner == 'opponent':
            return 0.0
        return 0.5  # Draw

PR 6: LLMJudge (future)

class LLMJudge(Rubric):
    def __init__(self, prompt_template: str, endpoint: str):
        ...
    
    def forward(self, action, observation) -> float:
        # Call LLM via MCP for evaluation
        ...

Test Plan

Unit tests for Rubric base class (15 tests)
Unit tests for TrajectoryRubric (23 tests)
Various gamma values (0, 0.5, 0.99, 1.0)
Win/loss/draw outcomes
Edge cases (empty trajectory, single step, 100-step episodes)
State serialization roundtrip
Hook invocation on each step
Reset clears trajectory
Formatting check passes

Usage Example

from openenv.core.rubrics import ExponentialDiscountingTrajectoryRubric

class ChessRubric(ExponentialDiscountingTrajectoryRubric):
    def score_trajectory(self, trajectory):
        # The outcome is read from the final observation's metadata.
        _, final_obs = trajectory[-1]
        outcome = final_obs.metadata.get('winner')
        if outcome == 'agent':
            return 1.0
        if outcome == 'opponent':
            return 0.0
        return 0.5  # Draw

# Usage in environment (`episode` is any iterable of (action, observation) pairs)
rubric = ChessRubric(gamma=0.99)
for action, obs in episode:
    reward = rubric(action, obs)  # 0.0 until obs.done is True
step_rewards = rubric.compute_step_rewards()  # Discounted per-step rewards
rubric.reset()  # Ready for next episode

Depends on: #337

Initial implementation of trajectory-based rubrics for delayed rewards, as specified in RFC 004's "Delayed Rewards" section.

New files:
- src/openenv/core/rubrics/base.py: Rubric base class with nn.Module-like API
- src/openenv/core/rubrics/trajectory.py: TrajectoryRubric and ExponentialDiscountingTrajectoryRubric
- src/openenv/core/rubrics/__init__.py: Package exports

Tests (38 passing):
- tests/core/test_rubrics/test_base_rubric.py: Base Rubric class tests
- tests/core/test_rubrics/test_trajectory_rubric.py: Trajectory rubric tests

See PR description for current status and follow-up plan.

greptile-apps · 2026-01-28T22:05:10Z

Greptile Overview

Greptile Summary

This PR implements the core infrastructure for trajectory-based rubrics as specified in RFC 004's "Delayed Rewards" section. The implementation introduces a Rubric base class and two trajectory rubric classes:

Key Changes:

Rubric base class (base.py): Abstract base with nn.Module-inspired API - implements forward(), child auto-registration, pre/post hooks, and state serialization
TrajectoryRubric (trajectory.py): Abstract base for delayed reward computation - accumulates (action, observation) pairs internally and computes final score when done=True
ExponentialDiscountingTrajectoryRubric: Concrete implementation with standard gamma-based temporal discounting (r_t = gamma^(T-1-t) * R_final)

Design Alignment:

Follows RFC 004 specification exactly
Rewards remain inside environment boundary (server-side only)
No agent exposure - rubrics are internal environment components
Not yet integrated with Environment base class (planned for follow-up PR)
Memory-conscious: trajectories stored in CPU memory only

Test Coverage:

38 tests total across base and trajectory rubrics
Covers edge cases: empty trajectories, single-step episodes, 100-step episodes
Tests various gamma values (0.0, 0.5, 0.99, 1.0)
Validates hooks, state serialization, reset behavior

Status:
This is pure infrastructure - no breaking changes, no environment integration yet. Follow-up PRs will add container rubrics (Sequential, Gate, WeightedSum) and integrate with Environment base class.

Confidence Score: 5/5

This PR is safe to merge - it adds pure infrastructure with no integration or breaking changes
Score reflects that this is well-designed infrastructure code that exactly matches RFC 004 specification, has comprehensive test coverage (38 tests), introduces no breaking changes, and adds no environment integration yet. Code quality is high with proper abstractions, error handling, and documentation.
No files require special attention - all implementations are clean and well-tested

Important Files Changed

| Filename | Overview |
| --- | --- |
| `src/openenv/core/rubrics/base.py` | Implements nn.Module-like base class with forward(), hooks, child registration, and state serialization - well-structured |
| `src/openenv/core/rubrics/trajectory.py` | Trajectory accumulation with TrajectoryRubric base and ExponentialDiscountingTrajectoryRubric implementation - matches RFC 004 spec exactly |
| `tests/core/test_rubrics/test_trajectory_rubric.py` | Extensive tests for trajectory rubrics covering accumulation, discounting, reset behavior, edge cases, and various gamma values (23 tests) |

Sequence Diagram

sequenceDiagram
    participant Env as Environment
    participant TR as TrajectoryRubric
    participant Trajectory as Internal Trajectory Buffer
    
    Note over Env,Trajectory: Episode Start
    Env->>TR: reset()
    TR->>Trajectory: Clear buffer []
    
    Note over Env,Trajectory: Step 1 (not done)
    Env->>TR: __call__(action1, obs1)
    TR->>TR: forward(action1, obs1)
    TR->>Trajectory: Append (action1, obs1)
    TR-->>Env: Return intermediate_reward (0.0)
    
    Note over Env,Trajectory: Step 2 (not done)
    Env->>TR: __call__(action2, obs2)
    TR->>TR: forward(action2, obs2)
    TR->>Trajectory: Append (action2, obs2)
    TR-->>Env: Return intermediate_reward (0.0)
    
    Note over Env,Trajectory: Step 3 (done=True)
    Env->>TR: __call__(action3, obs3_done)
    TR->>TR: forward(action3, obs3_done)
    TR->>Trajectory: Append (action3, obs3_done)
    TR->>TR: score_trajectory(trajectory)
    Note right of TR: Subclass implements<br/>scoring logic
    TR-->>Env: Return final_score (e.g., 1.0)
    
    Note over Env,Trajectory: Post-Episode
    Env->>TR: compute_step_rewards()
    TR->>TR: Apply discounting strategy
    Note right of TR: ExponentialDiscounting:<br/>r_t = gamma^(T-1-t) * R_final
    TR-->>Env: [r_0, r_1, r_2]

meta-cla bot added the CLA Signed label Jan 28, 2026

burtenshaw added the feature and rubrics labels Feb 5, 2026

Darktex wants to merge 1 commit into rfc-004-delayed-rewards from trajectory-rubrics-impl
