
Conversation

@ruiqi-zhong (Contributor)

Adds a name to the NLL evaluator; otherwise we cannot have two different NLL evaluators and log their results properly.



class NLLEvaluator(TrainingClientEvaluator):
    def __init__(self, data: list[tinker.Datum]):
@joschu (Collaborator)

actually wouldn't it be better to make the name default to "test", and remove the prefixing of "test" from the training loop?

@ruiqi-zhong (Contributor, Author)

yeah that's better -- I thought the original code always wanted "test/" as the prefix of the log name, but that's indeed a bit cumbersome.

@claude (bot) commented Nov 23, 2025


Code Review Complete ✓

Todo List:

  • Read changed files to understand the implementation
  • Analyze the changes and design decisions
  • Review joschu's feedback about default naming
  • Provide comprehensive review feedback

Summary

This PR adds a name parameter to the NLL evaluator so that multiple NLL evaluators can log their results without colliding. The implementation is sound and follows the existing codebase patterns well. joschu's review comment raises a design question about where the prefix should be applied, which, as discussed below, the PR already resolves.

Key Changes

  1. tinker_cookbook/supervised/nll_evaluator.py - Added a name parameter defaulting to "test"; metrics are now prefixed as {name}/nll (see the sketch after this list)
  2. tinker_cookbook/rl/metric_util.py - Similar changes to RLTestSetEvaluator, added eval_token_completer method
  3. tinker_cookbook/supervised/train.py - Removed test/ prefix from run_evals (line 123)
  4. tinker_cookbook/rl/train.py - Removed test/ prefix from run_single_evaluation (line 276)
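
For concreteness, a minimal sketch of the evaluator's new shape, reconstructed from the descriptions above. The call signature, the evaluation body, the module path of the base class, and the compute_mean_nll helper are assumptions for illustration, not the actual file contents:

    import tinker
    from tinker_cookbook.evaluators import TrainingClientEvaluator  # assumed module path

    class NLLEvaluator(TrainingClientEvaluator):
        def __init__(self, data: list[tinker.Datum], name: str = "test"):
            self.data = data
            self.name = name  # metric prefix; the default preserves the old "test/nll" key

        @classmethod
        def from_dataset(cls, dataset, name: str = "test"):
            # The factory method forwards name, so factory-built evaluators are namespaced too
            return cls(list(dataset), name=name)

        async def __call__(self, training_client) -> dict[str, float]:
            # compute_mean_nll is a hypothetical stand-in for the real NLL computation
            nll = compute_mean_nll(training_client, self.data)
            return {f"{self.name}/nll": nll}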

Design Discussion: Where Should the Prefix Be Applied?

joschu's suggestion (tinker_cookbook/supervised/nll_evaluator.py:10):

"actually wouldn't it be better to make the name default to "test", and remove the prefixing of "test" from the training loop?"

Current implementation: ✅ Already follows this approach!

  • The evaluator now defaults name="test" and applies the prefix itself: f"{self.name}/nll"
  • The training loops no longer add test/ prefix (removed from both supervised/train.py:123 and rl/train.py:276)

This is the correct design because:

  1. Evaluators own their metric namespacing - Each evaluator knows its name and prefixes its own metrics
  2. Multiple evaluators work correctly - You can now have train_evaluator = NLLEvaluator(train_data, name="train") and test_evaluator = NLLEvaluator(test_data, name="test") without conflicts
  3. Consistency - Both NLLEvaluator and RLTestSetEvaluator follow the same pattern

Code Quality Assessment

Strengths:

  • ✅ Consistent implementation across both supervised and RL evaluators
  • ✅ Good use of default parameters (name: str = "test")
  • ✅ Proper refactoring of metric prefixing logic from training loops to evaluators
  • ✅ Added eval_token_completer method in RLTestSetEvaluator for better separation of concerns

Minor Issues:

  • ℹ️ Import check: tinker_cookbook/rl/metric_util.py:14 adds from tinker_cookbook.completers import TokenCompleter, which is used in the new eval_token_completer method, so the import is correct rather than unused.
  • ℹ️ The conditional check if self.name is not None: was removed in favor of always prefixing, which is fine since name now has a default value (see the before/after sketch below).
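
Roughly, the simplification described in the second point; reconstructed for illustration, not the actual diff:

    # Before (name: str | None = None): prefix only when a name was given
    key = f"{self.name}/{k}" if self.name is not None else k

    # After (name: str = "test"): a name is always present, so always prefix
    key = f"{self.name}/{k}"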

Recommendations

  1. No changes needed - The implementation already follows joschu's suggestion correctly
  2. Consider documenting the pattern - It might be helpful to add a comment in the evaluator base classes explaining that evaluators should prefix their own metrics
  3. Type hint consistency - The change from name: str | None = None to name: str = "test" is good and makes the API clearer

Testing Considerations

To verify this works correctly, consider testing the following (a hedged test sketch appears after this list):

  • Multiple NLL evaluators with different names log separately (e.g., train/nll and val/nll)
  • The default test/nll still appears when no name is provided
  • RL test set evaluator metrics are properly prefixed (e.g., test/env/all/reward/total)
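
A sketch of such a test; run_eval stands in for however the evaluator is actually invoked, and the data fixtures are assumed:

    def test_nll_metric_prefixes():
        train_eval = NLLEvaluator(train_data, name="train")
        val_eval = NLLEvaluator(val_data, name="val")
        default_eval = NLLEvaluator(test_data)  # name defaults to "test"

        metrics: dict[str, float] = {}
        for evaluator in (train_eval, val_eval, default_eval):
            metrics.update(run_eval(evaluator))  # hypothetical runner

        # Each evaluator logs under its own namespace without collisions
        assert set(metrics) == {"train/nll", "val/nll", "test/nll"}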

Conclusion

Approval Status: ✅ LGTM (with clarification that joschu's suggestion is already implemented)

The PR successfully achieves its goal and the design is sound. The evaluators now properly namespace their metrics, enabling multiple evaluators of the same type to coexist with distinct names.


@claude (bot) commented Nov 23, 2025


Code Review Complete ✓

Todo List:

  • Read changed files to understand the implementation
  • Analyze the changes and verify joschu's feedback has been addressed
  • Check for any issues or improvements
  • Provide comprehensive review feedback

Summary

This PR successfully implements the ability to have multiple NLL evaluators with distinct names for proper metric logging. The implementation correctly follows joschu's suggestion to default the name to "test" and have evaluators prefix their own metrics, removing the prefix logic from training loops.

Key Changes Review

1. tinker_cookbook/supervised/nll_evaluator.py

  • Lines 10-11: Added name: str = "test" parameter with good default
  • Lines 19-20: Metrics now prefixed as f"{self.name}/nll"
  • Lines 24-26: from_dataset classmethod properly forwards the name parameter

Assessment: Clean implementation. The evaluator now owns its metric namespace.

2. tinker_cookbook/rl/metric_util.py

  • Line 14: Added from tinker_cookbook.completers import TokenCompleter import (needed for new method)
  • Line 111: Changed from name: str | None = None to name: str = "test"
  • Lines 119-136: Refactored to extract eval_token_completer method, improving separation of concerns
  • Line 131: Now always applies prefix f"{self.name}/{k}" (no conditional check needed)

Assessment: Excellent refactoring. The new eval_token_completer method accepting a TokenCompleter interface is a nice abstraction that makes testing easier.
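
A sketch of the refactor's shape, assuming the method and interface names quoted above; the rollout helper and constructor fields are illustrative, not the actual code:

    from tinker_cookbook.completers import TokenCompleter

    class RLTestSetEvaluator:
        def __init__(self, dataset, max_tokens: int, name: str = "test"):
            self.dataset = dataset
            self.max_tokens = max_tokens
            self.name = name

        async def eval_token_completer(self, completer: TokenCompleter) -> dict[str, float]:
            # Evaluate against any TokenCompleter (not just a full sampling client),
            # which is what makes this extraction easier to test.
            raw = await run_test_set_rollouts(completer, self.dataset, self.max_tokens)  # hypothetical
            return {f"{self.name}/{k}": v for k, v in raw.items()}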

3. tinker_cookbook/supervised/train.py

  • Line 123: Removed test/ prefix wrapping - evaluators now handle their own prefixes

Assessment: Correct. Training loop no longer needs to know about metric namespacing.

4. tinker_cookbook/rl/train.py

  • Line 276: Removed test/ prefix wrapping - consistent with supervised changes

Assessment: Correct and consistent with the supervised training loop.
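
Illustratively, the change to both training loops has this shape; the variable names are assumptions, not the actual code:

    # Before: the loop re-prefixed whatever the evaluators returned
    metrics.update({f"test/{k}": v for k, v in eval_metrics.items()})

    # After: evaluators return already-prefixed keys, so the loop just merges them
    metrics.update(eval_metrics)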


Design Analysis

joschu's suggestion (addressed in review thread):

"actually wouldn't it be better to make the name default to "test", and remove the prefixing of "test" from the training loop?"

Status: ✅ Fully implemented

This is the correct architectural choice because:

  1. Single Responsibility: Evaluators own their metric namespacing
  2. Flexibility: Users can now create multiple evaluators with different names:
    train_evaluator = NLLEvaluator(train_data, name="train")
    val_evaluator = NLLEvaluator(val_data, name="val")
    test_evaluator = NLLEvaluator(test_data, name="test")
  3. Consistency: Both NLLEvaluator and RLTestSetEvaluator follow the same pattern
  4. No Breaking Changes: The default name="test" preserves existing behavior

Code Quality

Strengths:

  • ✅ Consistent implementation across both supervised and RL evaluators
  • ✅ Good use of default parameters maintains backward compatibility
  • ✅ Proper parameter forwarding in factory methods (from_dataset)
  • ✅ The eval_token_completer extraction in RLTestSetEvaluator improves testability

No Issues Found:

  • Type hints are correct and consistent
  • The TokenCompleter import is properly used
  • No unused variables or dead code
  • Backward compatibility maintained through sensible defaults

Testing Recommendations

While I cannot run tests, consider verifying:

  1. Multiple evaluators work correctly:

    evaluators = [
        NLLEvaluator.from_dataset(train_data, name="train"),
        NLLEvaluator.from_dataset(val_data, name="val"),
        NLLEvaluator.from_dataset(test_data, name="test"),
    ]

    Should produce distinct metrics: train/nll, val/nll, test/nll

  2. Default behavior preserved:

    evaluator = NLLEvaluator(data)  # Should still log as "test/nll"

  3. RL evaluator metrics properly namespaced:

    evaluator = RLTestSetEvaluator(dataset, max_tokens=100, name="validation")
    # Metrics should be: validation/env/all/reward/total, etc.

Final Verdict

✅ LGTM - Approved

This PR cleanly solves the stated problem ("we cannot have two different NLL evaluators and log their results properly") with a well-architected solution. The implementation follows joschu's architectural guidance, and the code quality is high.

The change is backward compatible, improves flexibility, and follows the Single Responsibility Principle by having evaluators manage their own metric namespacing.


@ruiqi-zhong merged commit 64d4469 into main on Nov 23, 2025
3 checks passed