
Conversation

@ruiqi-zhong (Contributor)

Adds a name to the NLL evaluator; otherwise we cannot have two different NLL evaluators and log their results properly.



class NLLEvaluator(TrainingClientEvaluator):
    def __init__(self, data: list[tinker.Datum]):
@joschu (Collaborator)

actually wouldn't it be better to make the name default to "test", and remove the prefixing of "test" from the training loop?

@ruiqi-zhong (Contributor, Author)

yeah that's better -- I thought the original code always wanted "test/" as the prefix of the log name, but that's indeed a bit cumbersome.

@claude (bot) commented Nov 23, 2025


Code Review Complete ✓

Todo List:

  • Read changed files to understand the implementation
  • Analyze the changes and design decisions
  • Review joschu's feedback about default naming
  • Provide comprehensive review feedback

Summary

This PR adds a name parameter to the NLL evaluator so that multiple NLL evaluators can log their results without colliding. The implementation is sound and follows the existing codebase patterns well. joschu's review comment raises a design question about where the prefix should be applied, which, as discussed below, the PR already resolves.

Key Changes

  1. tinker_cookbook/supervised/nll_evaluator.py - Added a name parameter defaulting to "test"; metrics are now prefixed as {name}/nll (see the sketch after this list)
  2. tinker_cookbook/rl/metric_util.py - Similar changes to RLTestSetEvaluator, added eval_token_completer method
  3. tinker_cookbook/supervised/train.py - Removed test/ prefix from run_evals (line 123)
  4. tinker_cookbook/rl/train.py - Removed test/ prefix from run_single_evaluation (line 276)
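
For concreteness, a minimal sketch of the evaluator's new shape, reconstructed from the descriptions above. The call signature, the evaluation body, the module path of the base class, and the compute_mean_nll helper are assumptions for illustration, not the actual file contents:

    import tinker
    from tinker_cookbook.evaluators import TrainingClientEvaluator  # assumed module path

    class NLLEvaluator(TrainingClientEvaluator):
        def __init__(self, data: list[tinker.Datum], name: str = "test"):
            self.data = data
            self.name = name  # metric prefix; the default preserves the old "test/nll" key

        @classmethod
        def from_dataset(cls, dataset, name: str = "test"):
            # The factory method forwards name, so factory-built evaluators are namespaced too
            return cls(list(dataset), name=name)

        async def __call__(self, training_client) -> dict[str, float]:
            # compute_mean_nll is a hypothetical stand-in for the real NLL computation
            nll = compute_mean_nll(training_client, self.data)
            return {f"{self.name}/nll": nll}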

Design Discussion: Where Should the Prefix Be Applied?

joschu's suggestion (tinker_cookbook/supervised/nll_evaluator.py:10):

"actually wouldn't it be better to make the name default to "test", and remove the prefixing of "test" from the training loop?"

Current implementation: ✅ Already follows this approach!

  • The evaluator now defaults name="test" and applies the prefix itself: f"{self.name}/nll"
  • The training loops no longer add test/ prefix (removed from both supervised/train.py:123 and rl/train.py:276)

This is the correct design because:

  1. Evaluators own their metric namespacing - Each evaluator knows its name and prefixes its own metrics
  2. Multiple evaluators work correctly - You can now have train_evaluator = NLLEvaluator(train_data, name="train") and test_evaluator = NLLEvaluator(test_data, name="test") without conflicts
  3. Consistency - Both NLLEvaluator and RLTestSetEvaluator follow the same pattern

Code Quality Assessment

Strengths:

  • ✅ Consistent implementation across both supervised and RL evaluators
  • ✅ Good use of default parameters (name: str = "test")
  • ✅ Proper refactoring of metric prefixing logic from training loops to evaluators
  • ✅ Added eval_token_completer method in RLTestSetEvaluator for better separation of concerns

Minor Issues:

  • ℹ️ Import check: tinker_cookbook/rl/metric_util.py:14 adds from tinker_cookbook.completers import TokenCompleter, which is used in the new eval_token_completer method, so the import is correct rather than unused.
  • ℹ️ The conditional check if self.name is not None: was removed in favor of always prefixing, which is fine since name now has a default value (see the before/after sketch below).
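
Roughly, the simplification described in the second point; reconstructed for illustration, not the actual diff:

    # Before (name: str | None = None): prefix only when a name was given
    key = f"{self.name}/{k}" if self.name is not None else k

    # After (name: str = "test"): a name is always present, so always prefix
    key = f"{self.name}/{k}"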

Recommendations

  1. No changes needed - The implementation already follows joschu's suggestion correctly
  2. Consider documenting the pattern - It might be helpful to add a comment in the evaluator base classes explaining that evaluators should prefix their own metrics
  3. Type hint consistency - The change from name: str | None = None to name: str = "test" is good and makes the API clearer

Testing Considerations

To verify this works correctly, consider testing the following (a hedged test sketch appears after this list):

  • Multiple NLL evaluators with different names log separately (e.g., train/nll and val/nll)
  • The default test/nll still appears when no name is provided
  • RL test set evaluator metrics are properly prefixed (e.g., test/env/all/reward/total)
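
A sketch of such a test; run_eval stands in for however the evaluator is actually invoked, and the data fixtures are assumed:

    def test_nll_metric_prefixes():
        train_eval = NLLEvaluator(train_data, name="train")
        val_eval = NLLEvaluator(val_data, name="val")
        default_eval = NLLEvaluator(test_data)  # name defaults to "test"

        metrics: dict[str, float] = {}
        for evaluator in (train_eval, val_eval, default_eval):
            metrics.update(run_eval(evaluator))  # hypothetical runner

        # Each evaluator logs under its own namespace without collisions
        assert set(metrics) == {"train/nll", "val/nll", "test/nll"}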

Conclusion

Approval Status: ✅ LGTM (with clarification that joschu's suggestion is already implemented)

The PR successfully achieves its goal and the design is sound. The evaluators now properly namespace their metrics, enabling multiple evaluators of the same type to coexist with distinct names.


@claude (bot) commented Nov 23, 2025


Code Review Complete ✓

Todo List:

  • Read changed files to understand the implementation
  • Analyze the changes and verify joschu's feedback has been addressed
  • Check for any issues or improvements
  • Provide comprehensive review feedback

Summary

This PR successfully implements the ability to have multiple NLL evaluators with distinct names for proper metric logging. The implementation correctly follows joschu's suggestion to default the name to "test" and have evaluators prefix their own metrics, removing the prefix logic from training loops.

Key Changes Review

1. tinker_cookbook/supervised/nll_evaluator.py

  • Lines 10-11: Added name: str = "test" parameter with good default
  • Lines 19-20: Metrics now prefixed as f"{self.name}/nll"
  • Lines 24-26: from_dataset classmethod properly forwards the name parameter

Assessment: Clean implementation. The evaluator now owns its metric namespace.

2. tinker_cookbook/rl/metric_util.py

  • Line 14: Added from tinker_cookbook.completers import TokenCompleter import (needed for new method)
  • Line 111: Changed from name: str | None = None to name: str = "test"
  • Lines 119-136: Refactored to extract eval_token_completer method, improving separation of concerns
  • Line 131: Now always applies prefix f"{self.name}/{k}" (no conditional check needed)

Assessment: Excellent refactoring. The new eval_token_completer method accepting a TokenCompleter interface is a nice abstraction that makes testing easier.
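
A sketch of the refactor's shape, assuming the method and interface names quoted above; the rollout helper and constructor fields are illustrative, not the actual code:

    from tinker_cookbook.completers import TokenCompleter

    class RLTestSetEvaluator:
        def __init__(self, dataset, max_tokens: int, name: str = "test"):
            self.dataset = dataset
            self.max_tokens = max_tokens
            self.name = name

        async def eval_token_completer(self, completer: TokenCompleter) -> dict[str, float]:
            # Evaluate against any TokenCompleter (not just a full sampling client),
            # which is what makes this extraction easier to test.
            raw = await run_test_set_rollouts(completer, self.dataset, self.max_tokens)  # hypothetical
            return {f"{self.name}/{k}": v for k, v in raw.items()}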

3. tinker_cookbook/supervised/train.py

  • Line 123: Removed test/ prefix wrapping - evaluators now handle their own prefixes

Assessment: Correct. Training loop no longer needs to know about metric namespacing.

4. tinker_cookbook/rl/train.py

  • Line 276: Removed test/ prefix wrapping - consistent with supervised changes

Assessment: Correct and consistent with the supervised training loop.
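
Illustratively, the change to both training loops has this shape; the variable names are assumptions, not the actual code:

    # Before: the loop re-prefixed whatever the evaluators returned
    metrics.update({f"test/{k}": v for k, v in eval_metrics.items()})

    # After: evaluators return already-prefixed keys, so the loop just merges them
    metrics.update(eval_metrics)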


Design Analysis

joschu's suggestion (addressed in review thread):

"actually wouldn't it be better to make the name default to "test", and remove the prefixing of "test" from the training loop?"

Status: ✅ Fully implemented

This is the correct architectural choice because:

  1. Single Responsibility: Evaluators own their metric namespacing
  2. Flexibility: Users can now create multiple evaluators with different names:
    train_evaluator = NLLEvaluator(train_data, name="train")
    val_evaluator = NLLEvaluator(val_data, name="val")
    test_evaluator = NLLEvaluator(test_data, name="test")
  3. Consistency: Both NLLEvaluator and RLTestSetEvaluator follow the same pattern
  4. No Breaking Changes: The default name="test" preserves existing behavior

Code Quality

Strengths:

  • ✅ Consistent implementation across both supervised and RL evaluators
  • ✅ Good use of default parameters maintains backward compatibility
  • ✅ Proper parameter forwarding in factory methods (from_dataset)
  • ✅ The eval_token_completer extraction in RLTestSetEvaluator improves testability

No Issues Found:

  • Type hints are correct and consistent
  • The TokenCompleter import is properly used
  • No unused variables or dead code
  • Backward compatibility maintained through sensible defaults

Testing Recommendations

While I cannot run tests, consider verifying:

  1. Multiple evaluators work correctly:

    evaluators = [
        NLLEvaluator.from_dataset(train_data, name="train"),
        NLLEvaluator.from_dataset(val_data, name="val"),
        NLLEvaluator.from_dataset(test_data, name="test"),
    ]

    Should produce distinct metrics: train/nll, val/nll, test/nll

  2. Default behavior preserved:

    evaluator = NLLEvaluator(data)  # Should still log as "test/nll"

  3. RL evaluator metrics properly namespaced:

    evaluator = RLTestSetEvaluator(dataset, max_tokens=100, name="validation")
    # Metrics should be: validation/env/all/reward/total, etc.

Final Verdict

✅ LGTM - Approved

This PR cleanly solves the stated problem ("we cannot have two different NLL evaluators and log their results properly") with a well-architected solution. The implementation follows joschu's architectural guidance, and the code quality is high.

The change is backward compatible, improves flexibility, and follows the Single Responsibility Principle by having evaluators manage their own metric namespacing.


@ruiqi-zhong merged commit 64d4469 into main on Nov 23, 2025
3 checks passed