refactor(graders): improve parameter validation and streaming support #62
base: main
Conversation
- Add threshold validation in common graders (correctness, harmfulness, etc.)
- Fix streaming response handling in the text_to_image grader
- Preserve existing metadata in grader score returns
- Update the template default pattern (accept `None`, fall back via `or`)
- Update the model name in examples from qwen3-max to qwen3-32b
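The first four items are easiest to see in code. The sketch below is illustrative only: `CorrectnessGrader`, `DEFAULT_CORRECTNESS_TEMPLATE`, and the `finalize` method are hypothetical stand-ins for OpenJudge's actual classes, which this excerpt does not show.

```python
# Hypothetical sketch of the grader changes described above; class and
# attribute names are illustrative, not OpenJudge's actual API.

DEFAULT_CORRECTNESS_TEMPLATE = "Rate the correctness of: {response}"


class CorrectnessGrader:
    def __init__(self, threshold: float = 0.5, template: str | None = None):
        # Threshold validation: reject out-of-range values early instead of
        # letting a bad threshold silently skew pass/fail decisions.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"threshold must be in [0, 1], got {threshold}")
        self.threshold = threshold
        # Template default pattern: accept None and fall back via `or`,
        # so callers may pass template=None explicitly.
        self.template = template or DEFAULT_CORRECTNESS_TEMPLATE

    def finalize(self, score: float, metadata: dict | None = None) -> dict:
        # Metadata preservation: merge the threshold into any existing
        # metadata rather than replacing the whole dict.
        metadata = dict(metadata or {})
        metadata["threshold"] = self.threshold
        return {"score": score, "metadata": metadata}
```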
Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refactors the grading system by improving parameter validation, particularly for threshold values, and enhancing the handling of streaming responses in multimodal graders. It also standardizes metadata preservation, updates template default behaviors, and ensures consistency in documentation and examples. These changes contribute to a more robust, user-friendly, and maintainable grading framework.
Code Review
This pull request introduces several valuable improvements across the graders. The addition of threshold validation in common graders like `CorrectnessGrader` and `HarmfulnessGrader` enhances robustness. The pattern of using `template or DEFAULT_TEMPLATE` is a clean way to handle default prompt templates. Preserving existing metadata when adding the threshold is a good fix. The switch from `logger.error` to `logger.exception` will provide more context for debugging. The fixes in the `text_to_image` grader for streaming support and in `criteria_utils` are also solid. Overall, this is a great set of changes that improves code quality and consistency.
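The logger change the review mentions is a one-line swap. Here is a generic, self-contained example rather than the PR's actual diff; `run_grader` is a hypothetical stand-in:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_grader():
    raise RuntimeError("model call failed")  # stand-in failure


try:
    run_grader()
except Exception:
    # Unlike logger.error, logger.exception also records the traceback,
    # giving the extra debugging context the review refers to.
    logger.exception("Grader evaluation failed")
```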
- Extract common score/reason parsing logic
- Remove unused collected_content variable

- Move error handling to aevaluate level for consistent GraderError returns
- Simplify streaming response handling across multimodal graders
- Refactor image_coherence, image_helpfulness, text_to_image graders

…dling
- Add parse_structured_chat_response utility for streaming/non-streaming responses
- Return GraderError instead of score=0 on exceptions in multimodal graders
- Update tests to verify GraderError behavior
- Move exception handling to aevaluate level for cleaner code
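A plausible shape for such a utility, hedged: the real `parse_structured_chat_response` is not shown in this excerpt, so the chunk joining, the JSON schema, and the `GraderError` fallback below are assumptions about its design.

```python
# Hypothetical sketch of a parse utility that accepts both streaming and
# non-streaming chat responses; names mirror the commit message, but the
# actual implementation is not shown in this PR excerpt.
import json
from typing import Iterable, Union


class GraderError(Exception):
    """Stand-in for OpenJudge's grader error type."""


def parse_structured_chat_response(
    response: Union[str, Iterable[str]],
) -> dict:
    # Streaming responses arrive as an iterable of chunks; join them into
    # one string so both code paths parse identically.
    if not isinstance(response, str):
        response = "".join(response)
    try:
        data = json.loads(response)
        return {"score": data["score"], "reason": data.get("reason", "")}
    except (json.JSONDecodeError, KeyError) as exc:
        # Raise a GraderError instead of silently returning score=0,
        # matching the behavior change described in the commit message.
        raise GraderError(f"Unparseable grader response: {exc}") from exc
```

A streaming caller would pass the chunk iterator directly, e.g. `parse_structured_chat_response(chunk for chunk in stream)`, while a non-streaming caller passes the full string.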
OpenJudge Version
[The version of OpenJudge you are working on, e.g. `import openjudge; print(openjudge.__version__)`]

Description
[Please describe the background, purpose, changes made, and how to test this PR]
Checklist
Please check the following items before code is ready to be reviewed.
`pre-commit run --all-files` command