
Commit b7b4b1d

Authored by Luodian, kcz358, pbcong, claude, and cursoragent
[Feat] LMMS-Eval 0.4 (#721)
* Update task utils and logger
* [Main Update] Doc to messages feature support and Split simple and chat mode (#692)
* Update deps
* Restructured
* Delete models
* Remove deprecated models
* Set up auto doc to messages and chat models
* Lint
* Allow force simple mode
* Add auto doc to messages for audio and video; fix lint; init server structure; restructure to server folder; clean base and providers; add clean method for models; fix loggers save result; fix dummy server error; suppress llava warnings; sample evaluator on llava in the wild; update mmmu doc to messages; update version
* Add judge server implementation with various providers and evaluation protocols; add AsyncAzureOpenAIProvider implementation and update provider factory; refactor sample saving in EvaluationTracker to use cleaned data and improve logging; add llm_as_judge_eval metric to multiple tasks and integrate llm_judge API for evaluation
* Refactor MathVerseEvaluator to utilize llm_judge server for response generation and evaluation, enhancing API integration and error handling. Update MMBench_Evaluator to streamline API key handling based on environment variables.
* Refactor EvaluationTracker to directly modify sample data for improved clarity and efficiency. Update MathVerseEvaluator to streamline answer scoring by eliminating unnecessary extraction steps and enhance evaluation prompts. Remove deprecated metrics from configuration files.
* Refactor MathVistaEvaluator to integrate llm_judge server for enhanced response generation and evaluation. Streamline API configuration and error handling by removing direct API key management and utilizing a custom server configuration for requests.
* Update MathVista task configurations to replace 'gpt_eval_score' with 'llm_as_judge_eval' across multiple YAML files and adjust the result processing function accordingly. This change aligns with the integration of the llm_judge server for enhanced evaluation metrics.
* Add new OlympiadBench task configurations for mathematics and physics evaluation. Introduce 'olympiadbench_OE_MM_maths_en_COMP.yaml' and 'olympiadbench_OE_MM_physics_en_COMP.yaml' files, while removing outdated English and Chinese test configurations. Update evaluation metrics to utilize 'llm_as_judge_eval' for consistency across tasks.
* Add reasoning model utility functions and integrate into Qwen2_5_VL model. Introduced `parse_reasoning_model_answer` to clean model responses and updated answer processing in the Qwen2_5_VL class to utilize this new function, enhancing response clarity and logging.
* Update OlympiadBench task configuration to change 'doc_to_target' from 'answer' to 'final_answer' for improved clarity in response generation.
* Refactor olympiadbench_process_results to enhance response clarity. Updated the return format to include question, response, and ground truth for improved evaluation context. Simplified judge result determination logic.
* Update olympiadbench_OE_MM_physics_en_COMP.yaml to change 'doc_to_target' from 'answer' to 'final_answer' for improved clarity in response generation.
* Update olympiadbench_OE_MM_physics_en_COMP.yaml to change 'doc_to_target' from 'answer' to 'final_answer' for consistency with recent configuration updates and improved clarity in response generation.
* Add launcher and sglang launcher for local llm as judge
* Lint
* add new tasks MMVU and Visual Web Bench (#727): add mmvu task; fix linting videomathqa; fix mmvu to use llm judge; add visualwebbench task
* Add Qwen2_5 chat to support doc_to_messages
* Refactor documentation and codebase to standardize naming conventions from 'lm_eval' to 'lmms_eval'. Update task configurations and evaluation metrics accordingly for consistency across the project.
* Update model guide and task configurations to replace 'max_gen_toks' with 'max_new_tokens' for consistency across YAML files and documentation. This change aligns with recent updates in the generation parameters for improved clarity in model behavior.
* Refactor evaluation logic to ensure distributed execution only occurs when multiple processes are active. Update metrics handling in OpenAI Math task to correctly track exact matches and coverage based on results.
* Fix text auto messages
* Update docs
* Add vllm chat models
* Add openai compatible
* Add sglang runtime
* Fix errors
* Fix sglang error
* Add Claude Code Action workflow configuration
* Refactor VLLM model initialization and update generation parameters across tasks. Change model version to a more generic name and adjust sampling settings to enable sampling and increase max new tokens for better performance.
* Update max_new_tokens in Huggingface model and enhance metrics handling in OpenAI math task. Remove breakpoint in VLLM model initialization.
* Allow logging task input
* Add development guidelines document outlining core rules, coding best practices, and error resolution strategies for the codebase.
* fix repr and group
* Add call tools for async openai with mcp client
* Add examples
* Support multi-node eval
* Fix grouping func
* Feature/inference throughput logging (#747): add inference throughput logging to chat models. Implements TPOT (Time Per Output Token) and inference speed metrics (see the worked sketch after this commit message):
  - TPOT = (e2e_latency - TTFT) / (num_output_tokens - 1)
  - Inference Speed = 1 / TPOT tokens/second
  Modified chat models:
  - openai_compatible.py: API call timing with token counting
  - vllm.py: batch-level timing with per-request metrics
  - sglang.py: timing with meta_info extraction
  - huggingface.py: batch processing with token calculation
  - llava_hf.py: single-request timing with error handling
  - qwen2_5_vl.py: batch timing implementation
  Features: precise timing around model.generate() calls, TTFT estimation when not available from the model, comprehensive logging with formatted metrics, batch processing support, and error handling for robustness.
* Add throughput metrics documentation and update logging in chat models
* Add gen metric utils
* Revise qwen logging
* Revise llava_hf logging
* Revise hf model logging
* Revise sglang logging
* Support vllm logging
* Add open logging
* Refactor evaluation process to utilize llm_judge API: updated internal evaluation scripts for D170, DC100, and DC200 tasks to replace GPT evaluation with llm_judge evaluation; introduced custom prompts for binary evaluation based on model responses and ground truth; modified YAML configuration files to reflect changes in the evaluation metrics and aggregation methods; enhanced error handling and logging for evaluation failures. This change aims to improve the accuracy and reliability of model evaluations across different tasks.
* Dev/olympiad bench (#762):
  - Refactor vLLM model files and add OlympiadBench evaluation utilities: cleaned up imports and removed unused variables in `vllm.py`; updated threading configuration in `simple/vllm.py` to use environment variables; introduced new utility functions for processing OlympiadBench documents and results in `utils.py`, `zh_utils.py`, and `en_utils.py`; added evaluation logic for OlympiadBench tasks in `olympiadbench_evals.py`; created multiple YAML configuration files for various OlympiadBench tasks, including math and physics in both English and Chinese; implemented aggregation functions for results processing in the OlympiadBench context.
  - Implement OlympiadBench evaluation utilities and refactor math verification: introduced new utility functions for processing OlympiadBench documents and results in `en_utils.py` and `zh_utils.py`; added a custom timeout decorator in `math_verify_utils.py` to replace the previous signal-based timeout; removed outdated files from the `olympiadbench_official` directory to streamline the codebase; enhanced evaluation logic in `olympiadbench_evals.py` and added aggregation functions for results processing.
  - Update mathvision utility imports and modify YAML configurations for OlympiadBench: added error handling for importing evaluation utilities in `utils.py` to improve robustness; changed `doc_to_target` from "answer" to "final_answer" in both `olympiadbench_all_boxed.yaml` and `olympiadbench_boxed.yaml` to ensure consistency in output naming.
* Update YAML configurations for AIME tasks: changed the `do_sample` parameter to `true` in `aime24_figures_agg64.yaml` to enable sampling during generation; added new configuration file `aime25_nofigures_agg64.yaml` for a new task, including detailed metrics and filtering options for evaluation. These updates enhance the flexibility and functionality of the AIME evaluation tasks.
* Refactor internal evaluation scripts for consistency and readability: removed unnecessary blank lines in `d170_cn_utils.py`, `d170_en_utils.py`, `dc100_en_utils.py`, and `dc200_cn_utils.py`; streamlined the `evaluate_binary` API call formatting for better readability. These changes enhance the maintainability of the evaluation scripts across different tasks.
* Update documentation for `lmms_eval` to enhance clarity and usability: revised the command-line interface section in `commands.md` and updated links to the main README; enhanced `current_tasks.md` with clearer instructions for listing supported tasks and their question counts; added comprehensive model examples in `model_guide.md` for image, video, and audio models, including implementation details and key notes; expanded `README.md` to provide an overview of the framework's capabilities and updated the table of contents; included new audio model examples in `run_examples.md`; introduced an audio task example in `task_guide.md` to guide users in configuring audio tasks effectively.
* Introduce LMMS-Eval v0.4: major update with unified message interface, multi-node distributed evaluation, and enhanced judge API
* Add agg8 task and fix data path
* Fix warning
* Remove bug report documentation from the codebase, consolidating information on identified bugs and fixes for improved clarity and maintainability.
* Add announcement for the release of `lmms-eval` v0.4.0 in README.md
* Enhance documentation for LMMS-Eval v0.4 with detailed installation instructions, system requirements, and troubleshooting tips.
* Remove outdated system requirements and installation instructions from LMMS-Eval v0.4 documentation to streamline content and improve clarity.
* Fix datetime format string in olympiadbench submission file naming
* Fix video frame handling in protocol with range() for consistent iteration
* Convert vLLM environment variables to integers for proper type handling
* Fix force_simple model selection to check model availability
* Fix format issue and add avg@8 for aime
* Allow vllm for tp
* fix parsing logic
* Fix OpenAI payload max tokens parameter to use max_new_tokens
* Update OpenAI payload handling to include support for model version o4 and remove max_tokens parameter
* Refactor model version handling across evaluation tasks by removing hardcoded GPT model names and replacing them with environment variable support for dynamic model versioning. Update server configuration to utilize the unified judge API for improved response handling.
* batch update misused calls for eval model
* Update evaluation tasks to use environment variables for GPT model versioning, replacing hardcoded values with dynamic configuration. Remove unused YAML loading logic in multilingual LLAVA benchmark utilities.
* Enhance VLLM configuration to support distributed execution for multiple processes. Update multilingual LLAVA benchmark YAML files to include dataset names and remove deprecated config entries.
* Remove reviewer guideline and co-authored-by mention from contribution instructions in claude.md
* Add development guidelines document outlining core rules, coding best practices, and error resolution strategies for the codebase.
* Refactor score parsing logic in multiple utility files to strip whitespace from the score string before processing.
* Update .gitignore to include new workspace directory and modify utility files to enhance response handling by replacing Request object usage with direct server method calls for text generation across multiple evaluation tasks.
* Refactor evaluation tasks to utilize the unified judge API by replacing direct server method calls with Request object usage. Update server configuration in multiple utility files to enhance response handling and streamline evaluation processes.
* Refactor generation parameter handling in Llava_OneVision model to streamline configuration. Remove redundant default settings and ensure proper handling of sampling parameters based on the do_sample flag. Update multiple YAML task files to increase max_new_tokens and comment out temperature settings for clarity. Introduce new YAML configuration for MMMU validation reasoning task.
* Enhance score processing logic in utility functions to improve error handling and validation. Implement robust regex patterns for score extraction, ensuring all components are accounted for and scores are clamped within valid ranges. Add logging for better traceability of errors and fallback mechanisms for invalid inputs in the mia_bench evaluation process.
* Fix launch error when num proc = 1
* Refactor VLLM model parameter handling to simplify distributed execution logic. Remove redundant checks for tensor parallelism and streamline generation parameter settings by eliminating unused temperature and top_p configurations.
* Refactor VLLM message handling to prioritize image URLs before text content. Remove unused distributed executor backend parameter for cleaner execution logic.
* feat(vllm): Set default max_new_tokens to 4096, temperature to 0, and top_p to 0.95
* docs: Update lmms-eval-0.4 documentation with images and installation instructions
* docs: Update lmms-eval-0.4 documentation to include backward compatibility check
* refactor: Simplify server config instantiation in utils files
* docs: Update supported tasks count in README
* Update docs
* Fix mathverse bugs
* docs: Update images in lmms-eval-0.4.md
* docs: Remove API Benefits and Upcoming Benchmarks sections
* docs: Update image URL for Unified Message Interface
* docs: Fix typos in the LMMS-Eval v0.4 performance comparison table: corrected "27.8/16.40" to "27.8/26.40" and "16.78/13.82" to "16.78/15.82".
* fix(docs): Correct typo in LMMS-Eval v0.4 performance comparison table
* refactor(docs): Refactor LMMS-Eval v0.4 performance table for clarity
* Update docs

Co-authored-by: kcz358 <[email protected]>
Co-authored-by: Cong <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Cursor Agent <[email protected]>
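The throughput-logging entry above (#747) defines TPOT and inference speed with two simple formulas. A minimal worked sketch of that arithmetic follows; the function name and return format are illustrative only and do not correspond to the actual gen metric utilities added in this commit.

```python
def throughput_metrics(e2e_latency: float, ttft: float, num_output_tokens: int) -> dict:
    """Compute TPOT and inference speed following the formulas in the commit message."""
    if num_output_tokens <= 1:
        # With at most one output token there is no inter-token interval to measure.
        return {"tpot": float("nan"), "tokens_per_second": float("nan")}
    tpot = (e2e_latency - ttft) / (num_output_tokens - 1)  # seconds per output token
    return {"tpot": tpot, "tokens_per_second": 1.0 / tpot}


# Example: a 4.2 s end-to-end call with a 0.6 s time-to-first-token and 181 output
# tokens gives TPOT = (4.2 - 0.6) / 180 = 0.02 s, i.e. 50 tokens/second.
print(throughput_metrics(4.2, 0.6, 181))
```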
1 parent 7fd8553 commit b7b4b1d

File tree: 278 files changed (+9033 lines, -5025 lines)


.github/workflows/claude.yml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+- name: Claude Code Action Official
+  uses: anthropics/claude-code-action@beta

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -47,4 +47,6 @@ scripts/
 .venv
 outputs/
 span.log
-uv.lock
+uv.lock
+workspace/*
+.claude/*

CLAUDE.md

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
# Development Guidelines

This document contains critical information about working with this codebase. Follow these guidelines precisely.

## Core Development Rules

1. Package Management
   - ONLY use uv, NEVER pip
   - Installation: `uv add package`
   - Running tools: `uv run tool`
   - Upgrading: `uv add --dev package --upgrade-package package`
   - FORBIDDEN: `uv pip install`, `@latest` syntax

2. Code Quality
   - Type hints required for all code
   - Public APIs must have docstrings
   - Functions must be focused and small
   - Follow existing patterns exactly
   - Line length: 88 chars maximum

3. Testing Requirements
   - Framework: `uv run pytest`
   - Async testing: use anyio, not asyncio
   - Coverage: test edge cases and errors
   - New features require tests
   - Bug fixes require regression tests

4. Code Style
   - PEP 8 naming (snake_case for functions/variables)
   - Class names in PascalCase
   - Constants in UPPER_SNAKE_CASE
   - Document with docstrings
   - Use f-strings for formatting

- For commits fixing bugs or adding features based on user reports add:
  ```bash
  git commit --trailer "Reported-by:<name>"
  ```
  Where `<name>` is the name of the user.

- For commits related to a Github issue, add
  ```bash
  git commit --trailer "Github-Issue:#<number>"
  ```

- NEVER ever mention a `co-authored-by` or similar aspects. In particular, never mention the tool used to create the commit message or PR.
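As a concrete illustration of rules 2 and 3 above, a fully typed helper with a docstring plus an anyio-based regression test might look like the sketch below. The helper and test are hypothetical examples, not code from this repository; the test assumes anyio (and its pytest plugin) is installed, as the testing rules require.

```python
import anyio
import pytest


def clamp_score(raw: float, low: float = 0.0, high: float = 10.0) -> float:
    """Clamp a judge score into the inclusive range [low, high]."""
    return max(low, min(high, raw))


@pytest.mark.anyio
async def test_clamp_score_handles_out_of_range_values() -> None:
    # Edge cases: below the range, above the range, and exactly on a bound.
    await anyio.sleep(0)  # trivial await so the async path is exercised
    assert clamp_score(-3.0) == 0.0
    assert clamp_score(42.0) == 10.0
    assert clamp_score(10.0) == 10.0
```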
## Development Philosophy

- **Simplicity**: Write simple, straightforward code
- **Readability**: Make code easy to understand
- **Performance**: Consider performance without sacrificing readability
- **Maintainability**: Write code that's easy to update
- **Testability**: Ensure code is testable
- **Reusability**: Create reusable components and functions
- **Less Code = Less Debt**: Minimize code footprint

## Coding Best Practices

- **Early Returns**: Use to avoid nested conditions
- **Descriptive Names**: Use clear variable/function names (prefix handlers with "handle")
- **Constants Over Functions**: Use constants where possible
- **DRY Code**: Don't repeat yourself
- **Functional Style**: Prefer functional, immutable approaches when not verbose
- **Minimal Changes**: Only modify code related to the task at hand
- **Function Ordering**: Define composing functions before their components
- **TODO Comments**: Mark issues in existing code with "TODO:" prefix
- **Simplicity**: Prioritize simplicity and readability over clever solutions
- **Build Iteratively**: Start with minimal functionality and verify it works before adding complexity
- **Run Tests**: Test your code frequently with realistic inputs and validate outputs
- **Build Test Environments**: Create testing environments for components that are difficult to validate directly
- **Functional Code**: Use functional and stateless approaches where they improve clarity
- **Clean Logic**: Keep core logic clean and push implementation details to the edges
- **File Organisation**: Balance file organization with simplicity - use an appropriate number of files for the project scale
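For example, the early-return, descriptive-name, and constants guidance above might look like this in practice (a hypothetical handler, not project code):

```python
MAX_SCORE = 10  # named constant instead of a magic number scattered through the code


def handle_score_line(line: str) -> int:
    """Parse a judge score line, using early returns instead of nested conditionals."""
    stripped = line.strip()
    if not stripped:
        return 0
    if not stripped.isdigit():
        return 0
    return min(int(stripped), MAX_SCORE)


print(handle_score_line(" 7 "), handle_score_line("oops"), handle_score_line("99"))  # 7 0 10
```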
## Core Components

- `__main__.py`: Main entry point
- `api`: API for the project
- `tasks`: Tasks for the project
- `models`: Models for the project
- `loggers`: Loggers for the project
- `utils`: Utility functions for the project
- `tests`: Tests for the project
- `configs`: Configs for the project
- `data`: Data for the project

Launch Command:

```bash
python -m lmms_eval --model qwen2_5_vl --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_pixels=12845056,attn_implementation=sdpa --tasks mmmu,mme,mmlu_flan_n_shot_generative --batch_size 128 --limit 8 --device cuda:0
```
## Pull Requests

- Create a detailed message of what changed. Focus on the high level description of the problem it tries to solve, and how it is solved. Don't go into the specifics of the code unless it adds clarity.

- NEVER ever mention a `co-authored-by` or similar aspects. In particular, never mention the tool used to create the commit message or PR.
## Python Tools

## Code Formatting

1. Ruff
   - Format: `uv run ruff format .`
   - Check: `uv run ruff check .`
   - Fix: `uv run ruff check . --fix`
   - Critical issues:
     - Line length (88 chars)
     - Import sorting (I001)
     - Unused imports
   - Line wrapping:
     - Strings: use parentheses
     - Function calls: multi-line with proper indent
     - Imports: split into multiple lines

2. Type Checking
   - Tool: `uv run pyright`
   - Requirements:
     - Explicit None checks for Optional
     - Type narrowing for strings
   - Version warnings can be ignored if checks pass

3. Pre-commit
   - Config: `.pre-commit-config.yaml`
   - Runs: on git commit
   - Tools: Prettier (YAML/JSON), Ruff (Python)
   - Ruff updates:
     - Check PyPI versions
     - Update config rev
     - Commit config first
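To make the pyright requirements concrete, the sketch below shows an explicit None check for an Optional value and narrowing of a union before string methods are used. The function is a hypothetical example, not project code.

```python
from typing import Optional, Union


def normalize_answer(raw: Optional[Union[str, int]]) -> str:
    """Return a lower-cased answer string, handling None and integer input."""
    if raw is None:
        # Explicit None check: makes the fallback visible and satisfies pyright.
        return ""
    if isinstance(raw, int):
        # Type narrowing: after this branch, the checker knows `raw` is a str.
        return str(raw)
    return raw.strip().lower()


assert normalize_answer(None) == ""
assert normalize_answer(42) == "42"
assert normalize_answer("  YES ") == "yes"
```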
## Error Resolution

1. CI Failures
   - Fix order:
     1. Formatting
     2. Type errors
     3. Linting
   - Type errors:
     - Get full line context
     - Check Optional types
     - Add type narrowing
     - Verify function signatures

2. Common Issues
   - Line length:
     - Break strings with parentheses
     - Multi-line function calls
     - Split imports
   - Types:
     - Add None checks
     - Narrow string types
     - Match existing patterns

3. Best Practices
   - Check git status before commits
   - Run formatters before type checks
   - Keep changes minimal
   - Follow existing patterns
   - Document public APIs
   - Test thoroughly
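The three line-length fixes listed under Common Issues can be hard to picture; a small sketch with hypothetical names:

```python
# Imports: split into multiple lines rather than one long statement.
from collections import (
    OrderedDict,
    defaultdict,
)

# Strings: break with parentheses; adjacent string literals are concatenated.
ERROR_MESSAGE = (
    "The judge server returned an unexpected payload; "
    "falling back to a score of 0 for this sample."
)

# Function calls: spread arguments over multiple lines with proper indentation.
counts = defaultdict(
    int,
    {"mmmu": 2, "mme": 1},
)
print(ERROR_MESSAGE, OrderedDict(sorted(counts.items())))
```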

README.md

Lines changed: 4 additions & 2 deletions
@@ -14,13 +14,15 @@

 🏠 [LMMs-Lab Homepage](https://www.lmms-lab.com/) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab) | <a href="https://emoji.gg/emoji/1684-discord-thread"><img src="https://cdn3.emoji.gg/emojis/1684-discord-thread.png" width="14px" height="14px" alt="Discord_Thread"></a> [discord/lmms-eval](https://discord.gg/zdkwKUqrPy)

-📖 [Supported Tasks (90+)](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/current_tasks.md) | 🌟 [Supported Models (30+)](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/models) | 📚 [Documentation](docs/README.md)
+📖 [Supported Tasks (100+)](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/current_tasks.md) | 🌟 [Supported Models (30+)](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/models) | 📚 [Documentation](docs/README.md)

 ---

 ## Announcement

-We warmly welcome contributions from the open-source community!
+- [2025-07] 🚀🚀 We have released `lmms-eval-0.4`. Please refer to the [release notes](https://github.com/EvolvingLMMs-Lab/lmms-eval/releases/tag/v0.4.0) for more details. This is a major update with new features and improvements; users who wish to stay on `lmms-eval-0.3` should use the `stable/v0d3` branch.
+
+- [2025-04] 🚀🚀 Introducing Aero-1-Audio — a compact yet mighty audio model. We now officially support evaluation for Aero-1-Audio, including batched evaluation. Feel free to try it out!

 - [2025-07] 🎉🎉 We welcome the new task [PhyX](https://phyx-bench.github.io/), the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios.
 - [2025-06] 🎉🎉 We welcome the new task [VideoMathQA](https://mbzuai-oryx.github.io/VideoMathQA), designed to evaluate mathematical reasoning in real-world educational videos.

bug_report.md

Lines changed: 0 additions & 131 deletions
This file was deleted.

docs/README.md

Lines changed: 23 additions & 5 deletions
@@ -1,12 +1,30 @@
 # LMMs Eval Documentation

-Welcome to the docs for `lmms-eval`!
+Welcome to the documentation for `lmms-eval` - a unified evaluation framework for Large Multimodal Models!
+
+This framework enables consistent and reproducible evaluation of multimodal models across various tasks and modalities including images, videos, and audio.
+
+## Overview
+
+`lmms-eval` provides:
+- Standardized evaluation protocols for multimodal models
+- Support for image, video, and audio tasks
+- Easy integration of new models and tasks
+- Reproducible benchmarking with shareable configurations

 Majority of this documentation is adapted from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/)

 ## Table of Contents

-* To learn about the command line flags, see the [commands](commands.md)
-* To learn how to add a new moddel, see the [Model Guide](model_guide.md).
-* For a crash course on adding new tasks to the library, see our [Task Guide](task_guide.md).
-* If you need to upload your datasets into correct HF format with viewer supported, please refer to [tools](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/pufanyi/hf_dataset_docs/tools)
+* **[Commands Guide](commands.md)** - Learn about command line flags and options
+* **[Model Guide](model_guide.md)** - How to add and integrate new models
+* **[Task Guide](task_guide.md)** - Create custom evaluation tasks
+* **[Current Tasks](current_tasks.md)** - List of all supported evaluation tasks
+* **[Run Examples](run_examples.md)** - Example commands for running evaluations
+* **[Version 0.3 Features](lmms-eval-0.3.md)** - Audio evaluation and new features
+* **[Throughput Metrics](throughput_metrics.md)** - Understanding performance metrics
+
+## Additional Resources
+
+* For dataset formatting tools, see [lmms-eval tools](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/tools)
+* For the latest updates, visit our [GitHub repository](https://github.com/EvolvingLMMs-Lab/lmms-eval)
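As a closing illustration of the Commands and Run Examples guides referenced above, an evaluation run can be scripted so its configuration stays shareable. This is only a sketch built around the CLI invocation documented in CLAUDE.md; the model, tasks, and limit values are copied from that example rather than recommended settings.

```python
import subprocess

# Flag values copied from the launch command in CLAUDE.md; adjust for your own runs.
settings = {
    "--model": "qwen2_5_vl",
    "--model_args": "pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_pixels=12845056,attn_implementation=sdpa",
    "--tasks": "mmmu,mme",
    "--batch_size": "128",
    "--limit": "8",
    "--device": "cuda:0",
}

cmd = ["python", "-m", "lmms_eval"]
for flag, value in settings.items():
    cmd += [flag, value]

# check=True raises CalledProcessError if the CLI exits with a non-zero status.
subprocess.run(cmd, check=True)
```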
