Releases: EvolvingLMMs-Lab/lmms-eval

v0.4: multi-node, tp + dp parallel, unified llm-as-judge api, `doc_to_message` support

30 Jul 04:31
b7b4b1d

😻 LMMs-Eval upgrades to v0.4, better evals for better models.

  • Multi-node evals with tensor-parallel (TP) + data-parallel (DP) execution.
  • New `doc_to_message` support for interleaved-modality inputs, fully compatible with the official OpenAI message format and suited to evaluating more complex tasks (see the message sketch after this list).
  • A unified LLM-as-judge API that supports more versatile metric functions, with an async mode for high concurrency and throughput (a generic async pattern is sketched below).
  • More features:
    • tool use for agentic tasks
    • a programmatic API for third-party training frameworks such as nanoVLM: call LMMs-Eval from your training loop to inspect your models on more tasks (a minimal training-loop sketch follows the next paragraph).
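
For a concrete picture of the `doc_to_message` format, here is a minimal sketch of a task hook that builds an interleaved text-and-image message. The return value follows the OpenAI chat message format; the hook name matches this release, but the exact signature and the `doc` field names (`question`, `image_url`) are illustrative assumptions, so check the task docs in the repo.

```python
def doc_to_message(doc):
    """Hypothetical doc_to_message hook for an image-QA task.

    Returns a message list in the OpenAI chat format; the `doc`
    field names used here are illustrative assumptions.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": doc["question"]},
                {"type": "image_url", "image_url": {"url": doc["image_url"]}},
                {"type": "text", "text": "Answer with a single word."},
            ],
        }
    ]
```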
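
The async judge mode follows the usual pattern of fanning out judge requests under a concurrency cap. The sketch below shows that generic pattern with the `openai` Python client and a semaphore; it illustrates the idea rather than the lmms-eval judge API itself, and the judge model and prompt are placeholders.

```python
import asyncio

from openai import AsyncOpenAI  # pip install openai

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
semaphore = asyncio.Semaphore(16)  # cap concurrent in-flight judge calls

JUDGE_PROMPT = (
    "Rate the candidate answer against the reference on a 1-5 scale. "
    "Reply with the number only.\n"
    "Question: {q}\nReference: {ref}\nCandidate: {cand}"
)

async def judge_one(q: str, ref: str, cand: str) -> str:
    async with semaphore:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder judge model
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(q=q, ref=ref, cand=cand)}],
        )
        return resp.choices[0].message.content

async def judge_all(rows):
    # Fan out all judge calls; the semaphore bounds concurrency.
    return await asyncio.gather(*(judge_one(*row) for row in rows))

# scores = asyncio.run(judge_all([("What is 2+2?", "4", "four"), ...]))
```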

This upgrade focuses on accelerating evaluation and improving consistency, addressing the needs of reasoning models with longer outputs, multiple rollouts, and scenarios where an LLM-as-judge is required for general-domain tasks.
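
As a sketch of the programmatic API mentioned above: a training loop can call LMMs-Eval between checkpoints through an lm-evaluation-harness-style entry point. The function and argument names below (`simple_evaluate`, `model`, `model_args`, `tasks`) follow that harness convention but should be treated as assumptions; consult the repo docs for the exact v0.4 signature and supported model/task names.

```python
# A minimal sketch, assuming an lm-evaluation-harness-style entry point;
# argument names and the model/task identifiers are illustrative.
from lmms_eval import evaluator

def eval_checkpoint(checkpoint_path: str) -> dict:
    results = evaluator.simple_evaluate(
        model="qwen2_vl",                            # assumed model type name
        model_args=f"pretrained={checkpoint_path}",  # assumed argument format
        tasks=["mme"],                               # assumed task name
        batch_size=8,
    )
    return results["results"]

# In a nanoVLM-style training loop:
# if step % eval_every == 0:
#     metrics = eval_checkpoint(save_dir)
#     train_logger.log(metrics)
```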

With LMMs-Eval, we are dedicated to building the frontier evaluation toolkit to accelerate the development of better multimodal models.

More at: https://github.com/EvolvingLMMs-Lab/lmms-eval

Meanwhile, we are building the next generation of fully open frontier multimodal models and new supporting frameworks.

Vibe check with us: https://lmms-lab.com

What's Changed

  • [Improvement] Accept chat template string in vLLM models by @VincentYCYao in #768
  • [Feat] fix tasks and vllm to reproduce better results. by @Luodian in #774
  • Remove the deprecated tasks related to the nonexistent lmms-lab/OlympiadBench dataset by @yaojingguo in #776
  • [Feat] LMMS-Eval 0.4 by @Luodian in #721

Full Changelog: v0.3.5...v0.4

v0.3.5

21 Jul 12:30
f7a6d6b

Full Changelog: v0.3.4...v0.3.5


v0.3.4

30 May 07:06

Full Changelog: v0.3.3...v0.3.4

v0.3.3 Fix models and add model examples

20 Apr 06:26
514082e

Full Changelog: v0.3.2...v0.3.3

v0.3.2

06 Apr 12:13

Full Changelog: v0.3.1...v0.3.2

v0.3.1

22 Feb 09:15
eb2dadc

Full Changelog: v0.3.0...v0.3.1

v0.3.0

29 Nov 09:46
754640a

Full Changelog: v0.2.4...v0.3.0

v0.2.4 add `generate_until_multi_round` to support interactive and multi-round evaluations; add models and fix glitches

03 Oct 15:33
af395ae

What's Changed

  • [Fix] Fix bugs in returning result dict and bring back anls metric by @kcz358 in #221
  • fix: fix wrong args in wandb logger by @Luodian in #226
  • [feat] Add check for existence of accelerator before waiting by @Luodian in #227
  • add more language tasks and fix fewshot evaluation bugs by @Luodian in #228
  • Remove unnecessary LM object removal in evaluator by @Luodian in #229
  • [fix] Shallow copy issue by @pufanyi in #231
  • [Minor] Fix max_new_tokens in video llava by @kcz358 in #237
  • Update LMMS evaluation tasks for various subjects by @Luodian in #240
  • [Fix] Fix async append result in different order issue by @kcz358 in #244
  • Update the version requirement for transformers by @zhijian-liu in #235
  • Add new LMMS evaluation task for wild vision benchmark by @Luodian in #247
  • Add raw score to wildvision bench by @Luodian in #250
  • [Fix] Strict video to be single processing by @kcz358 in #246
  • Refactor wild_vision_aggregation_raw_scores to calculate average score by @Luodian in #252
  • [Fix] Bring back process result pbar by @kcz358 in #251
  • [Minor] Update utils.py by @YangYangGirl in #249
  • Refactor distributed gathering of logged samples and metrics by @Luodian in #253
  • Refactor caching module and fix serialization issue by @Luodian in #255
  • [Minor] Bring back fix for metadata by @kcz358 in #258
  • [Model] support minimonkey model by @white2018 in #257
  • [Feat] add regression test and change saving logic related to output_path by @Luodian in #259
  • [Feat] Add support for llava_hf video, better loading logic for llava_hf ckpt by @kcz358 in #260
  • [Model] support cogvlm2 model by @white2018 in #261
  • [Docs] Update and sort current_tasks.md by @pbcong in #262
  • fix error name with infovqa task by @ZhaoyangLi-nju in #265
  • [Task] Add MMT and MMT_MI (Multiple Image) Task by @ngquangtrung57 in #270
  • mme-realworld by @yfzhang114 in #266
  • [Model] support Qwen2 VL by @abzb1 in #268
  • Support new task mmworld by @jkooy in #269
  • Update current tasks.md by @pbcong in #272
  • [feat] support video evaluation for qwen2-vl and add mix-evals-video2text by @Luodian in #275
  • [Feat][Task] Add multi-round evaluation in llava-onevision; Add MMSearch Benchmark by @CaraJ7 in #277
  • [Fix] Model name None in Task manager, mix eval model specific kwargs, claude retrying fix by @kcz358 in #278
  • [Feat] Add support for evaluation of Oryx models by @dongyh20 in #276
  • [Fix] Fix the error when running models caused by generate_until_multi_round by @pufanyi in #281
  • [fix] Refactor GeminiAPI class to add video pooling and freeing by @pufanyi in #287
  • add jmmmu by @AtsuMiyai in #286
  • [Feat] Add support for evaluation of InternVideo2-Chat && Fix evaluation for mvbench by @yinanhe in #280

Full Changelog: v0.2.3...v0.2.4

v0.2.3.post1

04 Sep 15:16
9d00bfa

What's Changed

  • [Fix] Fix bugs in returning result dict and bring back anls metric by @kcz358 in #221
  • fix: fix wrong args in wandb logger by @Luodian in #226

Full Changelog: v0.2.3...v0.2.3.post1

v0.2.3 add language evaluations and remove registration to speed up loading tasks and models

01 Sep 11:21
30a0745

What's Changed

  • Update the blog link by @pufanyi in #196
  • Bring back PR#52 by @kcz358 in #198
  • fix: update from previous model_specific_prompt to current lmms_eval_kwargs to avoid warnings by @Luodian in #206
  • [Feat] SGLang SRT commands in one go, async input for openai server by @kcz358 in #212
  • [Minor] Add kill sglang process by @kcz358 in #213
  • Support textonly inference for LLaVA-OneVision. by @CaraJ7 in #215
  • Fix videomme evaluation by @zhijian-liu in #209
  • [feat] remove registeration logic and adding language evaluation tasks. by @Luodian in #218

Full Changelog: v0.2.2...v0.2.3