Skip to content

The official evaluation suite and dynamic data release for MixEval.

Notifications You must be signed in to change notification settings

JinjieNi/MixEval

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🏠 Homepage | 🏆 Leaderboard | 📜 arXiv | 📝 blog | 🤗 HF Dataset | 🤗 HF Paper | 𝕏 Twitter


Static Badge Twitter GitHub Repo stars Static Badge Static Badge


Benchmark correlations (%) with Chatbot Arena Elo, against the total costs of evaluating a single GPT-3.5-Turbo-0125 model. MixEval and MixEval-Hard show the highest correlations with Arena Elo and Arena Elo (En) among leading benchmarks. We reference the crowdsourcing price for Amazon Mechanical Turk ($0.05 per vote) when estimating the cost of evaluating a single model on Chatbot Arena (approximately $2,936). Chatbot Arena is prohibitively expensive, while MixEval and MixEval-Hard are cheap and cost-effective alternatives. For more details, please refer to our paper.


⚡ News

[2024-11-09] Our evaluation suite now supports local model parsers! Check here for details!

[2024-10-20] MixEval-X is released! Checkout the project page, paper, and github repo to learn more about this real-world any-to-any benchmark!🌟 (Note that MixEval-X is the any-to-any version of MixEval while not including the text2text modality. MixEval and MixEval-X are two separate evaluations❗)

[2024-09-27] MixEval is accepted to Neurips 2024.

[2024-06-29] Our evaluation suite now supports evaluating local checkpoints, check here for details!

[2024-06-29] Our evaluation suite now supports other apis for model parser, check here.

MixEval

We introduce MixEval, a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures, which evaluates LLMs with a highly capable model ranking (i.e., 0.96 correlation with Chatbot Arena) while running locally and quickly (6% the time and cost of running MMLU), with its queries being stably and effortlessly updated every month to avoid contamination.

The MixEval consists of two benchmarks: MixEval and MixEval-Hard, both updated with our fast, stable pipeline periodically. Both of them contain two splits, i.e., free-form and multiple-choice. Their relationships are presented below:

 MixEval (dynamic)
    │
    ├── MixEval
    │   ├──free-form.json
    │   └──multiple-choice.json
    │
    └── MixEval-Hard
        ├──free-form.json
        └──multiple-choice.json

See our homepage and paper for more details!


Click-and-Go LLM Evaluation Suite

This repository hosts the evaluation code and dynamic data release for MixEval. The current dynamic benchmark version is displayed at the top of this page. We offer a reliable click-and-go evaluation suite compatible with both open-source and proprietary models, which includes model response generation and score computation. Additionally, this evaluation suite facilitates straightforward registration of custom models and benchmark data.

As demonstrated in the paper, traditional rule-based parsers exhibit significant instability and are prone to considerable errors. We employ either GPT-3.5-Turbo or open-source models as our model parser, which has been proven stable in our and this study.

ATTENTION❗ Feel free to use your own evaluation code to evaluate with MixEval data. We provide the guidelines here.


Quick Start

(Step 1) Clone repo and setup the environment:

git clone https://github.com/Psycoy/MixEval.git
cd MixEval
conda create -n MixEval python=3.11 --yes
conda activate MixEval
bash setup.sh

# setup done

Note: You may have to update the dependencies in setup.py if you are evaluating the up-to-date models.

(Step 2) Setup the OpenAI API key for model parser. Create .env file under root dir (MixEval/) and add the below line to it:

MODEL_PARSER_API=<your openai api key>

The values in Leaderboard use GPT-3.5-Turbo-0125 as the default model parser. Open-source model parsers are also supported, check here for details.

If you are using Azure or APIs for model parser, check here.

(Step 3) Run evaluation and get results. That's all!

python -m mix_eval.evaluate \
    --model_name gemma_11_7b_instruct \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 20 \
    --max_gpu_memory 5GiB \
    --output_dir mix_eval/data/model_responses/ \
    --api_parallel_num 20

If you want to evaluate models that are not included in mixeval.models.__init__, see here for the simple steps of new model registration.

This command will run both inference and score computation. If you want to run model inference only, check here; if you want to run score computation only, check here.

Model response files and scores will be saved to <output_folder>/<model_name>/<benchmark>/<version>/, and in this case, it's mix_eval/data/model_responses/gemma_11_7b_instruct/mixeval_hard/2024-06-01/. We take the overall score as the reported score in Leaderboard.

Check here if you are evaluating a local checkpoint.

ATTENTION❗ It's important to read the essential configurations here before running the evaluation.


Registering New Models

(Step 1) Add your model file to mixeval/models/ with name your_model_name.py and write the model class in it with the name Model_Class_Name.

  • Open-source chat models are inherited from mixeval.models.base.ChatModel (example file: llama_3_8b_instruct.py).
  • Open-source base models are inherited from mixeval.models.base.BaseModel (example file: llama_3_8b.py).
  • Proprietary models are inherited from mixeval.models.base_api.APIModelBase (example file: gpt_4_turbo_2024_04_09.py, add your api key in .env).
  • In most cases, all you need to do is write a simple model class with a single __init__ function. However, if your model needs more setup, e.g., it requires a different build_model() function, you should override the corresponding function or variable of the parent model.
  • The model file name should be the same with the name you pass to the @register_model() decorator on top of the model class.

(Step 2) Add your model to mixeval.models.__init__.AVAILABLE_MODELS.

  • The entry you add should be in the form of your_model_name: Model_Class_Name. See other models in AVAILABLE_MODELS as a reference.

Only Performing Model Inference

Sometimes you may want to do model inference without computing the scores. You can achieve this by setting the --inference_only flag when running the mix_eval.evaluate module:

python -m mix_eval.evaluate \
    --model_name gemma_11_7b_instruct \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 20 \
    --max_gpu_memory 5GiB \
    --output_folder mix_eval/data/model_responses/ \
    --inference_only

Model response files will be saved to <output_folder>/<model_name>/<benchmark>/<version>/, and in this example it's mix_eval/data/model_responses/gemma_11_7b_instruct/mixeval_hard/2024-06-01/.

Check here if you are evaluating a local checkpoint.

ATTENTION❗ It's important to read the essential configurations here before running the evaluation.

You can check whether the model response files are complete after running the inference:

python -m mix_eval.utils.check_eval_complete \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --chat_models_to_check \
    gpt_4o \
    llama_3_70b_instruct \
    claude_3_opus \
    --base_models_to_check \
    none \
    --model_response_dir mix_eval/data/model_responses/ \
    --out_path mix_eval/data/model_responses/eval_checks.log

The checking results will be written to --out_path; only problematic files will be recorded.


Only Computing Scores

If you want to separately compute the scores, you should

  1. Prepare your model response files. You can use either our evaluation suite (refer to here) or your own (refer to the example response file formats and protocols specified here).
  2. Run the score computation script:
    python -m mix_eval.compute_metrics \
        --benchmark mixeval_hard \
        --version 2024-06-01 \
        --model_response_dir mix_eval/data/model_responses/ \
        --api_parallel_num 20 \
        --models_to_eval \
        gemma_11_7b_instruct \
        gpt_4o \
        claude_3_opus
    

You should set the --api_parallel_num properly according to your OpenAI user tier to avoid rate limits. In general, if you are a Tier-5 user, you can set --api_parallel_num to 100 or more to parse results in 30 seconds.

If you are using Azure or APIs for model parser, check here.

Open-source model parsers are also supported, check here for details.

If you are parsing base models' responses, set the --extract_base_model_response flag to only retain the meaningful part in models' response to get more stablized parsing results.

If you finished the model parsing some time ago and now want to display the model results again, add --compute_score_from_judged_file flag to avoid calling the model parser api again to save your budget. You have to make sure that there exists the parsed files with the name of judge_results_ff_model_judge_gpt-3.5-turbo-0125 and judge_results_mp_model_judge_gpt-3.5-turbo-0125 under the target model response folder, where gpt-3.5-turbo-0125 denotes the model parser name, ff denotes free-form, mp denotes multiple-choice.


What is MixEval?

See our homepage and paper for more details!

MixEval is an approach that bridges the gap between real-world user queries and efficient, reproducible evaluation by leveraging user queries mined from the web and matching them with similar queries from existing benchmarks. MixEval is also the proposed benchmark built with this approach.

MixEval-Hard is the hard version of MixEval, designed to enhance the benchmark's ability to distinguish strong models. It is sampled from MixEval based on model evaluation results, with a higher probability of selecting harder queries. To address distribution deviation, we introduce a rejective sampling process to ensure that the distribution of MixEval-Hard aligns with that of wild queries.

Dynamic evaluation is introduced to mitigate the contamination issue. We periodically update the data points in MixEval and MixEval-Hard using our fast, stable pipeline, which performs benchmark mixture with a different batch of wild queries from the same distribution, showing low variance (0.36 Std. on a 0-100 scale) and significant version difference (85% unique query ratio).


Why to Use MixEval Benchmarks?

MixEval offers five significant advantages for practitioners:

  • Accurate model ranking, demonstrated by a 0.96 correlation with Chatbot Arena1.
  • Fast, cheap and reproducible execution, requiring only 6% the time and cost of MMLU and with no dependence on human input.
  • Dynamic benchmarking enabled by low-effort and stable updating mechanism.
  • A comprehensive and less biased query distribution, as it bases queries on a large-scale web corpus.
  • A fair grading process, ensured by the ground-truth-based grading mechanism.

How Effective is MixEval as a Benchmark Mixture Approach?

MixEval is effective as

  • MixEval and MixEval-Hard achieve the highest correlation with Arena Elo and Arena Elo (En) among all benchmarks.
  • MixEval improves the correlation with Arena Elo and Arena Elo (En) across all its main benchmark splits.
  • MixEval outperforms both benchmark-level and uniform mixtures.
  • MixEval effectively maps real-world user queries to ground-truth-based benchmarks.

🦾 Contribute

Feel free to hit the ⭐star button or 🦾contribute! We review new issues and PRs regularly and will acknowledge your contributions!

We would like to extend our heartfelt gratitude to the following contributors for their exceptional commitment to this repository:

  • @RodriMora
  • @teknium1
  • @philschmid
  • @carstendraschner

📑 Citation

If you found this repository useful, please consider 📑citing:

@article{ni2024mixeval,
  title={MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures},
  author={Ni, Jinjie and Xue, Fuzhao and Yue, Xiang and Deng, Yuntian and Shah, Mahir and Jain, Kabir and Neubig, Graham and You, Yang},
  journal={arXiv preprint arXiv:2406.06565},
  year={2024}
}