
Add new benchmark MAIR #1425

Open · wants to merge 11 commits into main

Conversation


@sunnweiwei commented Nov 10, 2024

Fixes #1426

  • Added MAIR (https://arxiv.org/abs/2410.10127, EMNLP 2024), a diverse benchmark for instructed IR.
  • The data class is defined in mteb/tasks/MAIR/eng/MAIR.py, generating 126 data classes for the 126 tasks in MAIR on the fly (see the sketch below).
  • In benchmarks/benchmarks.py, the benchmark configuration has been added.
  • Tested several models, and the results are consistent with those of the original repo: https://github.com/sunnweiwei/mair.
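
For illustration, here is a minimal sketch of how task classes could be generated on the fly from one shared implementation. The class names, task list, and load_data body are hypothetical placeholders, not the actual code in this PR:

```python
# Hypothetical sketch only: generate one task class per MAIR task with type(),
# so all 126 tasks can share a single implementation. Names are placeholders.
MAIR_TASKS = ["IFEval", "SWE-Bench"]  # the real list would contain all 126 tasks

class MAIRBase:
    """Shared retrieval logic; a real MTEB task would subclass AbsTaskRetrieval."""

    task_name: str = ""

    def load_data(self, **kwargs):
        # Each generated class loads only its own task's queries and docs.
        print(f"loading queries/docs for {self.task_name}")

for _name in MAIR_TASKS:
    _cls = type(f"MAIR{_name.replace('-', '')}", (MAIRBase,), {"task_name": _name})
    globals()[_cls.__name__] = _cls  # exposes e.g. MAIRIFEval at module level
```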

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

The added data comes from https://arxiv.org/abs/2410.10127, which introduces a benchmark for instructable information retrieval. It contains 126 real-world retrieval tasks across 6 domains with manually annotated instructions, and the data has been subsampled to reduce evaluation costs.

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command (see the example after this checklist).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
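
As a reference for reviewers, the runs above can be reproduced roughly like this via the Python API (roughly equivalent to the CLI command in the checklist). The task name is a placeholder, since the concrete names of the 126 generated MAIR tasks are defined in the PR:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Rough sketch, not a verified command: "MAIRExampleTask" is a placeholder name.
for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation = mteb.MTEB(tasks=mteb.get_tasks(tasks=["MAIRExampleTask"]))
    # Writes JSON result files under ./results for each model/task pair.
    evaluation.run(model, output_folder="results")
```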

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded (example after this checklist) using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested that the implementation works on a representative set of tasks.
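
A quick way to exercise the two loading paths named in this checklist, assuming a recent mteb version (the model name is only an example):

```python
import mteb

# Sanity check for the checklist above; the model name is just an example.
meta = mteb.get_model_meta("intfloat/multilingual-e5-small")
model = mteb.get_model("intfloat/multilingual-e5-small")
print(meta.name)
```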

@shizhl-code

Following the above process, I am currently opening a pull request in https://github.com/embeddings-benchmark/results to submit our evaluation results for the newly added MAIR benchmark.
However, I am not very clear about the format of the result file.

@Samoed
Collaborator

Samoed commented Nov 10, 2024

When you run your tasks, MTEB will generate a folder with results from your runs, and you can submit that folder
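
For illustration, the submitted folder is just the output folder written by an mteb run. A rough, version-dependent sketch of inspecting it (the layout described in the comments is an assumption, not verified against the results repo):

```python
from pathlib import Path

# List the JSON files mteb wrote; this whole "results" folder is what gets
# submitted to embeddings-benchmark/results. The layout is roughly
# results/<model_name>/<revision>/<TaskName>.json, but may vary by mteb version.
for path in sorted(Path("results").rglob("*.json")):
    print(path)
```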

Review comment on the data-loading code in the diff:

    return
    self.corpus, self.queries, self.relevant_docs = {}, {}, {}
    queries_path = self.metadata_dict["dataset"]["path"]
    docs_path = self.metadata_dict["dataset"]["path"].replace("-Queries", "-Docs")
Collaborator

Can you place queries and docs in the same repo?

Author

@sunnweiwei Nov 10, 2024


Thanks for the feedback. To keep queries and docs in one repo, I could create a separate repo for each task.

But is that necessary? I think having two repos (one for queries, one for docs) would be easier to manage than having more than a hundred repos, one per task.

Collaborator

You can create different splits for queries and documents in the same repo
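
For illustration, on the Hugging Face Hub this could mean one dataset repo with separate configs or splits for queries and documents, loaded along these lines (repo and config names are hypothetical):

```python
from datasets import load_dataset

# Hypothetical repo/config names illustrating the one-repo suggestion above.
queries = load_dataset("user/MAIR-ExampleTask", "queries", split="test")
corpus = load_dataset("user/MAIR-ExampleTask", "corpus", split="test")
```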

Author

Thanks, I see.

One issue is that the data in MAIR has a two-level structure: task → subtasks, since some tasks contain multiple subtasks (e.g., IFEval, SWE-Bench). It's tricky to preserve this structure without flattening it into a single repository.

I also think people may not need to download all of the data if they're only interested in evaluating a few specific tasks. So, if we still need to put queries and docs in a single repo, the best option might be to generate 126 separate repos (one per task).

Or do you have any other suggestions?

@shizhl-code

Another question: our benchmark has two settings, i.e., evaluating the model with and without instructions. Should I store the results with instructions and without instructions in two separate files?
I would appreciate any feedback!

@Samoed
Collaborator

Samoed commented Nov 10, 2024

If you have results for both instruct and non-instruct, it might be better to create separate tasks, though @orionw might have a clearer perspective on this

@orionw
Contributor

orionw commented Nov 10, 2024

+1 to adding a duplicate task if you have a specific instruction you want models to use for each dataset. Otherwise models can define their own instructions, and in that case you could just submit results to the same task but with a different prompt in the meta info.

If you're adding an instruction variant (and once #1359 is in), you'd just need to add a version of those tasks with all the same attributes, plus a config/attribute called "self.instruction" in (query-id -> instruction_text) format.
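
A loose sketch of the (query-id -> instruction_text) mapping described above; the attribute name and wiring depend on #1359 and are not final, and the class and queries here are made up for illustration:

```python
# Hypothetical illustration only; not the actual MTEB / #1359 interface.
class ExampleInstructionTaskVariant:
    def __init__(self):
        # Same attributes as the base task, plus a query-id -> instruction_text map.
        self.instruction = {
            "q1": "Retrieve the patch that resolves this GitHub issue.",
            "q2": "Find the response that satisfies the stated constraint.",
        }
```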

@sunnweiwei
Author

Hi. If we have duplicate tasks with different instructions, will they appear in separate tables on the leaderboard? Like, would there be one called (XXX with instruction) and another called (XXX without instruction)?

@orionw
Contributor

orionw commented Nov 10, 2024

@sunnweiwei they would appear as different datasets, yes. So you could have one leaderboard with them and one without, if desired. Or push it all together into one benchmark.

Does that answer the question? Or do you mean more than one instruction per dataset?

@sunnweiwei
Author

Thanks for the answer! I was thinking of putting them into one table for benchmarking purposes, maybe adding a column to indicate whether instructions were used. Then people could compare models with and without instructions in the same table. Good to know we can do this.
