Leaderboard changes for massive multilinguality #674

Muennighoff · 2024-05-12T05:43:28Z

Muennighoff
May 12, 2024
Maintainer

How should we adapt the leaderboard to best incorporate the many newly added languages as part of MMTEB? I think this will also inform the model running phase, so it makes sense to discuss it now. I'd suggest something like:

Add new tabs for each language within each task except Bitext. Remove Other Languages & Other tabs for Classif. & STS.
For Bitext, I see two options (I lean toward A):
(A) Add tabs like English-X, French-X, Chinese-X. Flores would make this 200 tabs and each would have 199 columns.
(B) Make it one tab per dataset, i.e. Flores has one, Tatoeba has one etc.
Every tab has a pool of datasets defined here to allow easy benchmarking for a specific tab, maybe also with defined dataset revisions.
For the Overall tab, add a "Multilingual" tab that will be our "Multilingual MTEB" average, computed over a pool of maybe 50 multilingual datasets we deem the best (probably optimizing for some form of diversity in terms of tasks, languages, domains).
If we add new datasets to existing language tabs, I see three options:
(A) We retire the tab: (a) add the new one as a tab, (b) run the top ~10 models on the new one and put them up, (c) leave the old one there for some time to bridge. E.g. if we add new Chinese datasets and make a new Chinese Overall Pool, we rename the current Chinese Overall Pool to Chinese_Legacy & tell people to benchmark on the new pool instead.
(B) We make the tab dynamic: We could add subtabs corresponding to coverage levels of the pool, so that if the pool changes (but some datasets remain), the models just move to a lower coverage level. It could e.g. be 100%, 75%, 50%, corresponding to the share of datasets in the pool for which we have eval scores for the model. Current models on the Chinese tab that we don't re-eval would then only be visible on the 75% & 50% subtabs going forward, assuming that 75% of datasets in the pool stay the same.
(C) We expand the tab: Similar to (A), but just instead of retiring we add a new subtab & keep the old one. E.g. there'd be Overall > Chinese > Chinese v1 & Chinese v2.
Maybe allow users to select tasks & languages (maybe even datasets) in the UI to construct their own average

Other changes, maybe not needed for this update:
8. Add two levels of tabs within each task tab such that it becomes Task -> Domain -> Language (to separate out Law & other things later)
9. Introduce a new MTEB English-only average that focuses on newer datasets or an MTEB Lite average that is easier to run
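
To make option (B) for changed pools ("make the tab dynamic") a bit more concrete, here is a minimal sketch of how coverage levels could decide which subtabs a model shows up on. Everything in it is an assumption for illustration: the function names, the `ds_*` dataset names, and the choice to average only over the datasets a model was actually evaluated on are hypothetical, not existing leaderboard code.

```python
from typing import Dict, List, Optional

# Hypothetical coverage thresholds for the subtabs, highest first.
COVERAGE_LEVELS = [1.0, 0.75, 0.5]


def coverage(pool: List[str], scores: Dict[str, float]) -> float:
    """Fraction of datasets in the pool for which the model has a score."""
    if not pool:
        return 0.0
    return sum(1 for dataset in pool if dataset in scores) / len(pool)


def visible_subtabs(pool: List[str], scores: Dict[str, float]) -> List[float]:
    """Coverage levels whose threshold the model meets (it appears on each of them)."""
    model_coverage = coverage(pool, scores)
    return [level for level in COVERAGE_LEVELS if model_coverage >= level]


def pool_average(pool: List[str], scores: Dict[str, float]) -> Optional[float]:
    """Mean score over the pool datasets the model was evaluated on (None if there are none)."""
    values = [scores[dataset] for dataset in pool if dataset in scores]
    return sum(values) / len(values) if values else None


# A model evaluated on the old pool (ds_a..ds_d); the new pool swaps ds_d for ds_e.
old_model_scores = {"ds_a": 60.0, "ds_b": 70.0, "ds_c": 80.0, "ds_d": 50.0}
new_pool = ["ds_a", "ds_b", "ds_c", "ds_e"]

print(coverage(new_pool, old_model_scores))         # 3 of 4 datasets covered
print(visible_subtabs(new_pool, old_model_scores))  # meets the 75% and 50% levels only
```

With this scheme a model that is not re-evaluated drops off the 100% subtab as soon as the pool changes, but stays comparable on the lower levels.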

What do you think/disagree/agree with? Also, I think contributions on this should give points; maybe the reviewer decides the number ad hoc for each PR? It'd be great if someone is willing to contribute on this 🙌
cc @tomaarsen @orionw @KennethEnevoldsen @imenelydiaker & anyone else interested 😊

KennethEnevoldsen · 2024-05-12T12:12:36Z

KennethEnevoldsen
May 12, 2024
Maintainer

Thanks for starting this discussion @Muennighoff. I was thinking of a solution somewhat like this:

[Image: leaderboard mockup]

It seems more scalable long term and allows users to ignore tasks that they don't care about and add tasks that they do. You could simply do a "select benchmark" and then hide the rest in "advanced filters".

This change is in line with your suggestion, the only difference being that the 'tab' becomes a select menu. It can also be implemented in steps (adding advanced filters later on). One thing we might also consider here is the averaging: one solution is to implement "average by", which can default to "task_type" (the current implementation) but lets you select multiple fields, e.g. "language" and "task_type".
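
A rough sketch of what "average by" could look like, under my own assumptions: scores are first averaged within each group defined by the selected fields, and the group means are then averaged. The record layout, the `average_by` name, and the `ds_*` datasets and scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def average_by(
    results: List[dict], keys: Sequence[str] = ("task_type",)
) -> Tuple[float, Dict[tuple, float]]:
    """Average scores within each group defined by `keys`, then average the group means."""
    groups = defaultdict(list)
    for row in results:
        group = tuple(row[key] for key in keys)
        groups[group].append(row["score"])
    group_means = {group: mean(values) for group, values in groups.items()}
    return mean(list(group_means.values())), group_means


# Hypothetical per-dataset results for one model.
results = [
    {"dataset": "ds_a", "task_type": "Classification", "language": "eng", "score": 80.0},
    {"dataset": "ds_b", "task_type": "Classification", "language": "fra", "score": 60.0},
    {"dataset": "ds_c", "task_type": "STS", "language": "eng", "score": 40.0},
]

overall_by_task, per_task = average_by(results)  # default grouping: task_type
overall_by_lang, per_lang = average_by(results, keys=("language",))
overall_by_both, per_both = average_by(results, keys=("language", "task_type"))
```

Grouping by several fields at once just makes the groups finer, so the same function covers the single-field and multi-field cases.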

5 replies

Muennighoff May 12, 2024
Maintainer Author

That looks great! So by default only one leaderboard would show? I.e. no tabs anymore?

KennethEnevoldsen May 12, 2024
Maintainer

That would be the intention at least. Assuming people like it

orionw May 13, 2024
Maintainer

I like this mockup also -- we can keep the pre-defined benchmarks as a prominent drop down but also allow the mix-and-matching. I think this also looks relatively cleaner compared to all the many tabs and sub-tabs.

> run the top ~10 models on the new one and put them up

I think this is the only scalable solution if we want to update them, but I worry it will be increasingly complex to maintain

KennethEnevoldsen May 13, 2024
Maintainer

> I worry it will be increasingly complex to maintain

Def. worry about this one as well. A solution for now might be to rely on curated benchmarks.

isaac-chung May 13, 2024
Collaborator

I like the mockup as well. In terms of offering curated/suggested groups (like the previous Chinese/French tabs), we could provide groupings by genus (e.g. Romance) and/or language family (e.g. Indo-European) as additional select menus?
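
As a small sketch of how such select menus could filter languages, assuming we keep a metadata table per language (the `LANGUAGE_METADATA` table and `select_languages` helper below are hypothetical, and the four entries are only examples):

```python
from typing import List, Optional

# Hypothetical metadata table keyed by ISO 639-3 code.
LANGUAGE_METADATA = {
    "fra": {"genus": "Romance", "family": "Indo-European"},
    "spa": {"genus": "Romance", "family": "Indo-European"},
    "deu": {"genus": "Germanic", "family": "Indo-European"},
    "cmn": {"genus": "Sinitic", "family": "Sino-Tibetan"},
}


def select_languages(genus: Optional[str] = None, family: Optional[str] = None) -> List[str]:
    """Return the sorted language codes matching the selected genus and/or family."""
    return sorted(
        code
        for code, meta in LANGUAGE_METADATA.items()
        if (genus is None or meta["genus"] == genus)
        and (family is None or meta["family"] == family)
    )


print(select_languages(genus="Romance"))         # ['fra', 'spa']
print(select_languages(family="Indo-European"))  # ['deu', 'fra', 'spa']
```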
