Leaderboard changes for massive multilinguality #674
-
Thanks for starting this discussion @Muennighoff. I was thinking of a solution somewhat like this. It seems more scalable long term and allows users to ignore tasks that they don't care about and add tasks that they do. You could simply have a "select benchmark" menu and then hide the rest under "advanced filters". This falls in line with your suggestion, with the one change that the 'tab' becomes a select menu. It can also be implemented in steps (adding advanced filters later on). One thing we might consider here as well is the averaging: one solution is to implement an "average by" option that defaults to "task_type" (the current implementation) but lets you select multiple keys, e.g. "language" and "task_type".
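To make the "average by" idea concrete, here is a minimal sketch of one possible reading: scores are first averaged within each group defined by the selected keys, and the group means are then averaged. The function name `average_by`, the row layout, and the example scores are all hypothetical and not taken from the MTEB codebase.

```python
from collections import defaultdict
from statistics import mean


def average_by(results, keys=("task_type",)):
    """Macro-average: mean score inside each group, then mean over the groups.

    `results` is a list of dicts with a "score" entry plus the grouping keys.
    This row layout is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for row in results:
        group_id = tuple(row[k] for k in keys)
        groups[group_id].append(row["score"])
    group_means = {g: mean(scores) for g, scores in groups.items()}
    return mean(list(group_means.values())), group_means


# Made-up scores for a single model.
results = [
    {"task_type": "Classification", "language": "eng", "score": 0.80},
    {"task_type": "Classification", "language": "dan", "score": 0.60},
    {"task_type": "STS", "language": "eng", "score": 0.90},
]

# Default: group by task type only.
overall, per_group = average_by(results)

# Selecting several keys, as in the "language" + "task_type" example.
overall_fine, per_group_fine = average_by(results, keys=("task_type", "language"))
```

With the default key, the two Classification scores are merged into one group before the final mean, so a task type with many datasets does not dominate the overall number. Adding "language" as a second key splits them again.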
-
How should we adapt the leaderboard to best incorporate the many newly added languages as part of MMTEB? I think this will also inform the model running phase, so it makes sense to discuss it now. I'd suggest something like:
Other Languages & Other tabs for Classif. & STS.
(A) Add tabs like English-X, French-X, Chinese-X. Flores would make this 200 tabs and each would have 199 columns.
(B) Make it one tab per dataset, i.e. Flores has one, Tatoeba has one etc.
(A) We retire the tab: (a) add the new one as a tab, (b) run the top ~10 models on the new one and put them up, (c) leave the old one there for some time to bridge. E.g. if we add new Chinese datasets and make a new Chinese Overall Pool, we rename the current Chinese Overall Pool to Chinese_Legacy & tell people to benchmark on the new pool instead.
(B) We make the tab dynamic: We could add subtabs corresponding to coverage levels of the pool, so that if the pool changes (but some datasets remain), the models just move to a lower coverage level. The levels could be e.g. 100%, 75%, 50%, corresponding to the share of datasets in the pool for which we have eval scores for the model. Current models on the Chinese tab which we don't re-eval would then only be visible on the 75% & 50% subtabs going forward, assuming that 75% of the datasets in the pool stay the same.
(C) We expand the tab: Similar to (A), but instead of retiring the old tab we add a new subtab & keep the old one. E.g. there'd be Overall > Chinese > Chinese v1 & Chinese v2.
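The coverage-level idea in (B) could be sketched roughly as follows: a model's coverage is the share of datasets in the current pool it has scores for, and it is listed on every subtab whose threshold it meets. The function names, the thresholds as fractions, and the dataset names are hypothetical placeholders, not existing leaderboard code.

```python
def coverage(model_scores, pool):
    """Share of datasets in `pool` for which the model has an eval score."""
    if not pool:
        return 0.0
    return len(set(model_scores) & set(pool)) / len(pool)


def visible_subtabs(model_scores, pool, levels=(1.0, 0.75, 0.5)):
    """Return the coverage levels (subtabs) on which the model would be listed."""
    c = coverage(model_scores, pool)
    return [level for level in levels if c >= level]


# Hypothetical updated pool: three datasets kept, one newly added.
pool_v2 = ["DatasetA", "DatasetB", "DatasetC", "DatasetNew"]

# A model evaluated only on the old pool (scores are made up).
old_model = {"DatasetA": 0.71, "DatasetB": 0.64, "DatasetC": 0.80, "DatasetOld": 0.55}

print(coverage(old_model, pool_v2))
print(visible_subtabs(old_model, pool_v2))
```

Here the old model covers 3 of the 4 datasets in the new pool, so it drops off the 100% subtab but stays on the 75% and 50% subtabs without being re-evaluated. How the score shown on each subtab is averaged is a separate question this sketch leaves open.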
Other changes, maybe not needed for this update:
8. Add two levels of tabs within each task tab such that it becomes Task -> Domain -> Language (to separate out Law & other things later)
9. Introduce a new MTEB English-only average that focuses on newer datasets or an MTEB Lite average that is easier to run
What do you think? What do you agree or disagree with? Contributions on this should also give points, I think. Maybe the reviewer decides the number ad hoc for each PR? It'd be great if someone is willing to contribute on this 🙌
cc @tomaarsen @orionw @KennethEnevoldsen @imenelydiaker & anyone else interested 😊