Consolidate Retrieval/Reranking/Instruction Variants #1359

Merged: 19 commits merged into v2.0.0 on Nov 13, 2024

Conversation

@orionw (Contributor) commented Oct 29, 2024

This PR would consolidate much of the logic between reranking/retrieval/instruction variants.

In principle, the only difference between retrieval and reranking is that reranking is a bunch of "mini" retrieval tasks where each query's corpus is set by a separate list.

The only difference between reranking and instruction reranking is the presence of an instruction, so this PR merges those together as well (we will have to designate instruction tasks via some list, but we need to do that anyway for #1066 and others).
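
To make that concrete, here is a rough sketch (illustrative field names and values; not the exact mteb structures) of how a single reranking sample could be expressed as a mini retrieval task in the corpus/queries/relevant_docs/top_ranked layout described later in this PR:

    # Illustrative only: one reranking sample, i.e. a query plus its candidate documents,
    # written in the retrieval-style layout. Field names follow the PR description;
    # the real data structures in mteb may differ in detail.
    corpus = {
        "d1": {"title": "", "text": "relevant candidate document"},
        "d2": {"title": "", "text": "irrelevant candidate document"},
    }
    queries = {"q1": "the reranking query"}
    relevant_docs = {"q1": {"d1": 1}}   # qrels: graded relevance per query
    top_ranked = {"q1": ["d1", "d2"]}   # candidate list: restricts scoring to these docs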

Benefits are:

  • Enables reranking to use the full set of retrieval metrics, not just MAP and MRR.
  • Consolidates all the retrieval-style code (reranking, retrieval, instruction-based variants), hopefully making it easier to maintain and reducing duplication (e.g. we won't have three different places for retrieval eval code)
  • Makes it possible to use cross-encoders on reranking tasks (Cross-encoder and MTEB reranking tasks #1213)
  • Allows for instruction-based reranking tasks

Potential concerns/todos:

  • Using the same implementation for both retrieval and reranking is less computationally efficient since we don't batch over all queries at once. It should be roughly similar on the corpus side, though, since we're still batching. It depends on whether we're okay with this @KennethEnevoldsen @Muennighoff @imenelydiaker @Samoed (and please tag others interested in reranking)? I don't remember reranking being particularly compute-intensive, so perhaps this trade-off is worth it? If not, I can revert some of this.
  • I still need to remove AbsTaskInstructionRetrieval and put the extras into the AbsTaskRetrieval class (+ adding some new datasets), but I thought I would leave that for the next PR.

If people are okay with this approach, I'll go through and verify that all the Reranking tasks still work and reproduce score-wise.

Checklist

  • [ ] Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

@KennethEnevoldsen (Contributor) left a comment

Love the PR. I think the time concern is a reasonable trade-off for simplicity and maintainability, but @Muennighoff might know more about this given he made the original choice to split the two.

Have we checked that the new rerankers produce similar scores?

Review threads (all outdated/resolved):
  • mteb/abstasks/AbsTaskReranking.py (3)
  • mteb/tasks/Reranking/multilingual/MIRACLReranking.py (2)
@Samoed (Collaborator) commented Oct 30, 2024

Currently, retrieval tasks can't load multilingual datasets. I think this PR is a good opportunity to standardize this and re-upload them, since retrieval datasets have different loaders. As a base, we can use the MIRACL loader or CodeSearchNetCCRetrieval.

For example, in CMTEB, qrel_revision specifies loading qrels from a different repository. I can assist with re-uploading. In #1362 I tried specifying ConfigDict(extra="forbid") to find tasks with old, unused metadata, but found a lot of weird retrieval loaders (I commented out their irrelevant revisions to keep mteb working).
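
For reference, a minimal sketch of the kind of strict-metadata check meant here, assuming pydantic v2 and a heavily trimmed, hypothetical stand-in for the real TaskMetadata model:

    from pydantic import BaseModel, ConfigDict

    class StrictTaskMetadata(BaseModel):
        # Hypothetical, trimmed-down stand-in for mteb's TaskMetadata.
        model_config = ConfigDict(extra="forbid")  # reject unknown or stale metadata fields
        name: str
        dataset: dict

    # A task definition still carrying an old field (e.g. short_instructions) would now
    # raise a ValidationError instead of the extra key being silently ignored.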

@orionw (Contributor, Author) commented Oct 30, 2024

Thanks for the comments @KennethEnevoldsen and @Samoed! I'll go ahead and finish validating these / fixing MIRACL and turn this into a non-draft PR.

I think this PR is a good opportunity to standardize this and re-upload them, as retrieval datasets have different loaders.

@Samoed, when you say standardize, what do you mean? I think each task usually defines its own way of loading, right? Unless you mean having a default loader for this in the main codebase, so each file doesn't have to define its own custom loader?

@Samoed (Collaborator) commented Oct 30, 2024

I think each task usually defines its own way of loading, right? Unless you mean having a default loader for this in the main codebase, so each file doesn't have to define its own custom loader?

Yes, currently only retrieval tasks don’t support multilingual loading by default and require custom data loaders.

@orionw (Contributor, Author) commented Nov 6, 2024

An update on this: everything is done except for MindSmall, but I may need to make another large change to fit it in.

The reranking code has:

    for instance in self.samples:
        # Some tasks provide a list of subqueries instead of a single query string.
        num_subqueries = (
            len(instance["query"]) if isinstance(instance["query"], list) else 1
        )
        # query_emb: query embeddings of shape (num_queries, hidden_size);
        # if num_queries > 0, we take the closest document to any of the queries.
        query_emb = all_query_embs[query_idx : query_idx + num_subqueries]
where these subqueries only apply to the MindSmall reranking task.

The two scores (with or without subqueries) are actually pretty close if you ignore this step (only 0.1% off), but since it is in MTEB-original I wanted to make sure it hit parity.

The problem is that MindSmall has thousands of duplicate entries. Because my change turns each query into its own mini retrieval task, all of those duplicates get re-computed, and with that much duplication (roughly 1000x) it's simply not feasible to run the task.

I considered a solution where (like we do currently) we compute a unique set of queries and documents, but then share the retrieval evaluation code. However, this will compute the dot product on a lot more query-doc pairs than we need to.
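
Roughly, the deduplication idea looks like this (a hypothetical helper that follows the sentence-transformers encode convention; not mteb's actual code):

    def embed_unique(model, texts: list[str], **encode_kwargs):
        # Embed each distinct text once, then re-expand to the original positions.
        unique = list(dict.fromkeys(texts))            # de-duplicate, preserving order
        embeddings = model.encode(unique, **encode_kwargs)
        index = {text: i for i, text in enumerate(unique)}
        return [embeddings[index[t]] for t in texts]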

I'm not sure I see a solution, so I will probably re-add the reranking evaluator and try to have it share as much code as possible with the retrieval one. EDIT: or potentially change the DRESModel search to handle this case...

If anyone has other suggestions please let me know!

@KennethEnevoldsen (Contributor) commented Nov 7, 2024

I think the best solution is to duplicate the evaluation code over to MindSmall reranking (and then leave it as is). We can then create a v2 of MindSmall, both without duplicates and using the standardized code, and update MTEB(eng, beta) to use v2.

This keeps parity, simplifies the code (moving the complexity to a single case), and removes duplicates from the dataset.

@orionw orionw changed the title DRAFT: Consolidate Retrieval/Reranking/Instruction Variants Consolidate Retrieval/Reranking/Instruction Variants Nov 9, 2024
@orionw (Contributor, Author) commented Nov 9, 2024

Okay, I think this is finally ready for review. Sorry it ended up so big; it was hard to break this into pieces.

A list of things changed:

  • Deleted InstructionRetrievalEvaluator, RerankingEvaluator, and AbsTaskInstructionRetrieval, and significantly reduced AbsTaskReranking
  • Made all retrieval-based tasks take the same three main data structures: corpus, queries, and relevant_docs, with optional instructions (query-id -> str) and top_ranked (used for reranking; query-id -> list of strings)
  • Removed a lot of older code around instructions, such as the keywords, short_instructions, save_corpus_embeddings, etc.
  • Made reranking efficient by changing the search function in DenseRetrievalExactSearch to also handle reranking, adding torch compile to the similarity metrics when available (I saw a 2x speedup!), and adding GPU-based cosine similarity (and dot product). This makes MindSmall feasible: without the GPU and compiling, MindSmall takes too long in the dot products; with them, it takes 30 minutes (see the sketch after this list).
  • Allow for specialized metrics like p-MRR and Robustness@10 to be implemented in a way that doesn't affect all the other tasks, in utils.py
  • Refactored a bit of the evaluation code so that other custom metrics could use it when they do aggregation over the scores.
  • Added custom loaders for MIRACL and MindSmall so they load nicely until I have time to make them into clean datasets
  • Updated the tests to add InstructionReranking
  • Added a new InstructionRetrieval task, InstructIR and verified it matches the paper. Added contriever to the model section (including fixing a bug that didn't allow dot product similarity), so I could verify the InstructIR dataset.
  • Added a new Reranking task (NevIR) with a custom metric (paired accuracy) to make sure custom metrics work.
  • Moved FollowIR-based datasets into InstructionReranking which I think is a better fit
  • Refactored HFDataLoader into its own file
  • Refactored DRESModel and DenseRetrievalExactSearch into a separate file.
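
For illustration, a minimal sketch of the compiled, GPU-friendly similarity described in the list above (assumed names; not the exact code added in this PR):

    import torch
    import torch.nn.functional as F

    def cos_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Pairwise cosine similarity between query embeddings (n, d) and doc embeddings (m, d).
        a = F.normalize(a, p=2, dim=-1)
        b = F.normalize(b, p=2, dim=-1)
        return a @ b.T  # (n, m) similarity matrix

    # Compile the similarity kernel when PyTorch 2.x is available.
    if hasattr(torch, "compile"):
        cos_sim = torch.compile(cos_sim)

    # Keeping embeddings on the GPU avoids a device transfer before the dot products:
    # scores = cos_sim(query_embs.to("cuda"), doc_embs.to("cuda"))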

I ran sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on all of the reranking tasks, the instruction-based tasks, and some retrieval tasks to make sure the scores match, and they all do. There is some minor variation in the 4th decimal place (off by ~0.0001) because the previous reranking code did not have query-ids, so the sorting differs slightly on ties.

Eventually, I'd like to create new datasets on HF for the ~15 reranking tasks, but I don't think I'll have time for that in the next couple of weeks. Converting them on the fly takes a few seconds for all tasks except MIRACL, MindSmall, and WebLINXCandidatesReranking, which take 5-10 minutes.

@orionw orionw requested a review from Samoed November 9, 2024 23:38
@Samoed (Collaborator) left a comment

Awesome changes!

@orionw orionw mentioned this pull request Nov 10, 2024
@isaac-chung (Collaborator) commented:

@orionw wow! Great PR. I'll take a more detailed look shortly. Right now I only have a small thought on the removal of the Evaluators and AbsTask: removing these classes would be considered a breaking change, which should trigger a major version bump. We might want to do that only when we group a few other bigger impact changes together.

So my suggestion here is, instead of fully deleting them here and now, we could only add a warning to each of those classes that they are now deprecated and will be removed in v2.0.0. That way, anyone who has started using these will have a heads-up, and we avoid two almost back-to-back major version bumps. Those classes will be fully removed at the v2.0.0 commit.
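
For example, a deprecation shim could look roughly like this (a sketch only; the import path, class wiring, and message are illustrative):

    import warnings

    from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval  # import path assumed

    class AbsTaskInstructionRetrieval(AbsTaskRetrieval):
        # Deprecated shim kept only for backwards compatibility.
        def __init__(self, **kwargs):
            warnings.warn(
                "AbsTaskInstructionRetrieval is deprecated and will be removed in "
                "v2.0.0; use AbsTaskRetrieval instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            super().__init__(**kwargs)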

What do you think? also @KennethEnevoldsen @Samoed

@KennethEnevoldsen (Contributor) commented:

What do you think? also @KennethEnevoldsen @Samoed

We could include these in v2.0.0 along with MIEB

(Maybe we should start the v2.0.0 branch to collect some changes together?)

@isaac-chung (Collaborator) commented:

We could include these in v2.0.0 along with MIEB
(Maybe we should start the v2.0.0 branch to collect some changes together?)

Yeah, good call!

@KennethEnevoldsen KennethEnevoldsen changed the base branch from main to v2.0.0 November 11, 2024 09:28
@KennethEnevoldsen (Contributor) commented:

Moved from main to v2.0.0

@orionw (Contributor, Author) commented Nov 11, 2024

Thanks @isaac-chung and good point!

So my suggestion here is, instead of fully deleting them here and now, we could only add a warning to each of those classes that they are now deprecated, and will be removed in v2.0.0

I agree with this. Unless v2.0.0 is soon, I think this will make it much easier for people to add instruction-based tasks and we should merge it in sooner than later. @KennethEnevoldsen do you know the timeline on that?

Happy to add back in some of these evaluators even if they are unused, just so we keep compatibility.

@isaac-chung (Collaborator) commented:

@orionw @KennethEnevoldsen maybe we can merge this into the 2.0.0 preview branch when ready? That way if users want to try it out before the actual release they still can, and they won't be blocked by our timeline.

@orionw (Contributor, Author) commented Nov 11, 2024

@isaac-chung we could do that too

I was mainly commenting from a "new-task-adder" perspective rather than a user perspective. New PRs like #1425 are adding instruction-retrieval tasks, and the current InstructionRetrieval implementation makes adding new tasks quite confusing (which is totally my fault, as I added that setup earlier this year and didn't keep it broad enough).

If v2.0.0 is coming out within a month or so, then perhaps it's fine to have those PRs also merge into v2.0.0, but I was trying to avoid making it difficult for others to add tasks because of my poor decisions :)

@KennethEnevoldsen (Contributor) commented:

I believe this is a good fit for v2. Feel free to merge it into v2 as it is now (after resolving the merge conflict). Please add a release note in the description over at #1433.

@orionw orionw merged commit 9c58518 into v2.0.0 Nov 13, 2024
10 checks passed
@orionw orionw deleted the consolidate_reranking_and_retrieval branch November 13, 2024 16:30
@orionw orionw mentioned this pull request Nov 13, 2024