
[Bug/Model Request]: intfloat/multilingual-e5-large should use average pooling #384

Open
ITHwang opened this issue Nov 1, 2024 · 3 comments

ITHwang commented Nov 1, 2024

What happened?

Hi, I'm using intfloat/multilingual-e5-large for a retrieval task, and I found that when E5OnnxEmbedding embeds texts with this model, the model output is pooled with CLS pooling:

class E5OnnxEmbedding(OnnxTextEmbedding):
    ...

class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[np.ndarray]):
    """Implementation of the Flag Embedding model."""
    ...

    def _post_process_onnx_output(self, output: OnnxOutputContext) -> Iterable[np.ndarray]:
        embeddings = output.model_output
        # embeddings[:, 0] selects only the first (CLS) token's hidden state
        return normalize(embeddings[:, 0]).astype(np.float32)
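To see concretely how the two strategies differ, here is a minimal sketch (pure NumPy, with made-up toy values rather than real model output) comparing CLS pooling against masked average pooling:

import numpy as np

# Hypothetical output: batch of 1, sequence length 3, hidden size 2.
last_hidden = np.array([[[1.0, 0.0],    # CLS token
                         [0.0, 1.0],    # a real token
                         [9.0, 9.0]]])  # a padding token
attention_mask = np.array([[1, 1, 0]])  # padding is masked out

cls_pooled = last_hidden[:, 0]  # what _post_process_onnx_output does today
mask = attention_mask[..., np.newaxis]
avg_pooled = (last_hidden * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_pooled)  # [[1. 0.]]
print(avg_pooled)  # [[0.5 0.5]] -- padding ignored, real tokens averaged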

But I think it would be better to use average pooling, as the paper does when pretraining the model:

Following the popular biencoder architecture, we use a pre-trained Transformer encoder and average pooling over the output layer to get fixed-size text embeddings E_q and E_p. The score is the cosine similarity scaled by a temperature hyperparameter τ: ...
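
In other words, the paper's relevance score for a query q and passage p is s(q, p) = cos(E_q, E_p) / τ, where E_q and E_p are the average-pooled encoder outputs; that is, the embeddings the model was trained to produce are the mean over tokens, not the CLS vector.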

So, as an alternative, I'm using average pooling by overriding E5OnnxEmbedding:

from typing import Iterable

import numpy as np

from fastembed import TextEmbedding
# E5OnnxEmbedding, OnnxOutputContext, and normalize come from fastembed's
# internal modules; the exact import paths depend on the installed version.

def average_pool(last_hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # Mask out padding tokens, then average the remaining token embeddings.
    mask = attention_mask[..., np.newaxis]
    avg_hidden = (last_hidden_states * mask).sum(axis=1) / mask.sum(axis=1)
    return avg_hidden

class CustomE5OnnxEmbedding(E5OnnxEmbedding):
    ...

    def _post_process_onnx_output(self, output: OnnxOutputContext) -> Iterable[np.ndarray]:
        embeddings, attention_masks = output.model_output, output.attention_mask

        pooled_embeddings = average_pool(embeddings, attention_masks)
        normalized_embeddings = normalize(pooled_embeddings).astype(np.float32)

        return normalized_embeddings

TextEmbedding.EMBEDDINGS_REGISTRY.append(CustomE5OnnxEmbedding)
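
With the custom class appended to the registry, usage is the same as with the stock model (a sketch assuming the lookup resolves "intfloat/multilingual-e5-large" to CustomE5OnnxEmbedding, and using the "query: " prefix the E5 models expect):

model = TextEmbedding(model_name="intfloat/multilingual-e5-large")
embeddings = list(model.embed(["query: how does average pooling work?"]))
print(embeddings[0].shape)  # (1024,) for multilingual-e5-large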

Would you consider changing the pooling method to average pooling?

And separate from this issue: I'm really enjoying FastEmbed, and I appreciate your work on it!

Thanks for your time and consideration!

What Python version are you on? e.g. python --version

  • Python 3.11
  • FastEmbed 0.4.1

Version

0.2.7 (Latest)

What OS are you seeing the problem on?

MacOS

Relevant stack traces and/or logs

No response

ITHwang (Author) commented Dec 18, 2024

Could you share your thoughts on my suggestion?

joein (Member) commented Dec 18, 2024

Hey @ITHwang

Sorry for the delay; it seems that you're right, and it was a mistake on our side.
However, we might need some additional time to fix this because of backward compatibility.
We can't just silently replace one pooling with another, but we'll try to find a proper solution.

Thank you for the kind words about FastEmbed! :)

ITHwang (Author) commented Dec 18, 2024

Thank you for your understanding!
I also think fixing this issue needs careful consideration, as you said.
Looking forward to a good solution that improves the functionality.
After that, I'd be glad to contribute to the solution as a heavy user 😀
