
[PyTorch] Use or instead of and to combine swa mask with existing mask #1271

Closed
wants to merge 1 commit

Conversation

Marks101
Contributor

Description

Hello team,

When creating the sliding window attention mask, the existing mask is currently combined with the SWA mask using a logical and. I think an or should be used here: in this part of the code True means "masked out", so an attention score should be masked out if it is masked out in either the existing mask or the SWA mask.
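
As a minimal illustration (independent of the transformer_engine code; the band mask and padding mask below are hypothetical stand-ins for the SWA mask and the existing mask), combining with a logical or preserves the "True means masked out" convention:

import torch

seqlen, window_left = 6, 1
idx = torch.arange(seqlen)
diff = idx.unsqueeze(1) - idx.unsqueeze(0)  # entry (i, j) = i - j
# hypothetical sliding-window mask (True == masked out): position j is visible
# from position i only if i - window_left <= j <= i
swa_mask = (diff < 0) | (diff > window_left)
# hypothetical padding mask (True == masked out): the last token is padding
padding_mask = torch.tensor([0, 0, 0, 0, 0, 1], dtype=torch.bool).expand(seqlen, seqlen)

# a position is masked out if either mask masks it out
combined = torch.logical_or(swa_mask, padding_mask)
# torch.logical_and(swa_mask, padding_mask) would mark a position as masked
# only if both masks exclude it, which effectively discards the sliding window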

Please have a look and check if this makes sense 😄

Markus

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • Change the logical and to an or when creating the sliding window attention mask

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ksivaman
Member

I think what's happening here is that the actual SWA mask is already created here before the user-provided attention_mask is applied; the attention_mask is then used as an additional filter via the logical_and, so the existing implementation would make sense. @cyanguwa Could you confirm?

@cyanguwa cyanguwa self-assigned this Oct 19, 2024
@Marks101
Contributor Author

Hi @cyanguwa and @ksivaman,

Sorry, I think my description was a little convoluted, so let me explain my point with an example.
For a padding mask as input, the following would be my expectation for get_swa_mask():

>>> import torch
>>> from transformer_engine.pytorch.attention import get_swa_mask
>>> attention_mask = torch.tensor([0, 0, 0, 0, 0, 1], dtype=bool, device="cuda")
>>> get_swa_mask((1, -1), 6, 6, "padding", attention_mask)[0]
tensor([[0, 1, 1, 1, 1, 1],
        [0, 0, 1, 1, 1, 1],
        [1, 0, 0, 1, 1, 1],
        [1, 1, 0, 0, 1, 1],
        [1, 1, 1, 0, 0, 1],
        [1, 1, 1, 1, 0, 1]], device='cuda:0')

But instead, this is currently the output:

>>> get_swa_mask((1, -1), 6, 6, "padding", attention_mask)[0]
tensor([[0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0]], device='cuda:0')

No sliding window at all 🤷
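
For reference, here is a small self-contained sketch that reproduces both tables, assuming (based on the expected output above) that window_size=(1, -1) amounts to attending to the current and the previous token: a logical and gives the current output, a logical or gives the expected one.

import torch

seqlen = 6
idx = torch.arange(seqlen)
diff = idx.unsqueeze(1) - idx.unsqueeze(0)   # entry (i, j) = i - j
swa_mask = (diff < 0) | (diff > 1)           # hypothetical window mask, True == masked out
padding_mask = torch.tensor([0, 0, 0, 0, 0, 1], dtype=torch.bool).expand(seqlen, seqlen)

print(torch.logical_and(swa_mask, padding_mask).int())  # reproduces the current output
print(torch.logical_or(swa_mask, padding_mask).int())   # reproduces the expected output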

@cyanguwa
Collaborator

cyanguwa commented Oct 21, 2024

Hi Markus!

Yes, there is an issue with the logic of get_swa_mask(). I'm fixing it here: #1281. I changed a bit more than just logical_and() to logical_or(), and I would like get_swa_mask() to handle any combination of attn_mask_type and window_size. Could you take a look at that PR and let me know if you have any comments or run into any testing issues? Thanks!

@cyanguwa
Collaborator

Let's handle this issue in that PR. I'll close this one for now. Thanks for always bringing up important issues and changes!

@cyanguwa cyanguwa closed this Oct 21, 2024