Update attention.py #1

onisa-jr · 2024-02-15T10:45:11Z

If the input contains a batch dimension, the current code fails to perform the dot product.

knotgrass · 2024-02-28T06:41:52Z

Thank you for pointing out the issue. Have you written test code for this problem yet?

onisa-jr · 2024-02-28T16:54:53Z

Here the input shape is [b, seq, d_model], and the output is projected back to the same shape after multi-head attention. If you look carefully at the attention static method, it receives tensors of shape [b, h, s, dk], where dk == d_model / n_heads.
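The shape bookkeeping described above can be sketched as follows. This is only an illustration under assumed sizes (b=2, s=5, d_model=16, n_heads=4 are made up for the example) and it skips the learned projections entirely; it just follows the tensor shapes through the split, the scaled dot product, and the merge.

```python
import math

import torch

# Illustrative sizes only (assumptions, not taken from the PR)
b, s, d_model, n_heads = 2, 5, 16, 4
d_k = d_model // n_heads

x = torch.randn(b, s, d_model)                               # [b, seq, d_model]

# Split the model dimension into heads and move the head axis forward
heads = x.reshape(b, s, n_heads, d_k).permute(0, 2, 1, 3)    # [b, h, s, d_k]

# transpose(-2, -1) swaps only the last two axes, so the batch and head
# axes are left alone and the matmul is batched over both of them
scores = (heads @ heads.transpose(-2, -1)) / math.sqrt(d_k)  # [b, h, s, s]
out = scores.softmax(dim=-1) @ heads                         # [b, h, s, d_k]

# Undo the permute before merging heads back into the model dimension
merged = out.permute(0, 2, 1, 3).reshape(b, s, d_model)      # [b, seq, d_model]

print(heads.shape, scores.shape, merged.shape)
```

Indexing the last two axes with negative numbers is what keeps the same code working whether or not extra leading dimensions are present.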

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Multi_head_Attention(nn.Module):
    """
    Multi-head self-attention mechanism.

    Args:
        dim (int): model dimension of the input vectors.
        num_heads (int, optional): Number of attention heads. Defaults to 8.
        dropout (float, optional): Dropout probability. Defaults to 0.1.

    """
    def __init__(self, dim, num_heads=8, dropout=0.1):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.dropout = dropout
        self.dk = dim // num_heads

        # Linear transformations for key, query, and value.
        # Each acts on the per-head dimension dk, so its weights are shared across heads.
        self.key = nn.Linear(self.dk, self.dk, bias=True)
        self.query = nn.Linear(self.dk, self.dk, bias=True)
        self.value = nn.Linear(self.dk, self.dk, bias=True)

        # Final linear transformation
        self.wo = nn.Linear(dim, dim, bias=True)

    @staticmethod
    def attention(query, key, value, mask=None, dropout=0.1):
        d_k = query.shape[-1]
        # Just apply the formula from the paper
        # (batch, h, seq_len, d_k) --> (batch, h, seq_len, seq_len)
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Write a very low value (indicating -inf) to the positions where mask == 0
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1)  # (batch, h, seq_len, seq_len)
        if dropout is not None:
            attention_scores = F.dropout(attention_scores, dropout)
        # (batch, h, seq_len, seq_len) --> (batch, h, seq_len, d_k)
        return attention_scores @ value

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the multi-head attention mechanism.

        Args:
            x (torch.Tensor): Input tensor of shape [batch_size, sequence_length, input_dim].

        Returns:
            torch.Tensor: Output tensor after multi-head attention, shape [batch_size, sequence_length, input_dim].

        """
        # Split into heads: [batch_size, sequence_length, dim] -> [batch_size, num_heads, sequence_length, dk]
        x = x.reshape(x.size(0), x.size(1), self.num_heads, self.dk).permute(0, 2, 1, 3)

        # Compute key, query, and value
        key = self.key(x)
        value = self.value(x)
        query = self.query(x)

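        # Compute scaled dot-product attention; apply dropout only in training mode
        att = Multi_head_Attention.attention(
            query, key, value, dropout=self.dropout if self.training else None
        )

        # Move heads back next to dk before merging: [b, h, s, dk] -> [b, s, h, dk] -> [b, s, dim]
        attention = att.transpose(1, 2).reshape(x.size(0), -1, self.num_heads * self.dk)
        return F.dropout(self.wo(attention), self.dropout, training=self.training)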
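On the question of test code, a check along these lines might be a starting point. It is a sketch, not a test that ships with this PR: `check_self_attention` is a hypothetical helper, and PyTorch's built-in `nn.MultiheadAttention` is used purely as a stand-in so the snippet runs on its own. The helper verifies that the output keeps the [b, seq, dim] shape and that each sample gives the same result alone as inside a batch.

```python
import torch
from torch import nn


def check_self_attention(attn, dim, batch=3, seq_len=5, atol=1e-5):
    """Hypothetical helper: check a [b, s, dim] -> [b, s, dim] callable.

    Asserts that the output shape matches the input shape and that no
    sample's output depends on the other samples in the batch.
    """
    x = torch.randn(batch, seq_len, dim)
    with torch.no_grad():
        out = attn(x)
        assert out.shape == x.shape, f"expected {tuple(x.shape)}, got {tuple(out.shape)}"
        for i in range(batch):
            single = attn(x[i:i + 1])  # same sample, batch of one
            assert torch.allclose(out[i:i + 1], single, atol=atol), (
                f"sample {i} changes when batched"
            )
    return out


# Stand-in module (assumption): the built-in layer, in eval mode so dropout is off
torch.manual_seed(0)
stand_in = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True).eval()
out = check_self_attention(lambda t: stand_in(t, t, t)[0], dim=16)
print(out.shape)
```

To exercise the class above instead, pass an instance of it as `attn`. The batch comparison is only meaningful when dropout is inactive, since random dropout masks would differ between the two calls.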

Update attention.py

7e99a8c

If the input contains a batch dimension, the current code fails to perform the dot product.

knotgrass merged commit 3d54093 into knotgrass:main Oct 1, 2024
