MQA (Multi-Query Attention) Explained: Principles and Code Implementation

Author: @小白创作中心

Source

CSDN

https://blog.csdn.net/xiao_ling_yun/article/details/140846074

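MQA (Multi-Query Attention) is an attention mechanism proposed by a Google team in 2019. It is a variant of MHA (Multi-Head Attention) aimed mainly at autoregressive decoding. Where standard MHA gives every head its own K and V, MQA has all heads share a single K and a single V matrix. This significantly reduces the parameter count and memory footprint and therefore speeds up inference, though it may cost some accuracy. The technique is widely used in large language models such as ChatGLM2.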

MQA vs. MHA

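Standard MHA splits the input into multiple heads and computes attention independently for each head. Q, K, and V each get a different transformation per head; in effect, every head has its own parameter set (its own "receptive field") and can independently learn different features of the input. With a large number of heads this becomes expensive: K and V must be projected separately for every head, and during decoding every head's keys and values must be cached.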

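MQA, by contrast, has all heads share one K and one V matrix, so K and V are each computed only once. Only Q keeps the original multi-head structure, with a different transformation per head. This greatly reduces the number of K and V parameters and the memory taken by the KV cache, which speeds up inference at the cost of some accuracy. The technique is widely used in large language models such as ChatGLM2.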
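
To put rough numbers on these savings, the sketch below counts K/V projection weights and KV-cache bytes for MHA and for MQA. The configuration (hidden size 4096, 32 heads, 32 layers, 2048 cached tokens, 2 bytes per element) is an illustrative assumption rather than the settings of any particular model, the helper names are hypothetical, and biases are ignored.

```python
def kv_projection_params(hidden_size, num_heads, multi_query):
    """Weights in the K and V projections of one attention layer (biases ignored)."""
    head_dim = hidden_size // num_heads
    kv_heads = 1 if multi_query else num_heads
    # K and V each map hidden_size -> kv_heads * head_dim
    return 2 * hidden_size * kv_heads * head_dim


def kv_cache_bytes(hidden_size, num_heads, num_layers, seq_len,
                   batch_size, multi_query, bytes_per_elem=2):
    """Bytes needed to cache K and V for all layers and all cached tokens."""
    head_dim = hidden_size // num_heads
    kv_heads = 1 if multi_query else num_heads
    return (2 * num_layers * batch_size * seq_len
            * kv_heads * head_dim * bytes_per_elem)


# Assumed, illustrative configuration
model = dict(hidden_size=4096, num_heads=32)
run = dict(num_layers=32, seq_len=2048, batch_size=1)

mha_params = kv_projection_params(**model, multi_query=False)
mqa_params = kv_projection_params(**model, multi_query=True)
mha_cache = kv_cache_bytes(**model, **run, multi_query=False)
mqa_cache = kv_cache_bytes(**model, **run, multi_query=True)

print(mha_params, mqa_params)                  # 33554432 1048576
print(mha_cache // 2**20, mqa_cache // 2**20)  # 1024 32  (MiB)
```

Under these assumptions both quantities shrink by a factor equal to the number of heads (32 here). The Q projection is unchanged, so the total parameter count of the layer shrinks by much less than that.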

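In pseudocode, with X as the input, the idea looks like this:

K_shared = W_K * X          # one K projection, shared by all heads
V_shared = W_V * X          # one V projection, shared by all heads
for i in range(num_heads):
    Q_i = W_Q_i * X         # each head keeps its own Q projection
    ...
    ...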
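
As a runnable counterpart to the pseudocode, here is a minimal PyTorch sketch. The tensor sizes and the randomly initialized weights are assumptions made purely for illustration. It runs the explicit per-head loop, then repeats the computation with a size-1 head axis that `torch.matmul` broadcasts, and checks that the two agree.

```python
import math

import torch

torch.manual_seed(0)
B, L, D, h = 2, 5, 16, 4                      # illustrative sizes (assumed)
d = D // h                                    # per-head dimension

x = torch.randn(B, L, D)                      # input hidden states
W_Q = torch.randn(h, D, d) / math.sqrt(D)     # one query projection per head
W_K = torch.randn(D, d) / math.sqrt(D)        # single key projection for all heads
W_V = torch.randn(D, d) / math.sqrt(D)        # single value projection for all heads

# Shared K and V are computed exactly once.
K_shared = x @ W_K                            # (B, L, d)
V_shared = x @ W_V                            # (B, L, d)

# Explicit per-head loop, mirroring the pseudocode.
heads = []
for i in range(h):
    Q_i = x @ W_Q[i]                                           # (B, L, d)
    scores = Q_i @ K_shared.transpose(-1, -2) / math.sqrt(d)   # (B, L, L)
    heads.append(torch.softmax(scores, dim=-1) @ V_shared)     # (B, L, d)
loop_out = torch.cat(heads, dim=-1)                            # (B, L, h*d)

# Same computation, broadcasting K and V over a size-1 head axis.
Q = torch.einsum("bld,hde->bhle", x, W_Q)                      # (B, h, L, d)
scores = Q @ K_shared.unsqueeze(1).transpose(-1, -2) / math.sqrt(d)  # (B, h, L, L)
ctx = torch.softmax(scores, dim=-1) @ V_shared.unsqueeze(1)    # (B, h, L, d)
bcast_out = ctx.permute(0, 2, 1, 3).reshape(B, L, h * d)       # (B, L, h*d)

print(torch.allclose(loop_out, bcast_out, atol=1e-5))
```

The broadcast form avoids the Python loop and never materializes h copies of K or V.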

Implementing MQA

Below is an MQA implementation adapted from the BertSelfAttention source code in Hugging Face's transformers package:

import math

import torch
from torch import nn


class MultiQuerySelfAttention(nn.Module):
    def __init__(self, num_attention_heads, hidden_size):
        super().__init__()
        self.num_attention_heads = num_attention_heads
        self.attention_head_size = hidden_size // num_attention_heads
        self.all_head_size = self.num_attention_heads * self.attention_head_size
 
        # Q keeps one projection per head: hidden_size -> h * d
        self.query = nn.Linear(hidden_size, self.all_head_size)
        # K and V each use a single head-sized projection shared by all heads: hidden_size -> d
        self.key = nn.Linear(hidden_size, self.attention_head_size)
        self.value = nn.Linear(hidden_size, self.attention_head_size)
 
        self.dropout = nn.Dropout(0.1)
 
    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)
 
    def forward(self, hidden_states):
        # hidden_states (B, L, D)
        mixed_query_layer = self.query(hidden_states)
        # query_layer (B, h, L, d)
        # Split the query into heads: [batch_size, num_heads, seq_len, head_dim]
        query_layer = self.transpose_for_scores(mixed_query_layer)
 
        # The key/value parameters are shared by every head, so they are computed only once
        key = self.key(hidden_states)
        # key_layer (B, 1, L, d)
        key_layer = key.unsqueeze(1)
        value = self.value(hidden_states)
        # value_layer (B, 1, L, d)
        value_layer = value.unsqueeze(1)
 
        # key_layer (B, 1, d, L)
        key_layer = key_layer.transpose(-1, -2)
        # Broadcasting: (B, h, L, d) @ (B, 1, d, L) => (B, h, L, d) @ (B, h, d, L) = (B, h, L, L)
        attention_scores = torch.matmul(query_layer, key_layer)
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)
        attention_probs = self.dropout(attention_probs)
        # Broadcasting: (B, h, L, L) @ (B, 1, L, d) => (B, h, L, L) @ (B, h, L, d) = (B, h, L, d)
        context_layer = torch.matmul(attention_probs, value_layer)
        # (B, h, L, d) => (B, L, h, d)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        # (B, L, h, d) => (B, L, h*d) = (B, L, D)
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)
        return context_layer
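
The class above attends over a full sequence in one pass and applies no attention mask. Since MQA is aimed mainly at autoregressive decoding, the sketch below shows how a shared KV cache might look when tokens are generated one at a time. The function name `mqa_decode_step`, the dict-based cache layout, and the random weights are all hypothetical choices for illustration; none of them come from the transformers library.

```python
import math

import torch


def mqa_decode_step(x_t, W_Q, W_K, W_V, cache):
    """One decoding step of multi-query attention with a shared KV cache.

    x_t:      (B, 1, D) hidden state of the newest token
    W_Q:      (D, h*d)  query projection covering all h heads
    W_K, W_V: (D, d)    single key/value projection shared by all heads
    cache:    dict holding "k" and "v" of shape (B, 1, T, d); empty on step 0
    """
    B = x_t.shape[0]
    d = W_K.shape[1]
    h = W_Q.shape[1] // d

    q = (x_t @ W_Q).view(B, 1, h, d).permute(0, 2, 1, 3)   # (B, h, 1, d)
    k_t = (x_t @ W_K).unsqueeze(1)                          # (B, 1, 1, d)
    v_t = (x_t @ W_V).unsqueeze(1)                          # (B, 1, 1, d)

    # Append the new key/value. The cache stores ONE head instead of h.
    cache["k"] = k_t if "k" not in cache else torch.cat([cache["k"], k_t], dim=2)
    cache["v"] = v_t if "v" not in cache else torch.cat([cache["v"], v_t], dim=2)

    # The newest token attends to every cached position, itself included.
    scores = q @ cache["k"].transpose(-1, -2) / math.sqrt(d)   # (B, h, 1, T)
    probs = torch.softmax(scores, dim=-1)
    out = probs @ cache["v"]                                    # (B, h, 1, d)
    return out.permute(0, 2, 1, 3).reshape(B, 1, h * d)         # (B, 1, h*d)


torch.manual_seed(0)
B, D, h = 2, 16, 4                            # illustrative sizes (assumed)
d = D // h
W_Q = torch.randn(D, h * d) / math.sqrt(D)
W_K = torch.randn(D, d) / math.sqrt(D)
W_V = torch.randn(D, d) / math.sqrt(D)

cache = {}
for step in range(3):
    x_t = torch.randn(B, 1, D)                # stand-in for the next token's hidden state
    y_t = mqa_decode_step(x_t, W_Q, W_K, W_V, cache)

print(y_t.shape)          # torch.Size([2, 1, 16])
print(cache["k"].shape)   # torch.Size([2, 1, 3, 4])
```

After three steps the cache holds three positions for a single head, `(B, 1, T, d)`. An MHA cache at the same point would be `(B, h, T, d)`, which is where the memory savings during decoding come from.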