A Detailed Look at Grouped Query Attention (GQA), an Attention Mechanism Common in Large Models, with a PyTorch Implementation

Author: Max Shap 2024-04-03 14:31:08

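Grouped Query Attention (GQA) is a method for large language models that interpolates between multi-query attention (MQA) and multi-head attention (MHA). Its goal is to reach MHA-level quality while keeping speed close to that of MQA.

In this article we explain the idea behind GQA and how to turn it into code.

GQA was proposed in the paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. It is a fairly simple and clean idea that builds on multi-head attention.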

GQA

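A standard multi-head attention (MHA) layer consists of H query heads, H key heads, and H value heads, each with D dimensions. In PyTorch the code looks like this: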

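import torch
from torch.nn.functional import scaled_dot_product_attention

# shapes: (batch_size, seq_len, num_heads, head_dim)
query = torch.randn(1, 256, 8, 64)
key = torch.randn(1, 256, 8, 64)
value = torch.randn(1, 256, 8, 64)

# scaled_dot_product_attention expects (batch_size, num_heads, seq_len, head_dim),
# so swap seq_len and num_heads before the call and swap them back afterwards
output = scaled_dot_product_attention(
    query.transpose(1, 2), key.transpose(1, 2), value.transpose(1, 2)
).transpose(1, 2)
print(output.shape) # torch.Size([1, 256, 8, 64])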

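Every query head has its own corresponding key head and value head. The figure below illustrates this:

GQA instead splits the query heads into G groups, and each group shares a single key head and value head. This can be depicted as: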

The visual representation makes it very clear how GQA works. As noted above, GQA is a fairly simple and clean idea.
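To make the grouping concrete, here is a minimal sketch of which key/value head each query head uses. It assumes that query heads are assigned to groups in contiguous blocks and that the number of query heads is divisible by the number of key/value heads. The helper name `kv_head_for_query_head` is hypothetical and chosen only for this illustration.

```python
def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head shared by the given query head.

    Assumes contiguous groups and num_query_heads divisible by num_kv_heads.
    """
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# 8 query heads, 2 key/value heads: two groups of 4 query heads each
gqa_mapping = [kv_head_for_query_head(q, 8, 2) for q in range(8)]
print(gqa_mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]

# G = 1 means every query head shares one key/value head, which is MQA
mqa_mapping = [kv_head_for_query_head(q, 8, 1) for q in range(8)]
print(mqa_mapping)  # [0, 0, 0, 0, 0, 0, 0, 0]

# G = H means every query head has its own key/value head, which is MHA
mha_mapping = [kv_head_for_query_head(q, 8, 8) for q in range(8)]
print(mha_mapping)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The two extremes show why GQA is described as an interpolation: one group recovers MQA, and as many groups as query heads recovers MHA.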

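PyTorch implementation

Let's write code that splits the query heads into G groups, with each group sharing one key and value. The einops library lets us express complex tensor operations concisely.

First, define the query, key, and value, then set the number of attention heads. The counts are arbitrary as long as num_heads_for_query % num_heads_for_key == 0, that is, the number of query heads must be divisible by the number of key heads. We define them as follows: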

import torch

# shapes: (batch_size, seq_len, num_heads, head_dim)
query = torch.randn(1, 256, 8, 64)
key = torch.randn(1, 256, 2, 64)
value = torch.randn(1, 256, 2, 64)

# number of query heads that share each key/value head
num_head_groups = query.shape[2] // key.shape[2]
print(num_head_groups) # 4: each group is of size 4 since there are 2 kv_heads
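The divisibility requirement can also be checked explicitly before doing any computation. The following is a small sketch, assuming tensors laid out as (batch_size, seq_len, num_heads, head_dim); the helper name `heads_per_group` is hypothetical and not part of PyTorch or einops.

```python
import torch


def heads_per_group(query: torch.Tensor, key: torch.Tensor) -> int:
    """Return how many query heads share each key/value head.

    Assumes both tensors are shaped (batch_size, seq_len, num_heads, head_dim).
    """
    num_query_heads, num_kv_heads = query.shape[2], key.shape[2]
    if num_query_heads % num_kv_heads != 0:
        raise ValueError(
            f"{num_query_heads} query heads cannot be split evenly "
            f"across {num_kv_heads} key/value heads"
        )
    return num_query_heads // num_kv_heads


query = torch.randn(1, 256, 8, 64)
key = torch.randn(1, 256, 2, 64)
print(heads_per_group(query, key))  # 4
```

Failing early with a clear message is usually easier to debug than the shape error that the later reshaping step would otherwise raise.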

For efficiency, swap the seq_len and num_heads dimensions. With einops this is as simple as:

from einops import rearrange

query = rearrange(query, "b n h d -> b h n d")
key = rearrange(key, "b s h d -> b h s d")
value = rearrange(value, "b s h d -> b h s d")

Next we need to introduce the notion of "grouping" into the query tensor.

from einops import rearrange
query = rearrange(query, "b (h g) n d -> b g h n d", g=num_head_groups)
print(query.shape) # torch.Size([1, 4, 2, 256, 64])

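In the code above we reshape one dimension into two: for the tensors we defined, the original dimension of size 8 (the number of query heads) is split into 2 groups (matching the number of heads in the key and value), each of size 4.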

Finally, the hardest part is computing the attention scores. It can actually be done in a single line with an einsum operation:

from einops import einsum, rearrange
# g stands for the number of query heads in each group
# h stands for the number of key/value heads
# d stands for the head dimension
# n and s are equal and stand for the sequence length

scores = einsum(query, key, "b g h n d, b h s d -> b h n s")
print(scores.shape) # torch.Size([1, 2, 256, 256])

The scores tensor has the same number of heads (2) as the value tensor above. Let's look at what actually happens.

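einsum does two things for us:

1. A matrix multiplication of the query and the key. In our example these tensors have shapes (1, 4, 2, 256, 64) and (1, 2, 256, 64), so a matrix multiplication over the last two dimensions, with the key transposed and broadcast across g, yields (1, 4, 2, 256, 256).

2. A sum over the second dimension (dimension g). einsum does this automatically when a dimension is omitted from the specified output pattern. The sum brings the number of heads in line with the key and value.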

The last step is the standard multiplication of the attention scores with the values:

import torch.nn.functional as F

scale = query.size(-1) ** 0.5
attention = F.softmax(scores / scale, dim=-1)

# here we do just a standard matrix multiplication
out = einsum(attention, value, "b h n s, b h s d -> b h n d")

# finally, just reshape back to (batch_size, seq_len, num_kv_heads, head_dim)
out = rearrange(out, "b h n d -> b n h d")
print(out.shape) # torch.Size([1, 256, 2, 64])

That completes the simplest GQA implementation, in fewer than 16 lines of Python code: