A Roundup of the Four Most Commonly Used Language Model Compression Techniques
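Can you make a large language model (LLM) smaller without sacrificing performance? Although people are always drawn to ever-larger language models, MistralAI has shown us that the importance of size is relative, and the growing interest in edge computing pushes us to get decent results from small language models. Another route is compression. In this article, I explain these techniques and provide some simple code snippets as examples.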
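Model compression is the practice of minimizing the size of a machine learning model without harming its effectiveness. It works on large neural networks because they are often over-parameterized and therefore contain redundant computational units.
Compression means reducing the number of parameters or the overall memory footprint, which yields a smaller model (for example, from 10 GB down to 9 GB). This makes models more efficient in terms of storage and inference speed, and easier to deploy in resource-constrained environments. Common model compression techniques include: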
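- Quantization: reduces the memory footprint by changing the precision of the model weights (for example, from 32-bit floating point to 8-bit integers).
- Pruning: removes less important weights or neurons, reducing the number of parameters.
- Knowledge distillation: trains a smaller model (the student) to mimic a larger model (the teacher), distilling its knowledge into a compressed version with similar performance.
- Weight sharing: uses shared weights across different layers to reduce storage requirements, either by design or applied after training.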
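To make the effect of precision on size concrete, here is a minimal back-of-the-envelope sketch. The 7-billion-parameter count is a hypothetical example, and the estimate covers weight storage only (no activations, optimizer state, or file-format overhead).

```python
# Bytes needed to store a single weight at each precision
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def model_size_gb(num_params: int, precision: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


if __name__ == "__main__":
    num_params = 7_000_000_000  # hypothetical 7B-parameter model
    for precision in BYTES_PER_WEIGHT:
        print(f"{precision}: {model_size_gb(num_params, precision):.1f} GB")
```

Halving the bit width halves the storage, which is why quantization is usually the first technique people reach for.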
Model Quantization
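Model quantization compresses an LLM by converting the representation of its weights or activations (typically 32-bit or 16-bit) to a lower-precision one (for example, 8-bit, 4-bit, or even binary). We can quantize the weights, quantize the activations, or apply other tricks: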
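- Weight quantization: the weights of a neural network are usually stored as 32-bit or 16-bit floating-point numbers. Quantization reduces them to a lower bit width, such as 8-bit integers (INT8) or 4-bit integers (INT4). This is done by mapping the original weight range onto a smaller range that needs fewer bits, which significantly reduces memory usage.
- Activation quantization: like the weights, the activations (the outputs of layers during inference) can be quantized to lower precision. Representing activations with fewer bits reduces the model's memory footprint during inference.
- Quantization-aware training (QAT): in QAT, the model is trained while quantization is simulated, which lets it adapt to the lower precision. This helps preserve accuracy because the model learns to be more robust to quantization effects (see the work of Tailor et al. on arXiv).
- Post-training quantization (PTQ): this approach trains the model normally at full precision and applies quantization afterwards. PTQ is simpler and faster, but it can cause a larger drop in accuracy than QAT (see the work of Wang et al. at NIPS 2021).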
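The range mapping mentioned under weight quantization can be sketched in a few lines of plain Python. This is one simple scheme (symmetric "absmax" scaling to INT8) chosen for illustration, not necessarily what any particular library does internally; the sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric (absmax) quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # The scale is the float value represented by one integer step
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.42, -1.27, 0.05, 0.9, -0.33]  # hypothetical weights
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    print("quantized:", quantized)
    print("max round-trip error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is now stored as one byte plus a single shared scale, and the round-trip error is bounded by half a quantization step. That error is what QAT teaches the model to tolerate and what PTQ simply accepts.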
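Weight quantization is easy to implement with bitsandbytes. Install the libraries (accelerate is needed for the device_map option used below):
pip install torch transformers bitsandbytes accelerate
For example, run the following code for GPT2: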
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
# Specify the model you want to use
model_name = "gpt2"  # You can replace this with any other LLM
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model with 8-bit quantization using bitsandbytes
# (recent transformers releases expect this flag inside a BitsAndBytesConfig)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # Enable 8-bit quantization
    device_map="auto",  # Automatically place the model on the available device
)
# Example text for inference
input_text = "Weight Quantization is an efficient technique for compressing language models."
# Move the inputs to the device the model was placed on instead of hard-coding "cuda"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
# Generate text
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=50)
# Decode and print the generated text
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
Pruning
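Pruning removes unnecessary or less important weights, neurons, or entire layers, much like removing unneeded branches from a tree. This shrinks the model, speeds up inference, and lowers memory and compute requirements, making the model more efficient while preserving as much of its original performance as possible.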
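This is less straightforward than quantization, because we first need to find what is redundant. For example, we need to identify the redundant parameters and then fine-tune the model without them.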
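Most commonly we remove weights, neurons, or layers, but there is growing interest in attention head pruning (specific to Transformer-based models) as a form of structured pruning (see the work of Wang et al. on arXiv). Each attention layer has multiple heads. Some heads contribute more to model performance than others, so attention head pruning removes the less important ones.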
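As a rough illustration of the selection step in attention head pruning, the sketch below picks the least important heads from a table of per-head scores. The scores are hypothetical placeholders: how importance is actually measured (for example, the change in loss when a head is masked) is outside the scope of this sketch, and the function name is made up for illustration.

```python
def select_heads_to_prune(importance, fraction):
    """Return {layer_index: [head indices]} for the globally least important heads.

    importance: list of per-layer lists holding one score per attention head.
    fraction: share of all heads to remove, between 0.0 and 1.0.
    At least one head is always kept in every layer.
    """
    scored = sorted(
        (score, layer, head)
        for layer, heads in enumerate(importance)
        for head, score in enumerate(heads)
    )
    budget = int(len(scored) * fraction)
    plan = {}
    for _, layer, head in scored:
        if budget == 0:
            break
        chosen = plan.setdefault(layer, [])
        # Never remove the last remaining head of a layer
        if len(chosen) < len(importance[layer]) - 1:
            chosen.append(head)
            budget -= 1
    return {layer: sorted(heads) for layer, heads in plan.items() if heads}


if __name__ == "__main__":
    # Hypothetical importance scores for 2 layers with 4 heads each
    scores = [[0.9, 0.1, 0.5, 0.3], [0.2, 0.8, 0.05, 0.7]]
    print(select_heads_to_prune(scores, fraction=0.5))
```

The resulting mapping says which heads to drop in each layer. Because whole heads are removed, the remaining matrices stay dense, which is what makes this pruning structured.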
Sample pruning code might look like the following, where we remove a percentage of the weights from a GPT2 model:
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.pytorch_utils import Conv1D
# GPT2 implements its attention and MLP projections as Conv1D modules rather than
# torch.nn.Linear, so both types are treated as prunable here
PRUNABLE_TYPES = (torch.nn.Linear, Conv1D)
# Load the pretrained model and tokenizer
model_name = "gpt2"  # You can replace this with any other LLM
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Define a pruning method (here we use L1 unstructured pruning)
def prune_model_layer(layer, amount=0.3):
    # Zero out the given fraction of weights with the smallest absolute value
    for name, module in layer.named_modules():
        if isinstance(module, PRUNABLE_TYPES):
            prune.l1_unstructured(module, name="weight", amount=amount)
            print(f"Pruned layer {name} with amount {amount}")
# Apply pruning to all transformer layers in the model
for layer in model.transformer.h:
    prune_model_layer(layer, amount=0.3)  # Prune 30% of the weights
# Check the sparsity of the pruned transformer blocks
total_params = 0
pruned_params = 0
for name, module in model.transformer.h.named_modules():
    if isinstance(module, PRUNABLE_TYPES):
        total_params += module.weight.nelement()
        pruned_params += torch.sum(module.weight == 0).item()
print(f"Total parameters: {total_params}")
print(f"Pruned parameters: {pruned_params}")
print(f"Sparsity: {pruned_params / total_params:.2%}")
# Test the pruned model on a sample input
input_text = "Pruning is an effective way to compress language models."
inputs = tokenizer(input_text, return_tensors="pt")
# Generate text using the pruned model
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=50)
# Decode and print the generated text
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
Model Distillation
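Model distillation is a technique for transferring "knowledge" from a large, more complex model (the teacher) to a smaller, simpler model (the student) that has fewer parameters. It lets the student approach the teacher's performance while being smaller or faster, as promised at the start.
The process starts with a large, pre-trained LLM that serves as the teacher, such as GPT2 or Llama. This model is usually very accurate but needs substantial computational resources for inference.
A smaller, more efficient model (the student), such as miniGPT2 or TinyLlama, is trained to mimic the teacher's behavior (although TinyLlama was built in a different way). The student learns from both the original training data and the outputs generated by the teacher (soft labels).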
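To see what "soft labels" means in practice, here is a small standard-library sketch of temperature-scaled softmax and the KL divergence between teacher and student distributions. The logits are made-up numbers for a three-token vocabulary, used only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    teacher_logits = [4.0, 2.0, 0.5]  # hypothetical teacher scores
    student_logits = [3.0, 2.5, 0.2]  # hypothetical student scores
    for temperature in (1.0, 2.0):
        soft_labels = softmax(teacher_logits, temperature)
        student_probs = softmax(student_logits, temperature)
        print(f"T={temperature}: soft labels={[round(p, 3) for p in soft_labels]}, "
              f"KL={kl_divergence(soft_labels, student_probs):.4f}")
```

Raising the temperature spreads probability mass onto the less likely tokens, so the student also learns how the teacher ranks the alternatives rather than only its top choice.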
Below is a Python example of the teacher-student interaction, starting from a GPT2 teacher:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch.nn.functional as F
# Use a GPU when one is available instead of hard-coding "cuda"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the teacher (large) and student (smaller) models
teacher_model_name = "gpt2"  # You can replace this with any large LLM
# Assumption: "sshleifer/tiny-gpt2" is used here as the Hub id of a tiny GPT-2 variant;
# any smaller model that shares the teacher's vocabulary would do
student_model_name = "sshleifer/tiny-gpt2"
# Load the teacher model and tokenizer (the teacher stays frozen in eval mode)
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_model_name).to(device).eval()
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_model_name)
# Load the student model and tokenizer (the student is the model being trained)
student_model = AutoModelForCausalLM.from_pretrained(student_model_name).to(device)
student_model.train()
student_tokenizer = AutoTokenizer.from_pretrained(student_model_name)
# GPT-2 tokenizers ship without a padding token, which padding="max_length" requires
teacher_tokenizer.pad_token = teacher_tokenizer.eos_token
student_tokenizer.pad_token = student_tokenizer.eos_token
# Load a dataset for training (e.g., Wikitext for language modeling)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
# Set training parameters
learning_rate = 5e-5
epochs = 3
optimizer = torch.optim.AdamW(student_model.parameters(), lr=learning_rate)
# Set temperature for softening probabilities
temperature = 2.0
alpha = 0.5  # Weighting factor for combining loss functions
# Training loop for knowledge distillation
for epoch in range(epochs):
    for i, example in enumerate(dataset):
        # Get the input text
        input_text = example["text"]
        # Skip empty lines
        if not input_text.strip():
            continue
        # Tokenize the input text for the teacher and student models
        # (comparing logits position by position assumes both tokenizers agree)
        teacher_inputs = teacher_tokenizer(input_text, return_tensors="pt", truncation=True, padding="max_length", max_length=32).to(device)
        student_inputs = student_tokenizer(input_text, return_tensors="pt", truncation=True, padding="max_length", max_length=32).to(device)
        # Only real (non-padding) positions contribute to the losses
        mask = student_inputs["attention_mask"].bool()
        # Next-token prediction needs at least two real tokens
        if mask.sum() < 2:
            continue
        # Get teacher predictions (soft labels)
        with torch.no_grad():
            teacher_outputs = teacher_model(**teacher_inputs)
            teacher_logits = teacher_outputs.logits / temperature
            teacher_probs = F.softmax(teacher_logits, dim=-1)
        # Get student predictions
        student_outputs = student_model(**student_inputs)
        student_logits = student_outputs.logits
        # Calculate distillation loss (Kullback-Leibler divergence), averaged over real tokens
        distillation_loss = F.kl_div(
            input=F.log_softmax(student_logits[mask] / temperature, dim=-1),
            target=teacher_probs[mask],
            reduction="batchmean",
            log_target=False
        ) * (temperature ** 2)
        # Calculate student task loss (Cross-Entropy with true labels)
        # Position t predicts token t + 1, so logits and labels are shifted by one
        target_labels = student_inputs["input_ids"].masked_fill(~mask, -100)
        shift_logits = student_logits[:, :-1, :]
        shift_labels = target_labels[:, 1:]
        task_loss = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)), shift_labels.reshape(-1), ignore_index=-100)
        # Combined loss
        loss = alpha * distillation_loss + (1 - alpha) * task_loss
        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Print training progress
        if i % 100 == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Step [{i}], Loss: {loss.item():.4f}")
print("Knowledge distillation completed!")
Weight Sharing
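By sharing parameters across several model components, we can reduce a neural network's memory footprint. When some or all layers share the same set of weights instead of each layer or component having its own, the number of parameters the model must keep drops substantially. The architecture can be defined with shared weights up front, or weight sharing can be applied after training as a model compression technique. For example, one option is to cluster the weights, as in the following code: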
import torch
import numpy as np
from sklearn.cluster import KMeans
def apply_weight_sharing(model, num_clusters=16):
    # Iterate through each parameter in the model
    for name, param in model.named_parameters():
        # Only consider trainable parameters with enough values to form the clusters
        if param.requires_grad and param.numel() >= num_clusters:
            # Flatten the weights into a column vector for clustering
            weights = param.data.cpu().numpy().reshape(-1, 1)
            # Apply k-means clustering (this can be slow for large tensors such as embeddings)
            kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
            kmeans.fit(weights)
            # Replace weights with their corresponding cluster centroids
            cluster_centroids = kmeans.cluster_centers_
            labels = kmeans.labels_
            # Map the original weights to their shared values with a vectorized lookup
            shared_weights = cluster_centroids[labels].reshape(param.data.shape)
            # Update the model's parameters with the shared weights
            param.data = torch.from_numpy(shared_weights).to(dtype=param.dtype, device=param.device)
    return model
# Example usage with a pre-trained model
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("gpt2")
model = apply_weight_sharing(model, num_clusters=16)  # Apply weight sharing with 16 clusters
# Note: the tensors still hold one float per weight; the saving only materializes
# when each tensor is stored as a small codebook plus per-weight cluster indices
print("Weight sharing applied to the model!")
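Clustering alone leaves every tensor at its original size, since each weight is still written out as a full float. The saving appears once a tensor is stored as a codebook of centroids plus one small index per weight. The sketch below estimates that size; the function name and the 768 x 3072 matrix shape are hypothetical choices for illustration.

```python
def shared_weight_size_bytes(num_weights, num_clusters, bytes_per_value=4):
    """Bytes to store a clustered tensor as a codebook plus bit-packed indices."""
    # Bits needed to address one of num_clusters centroids (4 bits for 16 clusters)
    bits_per_index = max(1, (num_clusters - 1).bit_length())
    index_bytes = (num_weights * bits_per_index + 7) // 8  # round up to whole bytes
    codebook_bytes = num_clusters * bytes_per_value
    return index_bytes + codebook_bytes


if __name__ == "__main__":
    num_weights = 768 * 3072  # hypothetical projection matrix
    dense_bytes = num_weights * 4  # one 32-bit float per weight
    shared_bytes = shared_weight_size_bytes(num_weights, num_clusters=16)
    print(f"dense:  {dense_bytes} bytes")
    print(f"shared: {shared_bytes} bytes")
    print(f"compression ratio: {dense_bytes / shared_bytes:.2f}x")
```

With 16 clusters, each weight needs only a 4-bit index, so storage drops to roughly one eighth of the 32-bit original, at the cost of restricting every weight to one of 16 values.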