Hands-On ViT with PaddlePaddle: Implementing Self-Attention for Image Classification from Scratch
In deep learning, the Transformer architecture has successfully crossed over from natural language processing into computer vision. The Vision Transformer (ViT), a landmark result of this crossover, is changing how we think about image processing tasks. This article walks you through implementing a ViT model from scratch with the PaddlePaddle framework, using hands-on code to build a deep understanding of how self-attention works.
1. Environment Setup and Data Preprocessing
Before building the ViT model, we need a working development environment. PaddlePaddle, Baidu's open-source deep learning framework, provides a rich set of APIs for implementing Transformer models.
First, install the required libraries:
```bash
pip install paddlepaddle-gpu==2.4.0
pip install numpy pillow
```

For image data, ViT uses a distinctive preprocessing step: the image is split into fixed-size patches. This differs fundamentally from the way a traditional CNN processes the whole image. Here is a concrete image preprocessing example:
```python
import paddle
import numpy as np
from PIL import Image

def load_and_preprocess_image(image_path, patch_size=16):
    # Load the image and resize it
    img = Image.open(image_path).convert('RGB')
    img = img.resize((224, 224))  # standard ViT input size

    # Convert to a numpy array and normalize
    img_array = np.array(img).astype('float32') / 255.0
    img_array = img_array.transpose([2, 0, 1])  # to CHW format

    # Split into patches
    patches = []
    for i in range(0, 224, patch_size):
        for j in range(0, 224, patch_size):
            patch = img_array[:, i:i+patch_size, j:j+patch_size]
            patches.append(patch)

    return paddle.to_tensor(np.array(patches))
```

Tip: ViT typically uses 16×16 patches, so a 224×224 image is split into 196 patches (224/16 = 14, and 14×14 = 196).
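To sanity-check the shapes, you can run the helper on any local image (the file name below is just a placeholder):

```python
# Hypothetical example file; substitute any RGB image you have on disk
patches = load_and_preprocess_image('example.jpg')
print(patches.shape)  # [196, 3, 16, 16]: 196 patches of 16x16 pixels, 3 channels each
```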
2. Implementing Patch Embedding
One of ViT's core innovations is treating an image as a sequence of patches. This step is handled by the Patch Embedding layer:
```python
import paddle.nn as nn

class PatchEmbedding(nn.Layer):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2
        # A convolution performs patch splitting and embedding in one step
        self.proj = nn.Conv2D(
            in_channels=in_channels,
            out_channels=embed_dim,
            kernel_size=patch_size,
            stride=patch_size
        )

    def forward(self, x):
        # x shape: [batch_size, channels, height, width]
        x = self.proj(x)            # [batch_size, embed_dim, num_patches_h, num_patches_w]
        x = x.flatten(2)            # [batch_size, embed_dim, num_patches]
        x = x.transpose([0, 2, 1])  # [batch_size, num_patches, embed_dim]
        return x
```

This implementation has a few key points (a quick shape check follows the list):
- A convolution layer performs patch splitting and linear projection in a single operation
- The output shape is [batch_size, num_patches, embed_dim]
- embed_dim is the dimensionality of the vector each patch is mapped to
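As a quick sanity check, here is a minimal sketch that feeds a random batch through the layer and verifies the output shape (the tensor values are arbitrary):

```python
# Minimal shape check with random input; values are arbitrary
x = paddle.randn([2, 3, 224, 224])  # a batch of 2 fake images
patch_embed = PatchEmbedding()
out = patch_embed(x)
print(out.shape)  # [2, 196, 768]: 196 patches, each embedded into 768 dimensions
```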
3. Self-Attention Mechanism: Explanation and Implementation
Self-attention is the core component of the Transformer, and understanding its implementation is essential to mastering ViT. The class below implements scaled dot-product attention; note that it already splits the embedding into multiple heads internally:
```python
class Attention(nn.Layer):
    def __init__(self, embed_dim, num_heads=8, qkv_bias=False, dropout=0.):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        # One linear layer computes Q, K, and V together
        self.qkv = nn.Linear(embed_dim, embed_dim * 3, bias_attr=qkv_bias)
        self.attn_drop = nn.Dropout(dropout)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.proj_drop = nn.Dropout(dropout)

    def forward(self, x):
        B, N, C = x.shape  # batch_size, num_patches, embed_dim

        # Compute Q, K, V
        qkv = self.qkv(x).reshape([B, N, 3, self.num_heads, self.head_dim])
        qkv = qkv.transpose([2, 0, 3, 1, 4])
        q, k, v = qkv[0], qkv[1], qkv[2]  # each of shape [B, num_heads, N, head_dim]

        # Compute attention scores
        attn = (q @ k.transpose([0, 1, 3, 2])) * self.scale
        attn = nn.functional.softmax(attn, axis=-1)
        attn = self.attn_drop(attn)

        # Apply the attention weights to V
        x = (attn @ v).transpose([0, 2, 1, 3]).reshape([B, N, C])
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
```

Note: the scaling factor is key to stable self-attention training. It equals 1/√(head_dim) and keeps the dot products from growing so large that softmax saturates and its gradients vanish.
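To make the effect of the scaling factor concrete, here is a small standalone sketch: the dot product of two random d-dimensional vectors with unit-variance entries has standard deviation about √d, and multiplying by 1/√d brings it back to roughly unit scale:

```python
import numpy as np

d = 64                             # head_dim
q = np.random.randn(10000, d)
k = np.random.randn(10000, d)
dots = (q * k).sum(axis=1)         # raw dot products
print(dots.std())                  # ~sqrt(64) = 8: large enough to saturate softmax
print((dots * d ** -0.5).std())    # ~1 after scaling by 1/sqrt(head_dim)
```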
To build intuition for self-attention, we can visualize the attention weights:
```python
import matplotlib.pyplot as plt

def visualize_attention(image, attention_weights):
    plt.figure(figsize=(10, 10))

    # Original image
    plt.subplot(1, 2, 1)
    plt.imshow(image)
    plt.title("Original Image")

    # Attention heatmap, averaged over heads
    plt.subplot(1, 2, 2)
    plt.imshow(attention_weights.mean(axis=0), cmap='hot')
    plt.title("Attention Heatmap")
    plt.colorbar()
    plt.show()
```

4. Multi-Head Attention and the Transformer Encoder
Multi-head attention captures different kinds of feature relationships by running several attention heads in parallel; the Attention class above already performs this head splitting. The wrapper below adds layer normalization and a residual connection around it:
```python
class MultiHeadAttention(nn.Layer):
    def __init__(self, embed_dim, num_heads=8, qkv_bias=False, dropout=0.):
        super().__init__()
        self.attention = Attention(embed_dim, num_heads, qkv_bias, dropout)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Pre-norm attention with a residual connection
        h = x
        x = self.norm(x)
        x = self.attention(x)
        x = x + h
        return x
```

A complete Transformer encoder layer also includes a feed-forward network with its own layer normalization:
```python
class TransformerEncoderLayer(nn.Layer):
    def __init__(self, embed_dim, num_heads, mlp_ratio=4., dropout=0.):
        super().__init__()
        # MultiHeadAttention already applies pre-norm and the residual connection,
        # so the attention path needs no extra norm here
        self.attn = MultiHeadAttention(embed_dim, num_heads, qkv_bias=False, dropout=dropout)
        self.norm2 = nn.LayerNorm(embed_dim)

        # MLP block
        hidden_dim = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        # Self-attention block (norm and residual handled inside MultiHeadAttention)
        x = self.attn(x)

        # Feed-forward block
        h = x
        x = self.norm2(x)
        x = self.mlp(x)
        x = x + h
        return x
```

5. Building the Complete ViT Model
Now we can assemble all the components into a complete ViT model:
```python
class VisionTransformer(nn.Layer):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, num_heads=12, depth=12, num_classes=1000,
                 mlp_ratio=4., dropout=0.):
        super().__init__()
        # Patch Embedding
        self.patch_embed = PatchEmbedding(image_size, patch_size, in_channels, embed_dim)

        # Class token and position embedding
        self.cls_token = paddle.create_parameter(
            shape=[1, 1, embed_dim],
            dtype='float32',
            default_initializer=nn.initializer.Constant(0.)
        )
        num_patches = self.patch_embed.num_patches
        self.pos_embed = paddle.create_parameter(
            shape=[1, num_patches + 1, embed_dim],
            dtype='float32',
            default_initializer=nn.initializer.TruncatedNormal(std=.02)
        )
        self.pos_drop = nn.Dropout(dropout)

        # Transformer encoder
        self.blocks = nn.LayerList([
            TransformerEncoderLayer(embed_dim, num_heads, mlp_ratio, dropout)
            for _ in range(depth)
        ])

        # Classification head
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]

        # Patch Embedding
        x = self.patch_embed(x)  # [B, num_patches, embed_dim]

        # Prepend the class token
        cls_tokens = self.cls_token.expand([B, -1, -1])
        x = paddle.concat([cls_tokens, x], axis=1)

        # Add position embeddings
        x = x + self.pos_embed
        x = self.pos_drop(x)

        # Pass through the Transformer encoder
        for blk in self.blocks:
            x = blk(x)

        # Classification
        x = self.norm(x)
        cls_token = x[:, 0]  # use the class token as the image representation
        x = self.head(cls_token)
        return x
```

This implementation contains all of ViT's key components (a forward-pass check follows the list):
- The Patch Embedding layer
- A learnable class token
- Position embeddings
- A stack of Transformer encoder layers
- A classification head
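To confirm everything wires together, here is a minimal sketch that instantiates the model and runs random data through it (num_classes=10 is an arbitrary choice for the check):

```python
# Forward-pass sanity check with random data; num_classes=10 is arbitrary
model = VisionTransformer(num_classes=10)
x = paddle.randn([2, 3, 224, 224])
logits = model(x)
print(logits.shape)  # [2, 10]: one logit vector per image
```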
6. Model Training and Visualization Analysis
Training a ViT requires particular care with the learning rate and the choice of optimizer:
```python
def train_vit(model, train_loader, val_loader, epochs=50, lr=1e-4):
    # Loss function, learning rate schedule, and optimizer.
    # The scheduler must be passed to the optimizer as its learning_rate,
    # otherwise stepping it has no effect on the parameter updates.
    criterion = nn.CrossEntropyLoss()
    scheduler = paddle.optimizer.lr.CosineAnnealingDecay(
        learning_rate=lr,
        T_max=epochs,
        verbose=True
    )
    optimizer = paddle.optimizer.AdamW(
        learning_rate=scheduler,
        parameters=model.parameters(),
        weight_decay=0.05
    )

    for epoch in range(epochs):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.clear_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
        scheduler.step()  # step once per epoch (T_max is measured in epochs)

        # Validation phase
        model.eval()
        val_loss = 0
        correct = 0
        with paddle.no_grad():
            for data, target in val_loader:
                output = model(data)
                val_loss += criterion(output, target).item()
                pred = output.argmax(axis=1)
                correct += (pred == target).sum().item()

        val_loss /= len(val_loader)
        val_acc = 100. * correct / len(val_loader.dataset)
        print(f'Epoch {epoch+1}/{epochs}, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%')
```

To understand how the model arrives at its predictions, we can visualize the attention weights:
```python
def visualize_model_attention(model, image_path):
    # Preprocess the image (assumes a preprocess_image helper that
    # returns a normalized CHW tensor for a single image)
    img = Image.open(image_path).convert('RGB')
    img_tensor = preprocess_image(img).unsqueeze(0)

    # Collect attention weights. Attention only returns the output tensor,
    # so we hook attn_drop, whose output is the attention matrix itself.
    model.eval()
    attn_weights = []

    def hook(layer, input, output):
        attn_weights.append(output.numpy())

    handles = []
    for blk in model.blocks:
        handles.append(blk.attn.attention.attn_drop.register_forward_post_hook(hook))

    with paddle.no_grad():
        _ = model(img_tensor)

    # Remove the hooks
    for h in handles:
        h.remove()

    # Visualization
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.imshow(img)
    plt.title("Original Image")

    plt.subplot(1, 2, 2)
    mean_attn = np.mean(attn_weights[-1], axis=1)[0]  # last layer, averaged over heads
    plt.imshow(mean_attn, cmap='hot')
    plt.title("Attention Heatmap (Last Layer)")
    plt.colorbar()
    plt.show()
```

7. Practical Tips and Optimizations
When applying ViT in real projects, a few key techniques can noticeably improve model performance:
- Learning rate warmup: ViT training usually benefits from a warmup phase
```python
def create_optimizer(model, lr=1e-4, warmup_steps=10000):
    scheduler = paddle.optimizer.lr.LinearWarmup(
        learning_rate=lr,
        warmup_steps=warmup_steps,
        start_lr=1e-6,
        end_lr=lr
    )
    optimizer = paddle.optimizer.AdamW(
        learning_rate=scheduler,
        parameters=model.parameters(),
        weight_decay=0.05
    )
    return optimizer, scheduler
```

- Mixed-precision training: significantly reduces memory usage and speeds up training
```python
scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

with paddle.amp.auto_cast():
    output = model(data)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

- Data augmentation strategies: especially important for ViT
```python
from paddle.vision.transforms import (Compose, RandomResizedCrop,
                                      RandomHorizontalFlip, ToTensor, Normalize)

# ToTensor converts the PIL image to a CHW float tensor before normalization
train_transform = Compose([
    RandomResizedCrop(224, scale=(0.8, 1.0)),
    RandomHorizontalFlip(),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
```

- Fine-tuning tips: when fine-tuning a pretrained ViT on a small dataset
```python
# Freeze everything except the classification head
for name, param in model.named_parameters():
    if "head" not in name:
        param.stop_gradient = True  # frozen parameters receive no gradients

# Use a smaller learning rate
optimizer = paddle.optimizer.AdamW(
    learning_rate=1e-5,
    parameters=model.parameters(),
    weight_decay=0.01
)
```

8. ViT Variants and Extended Applications
ViT's success has spawned many improved versions. Here are several variants worth knowing:
DeiT (Data-efficient Image Transformer):
- Improves data efficiency through knowledge distillation (a sketch of the idea follows)
- Better suited to small and medium-sized datasets
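As a rough illustration of the distillation idea (not DeiT's exact recipe, which also uses a dedicated distillation token), the student's loss can mix the ground-truth objective with a term that matches a frozen teacher's predictions:

```python
import paddle.nn.functional as F

def hard_distillation_loss(student_logits, teacher_logits, target):
    # Half the loss follows the true labels, half follows the
    # teacher's hard predictions (simplified from DeiT's formulation)
    teacher_labels = teacher_logits.argmax(axis=1)
    loss_true = F.cross_entropy(student_logits, target)
    loss_teacher = F.cross_entropy(student_logits, teacher_labels)
    return 0.5 * loss_true + 0.5 * loss_teacher
```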
Swin Transformer:
- Introduces hierarchical feature maps
- Reduces complexity by computing attention within shifted windows (see the sketch below)
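The window idea can be sketched in a few lines: instead of attending across all patches, the feature map is cut into fixed-size windows and attention runs independently inside each one (a simplified illustration, not Swin's full implementation):

```python
def window_partition(x, window_size=7):
    # x: [B, H, W, C] paddle feature map -> [B * num_windows, window_size**2, C].
    # Attention inside each window costs O(window_size**4) instead of O((H*W)**2),
    # so total cost grows linearly with image area.
    B, H, W, C = x.shape
    x = x.reshape([B, H // window_size, window_size, W // window_size, window_size, C])
    x = x.transpose([0, 1, 3, 2, 4, 5])
    return x.reshape([-1, window_size * window_size, C])
```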
CrossViT:
- Combines patches at different scales
- Fuses multi-scale features through cross-attention
MobileViT:
- Lightweight design
- Combines the strengths of CNNs and Transformers
In real projects, choosing the right ViT variant for the task matters. For example, MobileViT may be the better choice for mobile deployment, while Swin Transformer may deliver stronger performance for research workloads with ample compute.